Day: July 28, 2022

Text Clustering with Python

Luke Menzies takes us through the gensim library:

An interesting branch of machine learning is Natural Language Processing (NLP). As the name suggests, it involves training machines to detect patterns in language using algorithms. NLP is quite often referred to as text analytics, but it is actually more impressive than that. It examines vectorised patterns, looking not only at the positioning of elements but also at what each element means in the context of neighbouring elements within the vector. In a nutshell, this technique can be extended beyond text to linguistic patterns in general and even contextual patterns. Nevertheless, its primary use in the machine learning world is to analyse text.

This article will focus on an interesting application of NLP which involves the clustering of text. Clustering is a popular unsupervised machine learning technique used for segmentation or grouping of data. It is a very powerful tool that is used across a variety of industries. However, it is rare to hear of clustering being applied to text. This can be achieved using NLP functions, combined with clustering algorithms that can handle non-Euclidean distances.

Read on for an overview of the process and an example of combining DBSCAN with word2vec to cluster phrases.
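
As a rough illustration of the idea (not the code from Luke's article), here is a minimal DBSCAN sketch in plain Python that groups phrases by cosine distance, a non-Euclidean measure. The phrases and their three-dimensional vectors are made up for the example. In practice the vectors would come from a trained word2vec model, and the `eps` and `min_samples` values are assumptions chosen to suit this toy data.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def dbscan(vectors, eps, min_samples):
    """Return one cluster label per vector; -1 marks noise."""
    n = len(vectors)
    labels = [None] * n  # None = not yet visited

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n)
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep expanding
    return labels

# Hypothetical phrases with hand-made "embeddings" standing in for word2vec output.
phrases = [
    "reset my password", "forgot password", "change password",
    "track my order", "where is my parcel", "order delivery status",
    "opening hours",
]
vectors = [
    [0.90, 0.10, 0.00], [0.80, 0.20, 0.10], [0.85, 0.15, 0.05],
    [0.10, 0.90, 0.10], [0.05, 0.95, 0.00], [0.15, 0.85, 0.10],
    [0.10, 0.10, 0.90],
]

labels = dbscan(vectors, eps=0.1, min_samples=2)
for phrase, label in zip(phrases, labels):
    print(label, phrase)
```

With this toy data the password phrases and the order phrases land in separate clusters, and "opening hours" is left as noise because nothing else lies within `eps` of it.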

Query Splitting in Entity Framework

Guy Glantser doesn’t pull punches:

Recently, while working with a customer and tuning some queries, we spotted a query that seemed odd. Something about it wasn’t right. After some more investigation, the developer recognized the query as one generated by Entity Framework using the SplitQuery feature. This was new to me. It was the first time I had encountered this feature, so I went to learn about it.

Now that I know what it is and how it works, I can tell you that it’s a terrible feature in most cases. Developers should avoid using it, unless there is a good reason to use it (which I doubt).

I thought based on the title that this was something totally different. Reading what Guy has to say about it, I fully agree.

Target Areas on a Line Chart

Mara Pereira adds target bands to a line chart:

At the time I could not really find an easy way to achieve this… Until error bars came out!

Don’t be fooled, though: it’s still a bit tricky to build a line chart like this. However, I find it way easier now than before.

So, you must be thinking now “how did you do that?”.

Well, let’s find out!

The end result looks really nice, though it takes a lot of work to get there.

Data Quality Checks in Power BI

Kristyna Hughes wants to match up data:

Picture this: you have a report in Power BI that someone passes off to you for data quality checks. There are a few ways to make sure your measures match what is in the source data system, but for this demo we are going to use Python and Excel to perform our data quality checks in one batch. In order to do that, we are going to build a Python script that can run Power BI REST APIs, connect to a SQL Server, and connect to Excel to grab the formulas and to push back the quality check into Excel for final review. To find a sample Excel file and the final Python script, please refer to my GitHub.

Check out the GitHub repo as well as Kristyna’s very detailed walkthrough.
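
To give a feel for the comparison step at the heart of a check like this, here is a minimal sketch in plain Python. It is not Kristyna’s script: the function name, measure names, and values are all hypothetical, and the two dictionaries are hard-coded here, whereas in her walkthrough the values come from the Power BI REST APIs and SQL Server.

```python
import math

def compare_measures(report_values, source_values, rel_tol=1e-9):
    """Compare report measures against source values, one row per measure."""
    results = []
    for name in sorted(set(report_values) | set(source_values)):
        report = report_values.get(name)
        source = source_values.get(name)
        if report is None or source is None:
            status = "missing"   # present on only one side
        elif math.isclose(report, source, rel_tol=rel_tol):
            status = "match"
        else:
            status = "mismatch"
        results.append(
            {"measure": name, "report": report, "source": source, "status": status}
        )
    return results

# Hypothetical values standing in for query results from each system.
report_values = {"Total Sales": 1250.50, "Order Count": 42, "Average Discount": 0.125}
source_values = {"Total Sales": 1250.50, "Order Count": 41, "Return Count": 3}

checks = compare_measures(report_values, source_values)
for row in checks:
    print(f"{row['measure']}: {row['status']}")
```

Using a tolerance rather than strict equality avoids flagging floating-point rounding differences between systems; how tight that tolerance should be is a judgment call for each measure.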

SQL Server 2022 CTP 2.1 Released

Ajay Jagannathan has a good announcement:

Continuing with our release cadence, we’re excited to announce the release of SQL Server 2022 Community Technology Preview 2.1. Since the first public preview in May 2022, anyone can download SQL Server 2022 CTP2.1 to try the new features in this release.

CETAS and delta table support are nice additions for PolyBase, ones I’ve really wanted on-premises. We also have the official releases of APPROX_PERCENTILE_DISC() and APPROX_PERCENTILE_CONT(), which I can confirm are “good enough” in terms of closeness and way faster than calculating exact percentiles with PERCENTILE_DISC() and PERCENTILE_CONT(). Outside of certain financial or legal scenarios, once you get into the millions or billions of rows, you usually don’t need a precise number, just a sufficiently good estimate.

Azure Data Studio July 2022 Release

Timi Oshin announces a new set of updates:

The Query Plan Viewer feature continues to add functionality with this release of Azure Data Studio. There are several UX improvements users may notice: the icon to enable the capture of an actual plan has been updated, operator selection is now noted with a solid green line, and the plan labels are updated in the Properties window when plans are compared and the orientation is toggled from horizontal to vertical, and back. We have updated the Command Palette to make it easier to find the commands for execution plans, and while the CTRL + M command still enables actual plan capture for a query window, it no longer executes the selected query (or queries) in the window.

It’s not a huge release in terms of new functionality but there are some improvements to the query plan viewer and its core Visual Studio Code implementation.
