Month: July 2022

Text Clustering with Python

Luke Menzies takes us through the gensim library:

An interesting branch of machine learning is Natural Language Processing (NLP). As the name suggests, it involves training machines to detect patterns in language using algorithms. NLP is quite often referred to as text analytics, but it is actually more impressive than that. It examines vectorised patterns, looking not only at the positioning of elements but also at what each element means in the context of neighbouring elements within the vector. In a nutshell, this technique can be extended beyond text to patterns of linguistics in general and even contextual patterns. Nevertheless, its primary use in the machine learning world is to analyse text.

This article will focus on an interesting application of NLP which involves the clustering of text. Clustering is a popular unsupervised machine learning technique used for segmentation or grouping of data. It is a very powerful tool that is used across a variety of industries. However, it is rare to hear of clustering being applied to text. This can be achieved using NLP functions, combined with clustering algorithms that can handle non-Euclidean distances.

Read on for an overview of the process and an example of combining DBSCAN with word2vec to cluster phrases.
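
As a rough illustration of the idea in the excerpt (density-based clustering over phrase vectors with a non-Euclidean distance), here is a minimal sketch that uses only the Python standard library. It is not Luke's code: the phrases are invented, the three-dimensional vectors are hand-made stand-ins for averaged word2vec embeddings, and `cosine_distance`, `dbscan`, `eps`, and `min_samples` are names chosen for this example rather than gensim or scikit-learn APIs.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: a non-Euclidean measure often used with embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors, eps, min_samples, distance=cosine_distance):
    """Minimal DBSCAN. Returns one label per vector; -1 means noise."""
    n = len(vectors)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels


# Hypothetical phrase vectors (stand-ins for averaged word2vec embeddings).
phrases = {
    "reset my password": [0.90, 0.10, 0.00],
    "forgot login password": [0.80, 0.20, 0.10],
    "change account password": [0.85, 0.15, 0.05],
    "track my parcel": [0.10, 0.90, 0.10],
    "where is my delivery": [0.00, 0.80, 0.20],
    "parcel delivery status": [0.05, 0.85, 0.15],
    "best pizza topping": [0.30, 0.30, 0.90],
}

labels = dbscan(list(phrases.values()), eps=0.05, min_samples=3)
for phrase, label in zip(phrases, labels):
    print(f"{label:>2}  {phrase}")
# labels -> [0, 0, 0, 1, 1, 1, -1]
```

With these toy vectors, the password phrases and the delivery phrases each form a cluster, and the unrelated phrase is left as noise (-1). In practice the vectors would come from a trained embedding model, and a library implementation of DBSCAN would likely replace the hand-rolled one.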

Comments closed

Query Splitting in Entity Framework

Guy Glantser doesn’t pull punches:

Recently, while working with a customer and tuning some queries, we spotted a query that seemed odd. Something about it wasn’t right. After some more investigation, the developer recognized the query as one generated by Entity Framework using the SplitQuery feature. This was new to me. It’s the first time I encountered this feature, so I went to learn about it.

Now that I know what it is and how it works, I can tell you that it’s a terrible feature in most cases. Developers should avoid using it, unless there is a good reason to use it (which I doubt).

I thought based on the title that this was something totally different. Reading what Guy has to say about it, I fully agree.
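
The excerpt does not show the SQL involved, so the following is only a sketch of the general trade-off, assuming the feature does what its name suggests: it loads each related collection with its own statement instead of one large join. It uses Python's built-in `sqlite3` rather than Entity Framework or SQL Server, and the `blogs`, `posts`, and `tags` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blogs (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, blog_id INTEGER, title TEXT);
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, blog_id INTEGER, label TEXT);
    INSERT INTO blogs VALUES (1, 'Example blog');
    INSERT INTO posts VALUES (1, 1, 'First'), (2, 1, 'Second'), (3, 1, 'Third');
    INSERT INTO tags  VALUES (1, 1, 'sql'), (2, 1, 'bi'), (3, 1, 'r');
""")

# Single-query shape: one round trip, but joining two child collections
# multiplies the rows (3 posts x 3 tags = 9 rows for a single blog).
single = conn.execute("""
    SELECT b.id, b.name, p.title, t.label
    FROM blogs b
    LEFT JOIN posts p ON p.blog_id = b.id
    LEFT JOIN tags  t ON t.blog_id = b.id
    ORDER BY b.id
""").fetchall()

# Split-query shape: one statement per collection. No row multiplication,
# but three round trips, and nothing here guarantees that the data stays
# unchanged between the statements.
blogs = conn.execute("SELECT id, name FROM blogs ORDER BY id").fetchall()
posts = conn.execute("""
    SELECT b.id, p.title
    FROM blogs b JOIN posts p ON p.blog_id = b.id
    ORDER BY b.id
""").fetchall()
tags = conn.execute("""
    SELECT b.id, t.label
    FROM blogs b JOIN tags t ON t.blog_id = b.id
    ORDER BY b.id
""").fetchall()

print("single query rows:", len(single))
print("split query rows: ", len(blogs) + len(posts) + len(tags))
```

The split shape returns fewer rows in total, but at the cost of extra statements and a weaker consistency story. The sketch cannot show which of those costs drives Guy's verdict, so his post remains the place to go for the actual reasoning.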

Target Areas on a Line Chart

Mara Pereira adds target bands to a line chart:

At the time I could not really find an easy way to achieve this… Until error bars came out!

Don’t be fooled though, it’s still a bit tricky to build a line chart like this, however I found it way easier now than before.

So, you must be thinking now “how did you do that?”.

Well, let’s find out!

The end result looks really nice, though it takes a lot of work to get there.

Data Quality Checks in Power BI

Kristyna Hughes wants to match up data:

Picture this: you have a report in Power BI that someone passes off to you for data quality checks. There are a few ways to make sure your measures match what is in the source data system, but for this demo we are going to use Python and Excel to perform our data quality checks in one batch. In order to do that, we are going to build a Python script that can run Power BI REST APIs, connect to a SQL Server, and connect to Excel to grab the formulas and to push back the quality check into Excel for final review. To find a sample Excel and the final Python script, please refer to my GitHub.

Check out the GitHub repo as well as Kristyna’s very detailed walkthrough.

SQL Server 2022 CTP 2.1 Released

Ajay Jagannathan has a good announcement:

Continuing with our release cadence, we’re excited to announce the release of SQL Server 2022 Community Technology Preview 2.1. Since the first public preview in May 2022, anyone can download SQL Server 2022 CTP2.1 to try the new features in this release.

CETAS and delta table support are nice additions for PolyBase, ones I’ve really wanted on-premises. We also have the official releases of APPROX_PERCENTILE_DISC() and APPROX_PERCENTILE_CONT(), which I can confirm are “good enough” in terms of closeness and way faster than the exact PERCENTILE_DISC() and PERCENTILE_CONT(). If you don’t need exact numbers (and outside of certain financial or legal scenarios, once you get into the millions or billions of rows, you usually don’t need a precise number, just a sufficiently good estimate), they are worth trying.
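
To make the “good enough” trade-off concrete for percentiles, here is a small Python sketch. The first two helpers mirror the usual definitions of discrete and continuous percentiles: the discrete one returns an actual member of the set, and the continuous one interpolates. The third is only a stand-in for an approximate function. It estimates the percentile from a random sample, which is an assumption made for illustration and not a description of how SQL Server computes its approximate percentiles. All function names, the sample size, and the generated data are hypothetical.

```python
import math
import random


def percentile_disc(values, p):
    """Smallest value whose cumulative distribution is >= p."""
    ordered = sorted(values)
    index = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[index]


def percentile_cont(values, p):
    """Linear interpolation around position p * (n - 1) in the sorted values."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def approx_percentile_cont(values, p, sample_size=5_000, seed=0):
    """Stand-in approximation: compute the percentile on a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    return percentile_cont(sample, p)


# Tiny example where the two exact definitions differ.
print(percentile_disc([1, 2, 3, 4], 0.5))  # 2 (an actual member of the set)
print(percentile_cont([1, 2, 3, 4], 0.5))  # 2.5 (interpolated)

# Larger, made-up data set: compare the exact median with the sampled estimate.
rng = random.Random(42)
data = [rng.randint(1, 1_000_000) for _ in range(100_000)]
exact = percentile_cont(data, 0.5)
estimate = approx_percentile_cont(data, 0.5)
print(f"exact={exact:.0f} estimate={estimate:.0f} diff={abs(exact - estimate):.0f}")
```

The estimate sorts only a few thousand values instead of the whole set, which is where the speed comes from. The price is a small error, and that is acceptable whenever a sufficiently good estimate will do.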

Azure Data Studio July 2022 Release

Timi Oshin announces a new set of updates:

The Query Plan Viewer feature continues to add functionality with this release of Azure Data Studio. There are several UX improvements users may notice: the icon to enable the capture of an actual plan has been updated, operator selection is now noted with a solid green line, and the plan labels are updated in the Properties window when plans are compared and the orientation is toggled from horizontal to vertical, and back. We have updated the Command Palette to make it easier to find the commands for execution plans, and while the CTRL + M command still enables actual plan capture for a query window, it no longer executes the selected query (or queries) in the window.

It’s not a huge release in terms of new functionality but there are some improvements to the query plan viewer and its core Visual Studio Code implementation.

Hosting an App on RStudio Connect

Liam Kalita wraps up a series:

So far, we have seen how to create an app using ReactJS and a Plumber API. In part 3, we will show you how to host the application on RStudio Connect (RSC)!

When it comes to hosting the application on RSC we will set the content URL for both the app and API so that they are in the same domain and won’t have this CORS issue.

Read the whole thing.

Running Diagnostic Notebooks via PowerShell

Tracy Boggiano kicks off a notebook:

As part of starting a new job you need a way to get a good inventory of basic information of SQL Server instances. Once you have done what I outlined in this blog post, I find it helpful to run Glenn Alan Berry’s Diagnostic Notebooks against all the instances to get a static point-in-time snapshot of all the properties and some performance information. While dbatools has commands under the Community Tools section for running the data into spreadsheets and creating notebooks for the newest queries, I like to go get Glenn’s because he has all the comments in there of what they mean and links to resources about things. So you can explore that route if you like, but I’ll be manually downloading them from Glenn’s site for that reason. To be able to open the notebooks successfully in ADS, look for the tip on my blog post on Tools I Use on My Jumpbox for opening large notebooks.

Click through for a script Tracy uses to kick off the notebook regardless of the SQL Server version.

Updating Synapse Linked SQL Servers with Azure DevOps

Kevin Chant makes a change:

This post covers how to update both ends of Azure Synapse Link for SQL Server 2022 using Azure DevOps, as shown at the Data Toboggan conference.

By the end of this post you will know how to deploy database updates to both the SQL Server database and the Azure Synapse dedicated SQL Pool that are used as part of Azure Synapse Link for SQL Server 2022, using a pipeline in Azure DevOps to keep them consistent.

Click through for the process.
