Press "Enter" to skip to content

Month: December 2024

Cosine Similarity in Power Query

John Kerski searches for similar sets:

I’ll admit upfront—I am not a data scientist by trade. Instead, I’ve picked up my data science skills over time, learning through a combination of osmosis from talented colleagues and tackling real-world data challenges. It’s been a journey of trial, error, and refinement, as I’ve worked to bridge gaps between complex data science techniques and tools available to me.

Recently, my skills were put to the test when I needed to compare hundreds of Active Directory and SharePoint Groups to find similarities in their memberships. With only Power Query available in the production environment, no Python or R to ease the process, I faced the task of finding a method to finding similarities from scratch in Power Query. In this guide, I’ll walk you through the solution I developed, highlighting the steps that made it possible.

John came up with a very clever solution. By the way, the way I like to explain cosine similarity (as a concept, not the algorithm itself) is as follows.

Back in high school physics, you probably drew vectors and learned that vectors have a direction and a magnitude (length). We drew vectors in two-dimensional space because that’s easy: it’s a line on a sheet of paper and there’s an arrow at the end to denote the direction of that vector. Conceptually, vectors with more than two dimensions behave exactly the same; the difference is that we cannot simply draw them, especially once we get past three-dimensional space (a vector with three elements). But the concept is still there: every vector has a direction and a magnitude.

We use cosine similarity to compare two vectors and see how close those two vectors are in terms of angle (direction), with the idea being that magnitude isn’t as important as angle for determining vector similarity. This is in contrast to another technique like Euclidean distance, which focuses more on the magnitude of the vectors versus angle.

Leave a Comment

Availability Group Seeding and Transient Failure 108

Chad Callihan runs into an error with an availability group:

The availability group in question was unhealthy, and none of the added databases were syncing. By the time I started investigating, the SQL service on the secondary had been restarted. There were also no recent errors in Failover Cluster Manager.

I checked the SQL Server Error Log and found some clues. The SQL Server Error Log was filled with “Always On: DebugTraceVarArgs” errors for each database that included the message:

“Seeding encountered a transient failure ‘108’, retrying…”

Read on to see how Chad fixed this.

Leave a Comment

Azure SQL Managed Instance Extreme Storage Latency

Kendra Little has another caveat emptor message:

What are your stories of unbelievably bad performance from cloud vendors? I’ll go first. For years, Azure SQL Managed Instance’s General Purpose Tier has documented “approximate” storage latency as being “5-10 ms.” This week they added a footnote: “This is an average range. Although the vast majority of IO request durations will fall under the top of the range, outliers which exceed the range are possible.”

How approximate is that 5-10 milliseconds, you might wonder? If you use Azure SQL Managed Instance these days, you will regularly find messages in your SQL Server Error log indicating that all data and log files have experienced latency of up to 60 seconds. At least, 60 seconds is the maximum I’ve observed personally, looking in the logs of several customers’ Managed Instances. Could it be worse? Microsoft hasn’t documented a ceiling. My testing shows that this latency occurs randomly to your workload and is not related to your resource usage: using less IO will not make the errors less likely. You have no way to avoid these storage failures (I don’t see how 15-60 second latency is not a failure), and they can occur anytime.

This is a major strike against SQL Managed Instance General Purpose. Considering the cost of the product, you could buy a new server with direct-attached NVMe storage, have it paid off after one year, have better performance, and get to depreciate the entire expense over a 3-5 year window (something you cannot do with the hardware side of cloud services–you can only depreciate the cost of licensing, assuming you have a 3-year reservation).

2 Comments

Using na.rm in R

Steven Sanderson handles missing information in the best way possible—by ignoring it:

Missing values are a common challenge in data analysis, and R provides robust tools for handling them. The na.rm parameter is one of R’s most essential features for managing NA values in your data. This comprehensive guide will walk you through everything you need to know about using na.rm effectively in your R programming journey.

Read on for several examples of how na.rm works.

Comments closed

Fine-Tuning an Azure AI Model

Tomaz Kastrun updates a generative AI model:

Fine-tuning is the process of optimizing a pretrained model by training it on your specific dataset, which often contains more examples than you can typically fit in a prompt. Fine-tuning helps you achieve higher quality results for specific tasks, save on token costs with shorter prompts, and improve request latency.

Read on to see how you can do this. Note that you’ll need to set up the fine-tuning data in a particular format for whatever model you’re using.

Comments closed

Automating Backups with Minion Backup

I have a new video:

In this video, we will learn how to use Minion Backup to automate backup operations, including setting backup frequency, dynamic backup tuning, and scripting out database restorations.

This video wraps up a series on backup and recovery in SQL Server. Now I have to come up with some new videos.

Comments closed

Column Encryption in SQL Server

Rick Dobson encrypts a column:

I was recently assigned to work on a team who will implement column-level encryption and decryption solutions in SQL Server. Please introduce the basics of SQL Server encryption and decryption features. Also, provide a series of T-SQL samples for encrypting and decrypting data within the columns of a SQL Server table.

Always Encrypted can be better for the job, but sometimes, development or product limitations require using old-school column-level (or “cell-level” as people have called it) encryption. And it still works fine for the purpose of encrypting important data at rest.

Comments closed

Data Professional Annual Survey

Brent Ozar is canvassing for survey participants:

Every year, I run a salary survey to help folks have better discussions with their managers about salaries, benefits, and career progression.

Take the survey now here.

The anonymous survey closes Sunday, January 12th. On Tuesday the 14th, I’ll publish the overall responses on the blog in Excel format so you can do slicing & dicing.

Please do fill out the survey. There are enough years of data at this point that we can do some interesting historical trending with it.

Comments closed

Linked Servers into Azure

Andy Brownsword goes old-school:

Connecting different versions of SQL Server can allow us to combine or transfer data between environments. This can become a challenge when the versions are really different.

Have you tried to connect SQL Server 2008 to a SQL database in Azure? – it can throw up a few curve balls.

In this post we’ll look at how to solve 3 of the issues you might come up against.

When reading the title, my first response was, “But why not use PolyBase?” Then Andy threw the SQL Server 2008 bit at me, and then my response was, “But why not use a product that isn’t nearly old enough to vote?”

Nonetheless, Andy does a great job of demonstrating how this would work, and it can work for later versions of SQL Server as well.

Comments closed