Press "Enter" to skip to content

Author: Kevin Feasel

Distance Between Strings: Levenshtein Distance

Nikhil Babar has an introduction to the Levenshtein distance algorithm:

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions, or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who discovered this equation in 1965.

Levenshtein distance may also be referred to as edit distance, although it may also denote a larger family of distance metrics. It is closely related to pairwise string alignments.

Read on for an explanation and example.  Levenshtein is a great way of calculating string similarities, possibly helping you with tasks like data cleansing by finding typos or alternate spellings, or matching down parts of street addresses.

Comments closed

Unused Indexes Might Not Be

Tara Kizer has a warning for people eager to drop “unused” indexes:

About 10 years ago, I decided to drop an unused index on a table that had 2 billion rows. The database was around 7TB in size. We were having storage and performance issues. I thought I could help the system out if I were to drop the index.

4 days after I dropped the index, I got a call from our NOC that CPU utilization was pegged at 100% for an hour so they were reaching out to the on-call DBA to check it out. I logged in and saw a query using A LOT of CPU. Its WHERE clause matched the index I had dropped. The execution plan showed it was scanning the table.

It turned out that I only had 2 weeks of uptime, which didn’t include the 1st of the month. The query that was causing the high CPU utilization was a report that ran on the 1st of each month.

Tara has also provided us with a script to track these details over time, so check that out.

Comments closed

Azure SQL Managed Instance Inventory Analysis

Jovan Popovic shows us how to use the Azure CLI plus JSON support in SQL Server to manage a list of Azure SQL Managed Instances:

Sometime you would need to know how many Managed Instance you have created in Azure cloud. Although you can find all information about the Azure SQL Managed Instances in Azure portal or API (ARM, PowerShell, Azure CLI), sometime it is hard to list all instances and search them using some criteria. In this post you will see how easily you can load list of your Managed Instances and build inventory of your resources.

Problem

Imagine that you have a large number of Managed Instances and you need to know how many instances you have, in what regions, subnets, and virtual networks they are placed, how much compute and storage is allocated to each of them, etc. Analyzing inventory of Managed Instances might be hard if you just use PowerShell.

Click through for the solution.

Comments closed

Testing Azure SQL Database Failover

Arun Sirpal shows us how easy it is to fail over Azure SQL Database to another region:

So what happens now if I connect to the read/write endpoint? (I test this via SSMS)

The dreaded IP address / create a new firewall rule message. Why? Well this setup utilized a “server” level firewall rule and the server in the US did NOT have the IP address mapped in, you can see from the below screen shot that there are no firewall rules configured.

Fixing this is easy, you could just add the IP address on the secondary server as another server level rule but you should seriously consider using a database level firewall rule, the setup will get replicated to the secondary server making failover experience smoother.

Read on to see how to set this.

Comments closed

Checking AD Group Membership

Matthew McGiffen shows us a way to test if an account is a member of an Active Directory group:

As part of my job I manage a bunch of SQL instances for Development and Test.

Access is managed though Active Directory groups, so I rarely have to do anything regards managing permissions. Nonetheless I often get requests from people to give them access. This is usually for a new starter or someone who has moved from one team to another.

Of course, the answer is usually that they just need adding to the right AD group. Rather than assume though, I always get them to check before I pass the request on to the AD team. You never know, there could be something else wrong.

You can also use xp_logininfo if you want to go the other way and get all of the members of a particular group.

Comments closed

Dealing With International Addresses

Danielle Grenier shows us how Etsy handles internationalization of addresses:

It starts off easy – you fill in your name, your street, your apartment number, and then you reach the field labelled “Post Town”. What is a “Post Town”? Huh. Next you see “County”. Well, you know what a county is, but since when do you list it in your address? Then there’s “Postal code”. You might recognize it as what the US calls a “zip code”, but it’s still confusing to see.

So now you don’t really know what to do, right? Where do you put your city, or your state? Do you just cram your address into the form however you can and hope that you get your order? Or do you abandon your cart and decide not to buy anything from this site?

This is more application-level code than data (and so glides over the underlying database work), but Danielle does link to some good resources for figuring out what non-US addresses look like.

Comments closed

Packages And Functions In R

Ellen Talbot walks us through some of the basics of using packages and functions in R:

A function does some computation on an object. The use of a function consists of:

  1. A function’s name
  2. Parentheses
  3. 0 or more inputs

Each input is provided to an argument or parameter within a function.

These arguments have names, although you don’t often need to provide the names.

You can find out what arguments a function takes by using the code completion and it’s help snippet, or by searching for the function in the Rstudio Help tab.

When you’re inside the brackets of a function you can get the list of available arguments and auto-complete them.

Ellen also includes some useful R libraries for working with and visualizing data.

Comments closed

Building A Basic Kafka Producer

M. Mallikarjun shows us a simple producer in Kafka:

A Kafka producer is an application that can act as a source of data in a Kafka cluster. A producer can publish messages to one or more Kafka topics.

So, how many ways are there to implement a Kafka producer? Well, there are a lot! But in this article, we shall walk you through two ways.

  1. Kafka Command Line Tools
  2. Kafka Producer Java API

You can write producers in quite a few languages.  Java is the example here, but there are several libraries, including a good one for .NET.

Comments closed

What You Can Learn At SQL Saturday

Nate Johnson shares a few things he picked up at the SQL Saturday in San Diego:

This was an interesting and even slightly entertaining session presented by Max @ SQLHA. One analogy that really stood out to me was this:

SANs have become a bit like the printer industry — You don’t pay a lot for the enclosure, the device itself, i.e. the SAN box & software; but you pay through the nose for ‘refills’, i.e. the drives that your SAN vendor gods deem worthy of their enclosure.

It’s frighteningly accurate. Ask your storage admin what it costs to add a single drive (or pair of drives, if you’re using something with built-in redundancy) to your SAN. Then compare that cost with the same exact drive off the retail market. It’s highway robbery. And we’re letting them get away with it because we can’t evolve fast enough to take advantage of storage virtualization tech (S2DSOFSRDMA) that effectively makes servers with locally attached SSDs a superior architecture. (As long as they’re not using a horribly outdated interface like SAS!)

Nate also includes several more interesting lessons.  SQL Saturdays are great for picking up useful knowledge.

1 Comment

How Join Hints Affect Adaptive Joins

Grant Fritchey looks at the combination of adaptive joins and query hints which specify join type:

I’ve highlighted the interesting bit. “Actual Number of Locally Aggregated Rows” is part of aggregation push down, explained by the amazing Niko Negebauer here and here. Basically, the aggregation is occurring with the data access. So while there is a Hash Match operator for the aggregation, actually, the active part of the aggregation was performed within the columnstore. That’s why the Actual Number of Rows coming out of the columnstore index itself is 0, but the number of rows coming out of the Hash Match Aggregate is 441.

So… why not another aggregate push down when we used the hint? Because the hint says, we MUST use a hash join. At that point the optimizer has no choices on where, when, how it does data processing. It must, first, ensure that a hash join is used, so it does. First thing out of the gate, hash join. Then a hash aggregate. This difference in behavior results in a 24% decrease in performance. The only interesting thing is that the reads remained consistent. This means that it was just the processing of the join that added overhead.

Read the whole thing.

Comments closed