Press "Enter" to skip to content

Curated SQL Posts

Finding Duplicate Images with Filetable

Louis Davidson continues a discussion of SQL Server’s filetable feature:

In my article “Filetables, Great For Business, Great For Hobbyist”, I implemented the basis of a hobby database I use for my Twitter accounts (@disneypicaday and @dollywoodp), where I post daily pictures of theme parks.

As my inventory of prepped pictures has grown, one thing has become a major pain. Duplicates. I have tens of thousands of pictures, and I scan them occasionally to add to my directories. Perhaps not surprisingly, the same pictures often get chosen a second time. I use a naming standard that integrates the files into my database, as well as for the copies to go out for a given day. So the second (and third and fourth) time I choose the same picture, it looks different, though it has the exact same bits.

There are some nice benefits to maintaining file metadata in a queryable format, though do read to the end to see the performance impact.

Comments closed

The Costs and Benefits of Dirty Reads

Chad Callihan explains what a dirty read is and does a cost-benefit analysis on it:

When you are not careful with your transaction isolation levels or you get sneaky with the NOLOCK hint, one problem you can encounter is a dirty read. Let’s look at a short example to demonstrate a dirty read.

In a vacuum, I’m not necessarily opposed to the idea of dirty reads because you can find legitimate cases in which they can be useful. In practice, I’m generally very much in opposition because of two reasons: first, Read Committed Snapshot Isolation eliminates the majority of those reasons; and second, because the misuse is almost always in the direction of over-use of NOLOCK hints.

Comments closed

Estimating Nonclustered Index Sizes with Powershell

Alex Stuart performs some calculations:

I recently encountered a requirement to estimate the size of (a lot of) nonclustered indexes on some very large tables due to not having a test box to create them on or the time to build one. I couldn’t find a script to do this, and as any programmer knows, laziness is the real mother of invention, so I wrote one.
This post summarises how we can calculate this and provides a Powershell function to do so.

I used Microsoft’s documentation as a basis on how to do it, but as it’s a wall of text that’s a little tricky to follow, I’ll go over the basics here. I’m only covering the leaf levels and non-MAX columns that would create LOB pages – I’ll explain why later.

Click through for the article and to see how Alex’s calculations play out.

Comments closed

A Warning: VPCs and Distributed Database Platforms

Wade Trimmer takes us through a reason why you might not want to use VPC endpoints to separate applications from distributed database platforms:

AWS PrivateLink (also known as a VPC endpoint) is a technology that allows the user to securely access services using a private IP address. It is not recommended to configure an AWS PrivateLink connection with Apache Kafka or Apache Cassandra mainly due to a single entry point problem. PrivateLink only exposes a single IP to the user and requires a load balancer between the user and the service. Realistically, the user would need to have an individual VPC endpoint per node, which is expensive and may not work. 

Using PrivateLink, it is impossible by design to contact specific IPs within a VPC in the same way you can with VPC peering. VPC peering allows a connection between two VPCs, while PrivateLink publishes an endpoint that others can connect to from their own VPC.

Read on to understand how this affects platforms like Cassandra and Kafka.

Comments closed

Fun with Excel Cube Functions

Chris Webb has a new series on Excel cube functions combined with Lambda helper functions. First up, Chris collects items from a set:

This example shows the fundamental problem that has always existed with CubeRankedMember though: in order to show all the items in a set you need to know how many items there are in advance, and populate as many cells with CubeRankedMember formulas as there are items. In this case see how the range B4:B6 contains the numbers 1, 2 and 3; these numbers are used in the formulas in the range C4:C6 to get the first, second and third items in the set.

If a fourth product was added to the table, however, it would not appear automatically – you would have to add another cell with another CubeRankedMember formula in it manually. I’ve seen some workarounds but they’re a bit hacky and require you to know what the maximum possible number of items in a set could ever be. Indeed that’s always been one of the key differences between cube functions and PivotTables: cube functions are static whereas PivotTables can grow and shrink dynamically when the data changes.

Read on for a better answer. Then, check out part 2, which covers generating a dynamic table using Excel cube functions and Lambda helpers:

In the first post in this series I showed how to use the new Excel Lambda helper functions to return an array containing all the items in a set. That isn’t very useful on its own, so in this post I’ll show you how to generate an entire dynamic table using Excel cube functions and Lambda helper functions.

In this post I’ll be using the same source data as in my previous post: a table containing sales data with just two columns.

Comments closed

Configuring Classifiers for Workload Management

Reiss McSporran continues a series on workload management in Azure Synapse Analytics:

So in part 1 we found out what Workload Management was and why we wanted to use it. (If you haven’t read that part, please click here and read that post before you carry on, as we’re jumping in at the deep end!) In this part we’ll be learning how to configure it.

Continuing from the end of the previous post, you have two main options to configure this. First is to assign a user or role to one of the existing system defined workload groups, second is to create a custom workload group and assign a user or role to this instead.

Read on to learn how to create classifiers, what the rules look like for them, and their importance.

Comments closed

Creating a Time Zone Dimension

Jonathan Kehayias has a follow-up to a prior post:

The basis for the approach in this post is going to be the TimeZoneInfo class from .NET, which is “supposed” to be up to date with any time zone DST rules from the Windows Registry. I used a modification of the PowerShell example in the .NET Docs page for the TimeZoneInfo class to build a a flat file that is then bulk loaded to a table in SQL Server using BCP with all the time zone adjustment rules:

There’s a lot of effort in here, but as Jonathan notes, it will provide you the ability to get the right time zone as of a particular date and time. Or at least, maybe—I’m not sure if the Windows registry includes historical time zone shifts such as when the US changed when daylight savings time starts and ends.

Comments closed

Putting a SQL Server Database into Recovery Pending

John Morehouse causes harm to a perfectly good database:

You might be asking why on earth would you want to get a database into an undesirable state, more specifically into a Recovery Pending state.  Well, in my case, I had a client database that had entered into this state due to a failure of storage.  Thankfully, this database was not being used for production, so I had some time to determine the best way to fix the issue.

A phrase that was often used during my time in the fire service, was “Try Before You Pry”.  Does the front door of the house need to be kicked in?  It may be unlocked and by trying before prying (with my boot) I can prevent damage to the door.  In these types of scenarios, this philosophy holds true.  Try things out on non-critical databases will help prevent any further damage.

In this instance, I want to try it before forcing something that might be damaging.  This meant I had to get a test database into a recovering state.  Once it is in the state I need, then I can attempt different methods to recover the database properly.  Once a successful solution is determined, I can then confidently pry against the damaged database in production knowing that I am using a validated solution.

Read on to learn one way to put a database into recovery pending state, but definitely pay attention to the disclaimer. And its twin.

Comments closed

Working with App Secrets in .NET Core

Santosh Hari shows us how to use application secrets when building .NET Core applications:

I was writing a sample dotnetcore console application for a talk because why I felt using a sample aspnet core web app was overkill. The app was connecting to a bunch of Azure cloud and 3rd party services (think Twilio API for SMS or LaunchDarkly API for Feature Flags) and I had to deal with connection strings.

Now I have a nasty habit of “accidentally” checking in connection string and secrets into public GitHub repositories, so I wanted to do this right from the get go.

That’s a bad habit to be in, and Santosh shows us how we can avoid doing that via use of application secrets.

Comments closed

An Introduction to ML.NET

Ivan Matec gives us a walkthrough of the ML.NET library and its Model Builder component:

Before we dive into our example, let’s talk a bit about ML.NET history and its current state.

ML.NET draws its origins from the 2002’s Microsoft Research project named TMSN, which stands for “test mining search and navigation.” Later it was renamed to TLC, “the learning code.” ML.NET war derived from the TLC library. Initially, it was used on internal Microsoft products.

The first publicly available version ML.NET 1.0 was released in 2019. It included the Model Builder add-in and AutoML (Automated Machine Learning) capabilities.

The current version is 1.6.0. More details about all releases can be found on the official ML.NET release page.

ML.NET is not a bad library if you need to do some fairly simple work

Comments closed