Press "Enter" to skip to content

Curated SQL Posts

Azure Open Datasets

Jen Stirrup is pleased that Microsoft is bringing back open datasets:

Nearly three years ago, I complained bitterly about the demise of Windows Datamarket, which aimed to provide free, stock datasets for any and every purpose. I was a huge fan of the date dimension and the geography dimension, since they really helped me to get started with data warehousing.

So I’m glad to say that the concept is back, revamped and rebuilt for today’s data scientists. Azure Open Datasets will be useful to anyone who wants data for any reason: perhaps for learning, for demos, or for improving machine learning accuracy.

Go check it out.

Creating Temp Staging Tables to Avoid Spooling

Bert Wagner shows how you can create your own tables in tempdb to avoid eager or lazy spools:

SQL Server Spool operators are a mixed bag. On one hand, they can negatively impact performance when writing data to disk in tempdb. On the other hand, they allow filtered and transformed result sets to be temporarily staged, making it easier for that data to be reused again during that query execution.

The problem with the latter scenario is that SQL Server doesn’t always decide to use a spool; often it’s happy to re-read (and re-process) the same data repeatedly. When this happens, one option you have is to explicitly create your own temporary staging table that will help SQL Server cache data it needs to reuse.

The other problem with spooling is that the spool doesn’t have indexes and so performance can be awful. When I look at an execution plan, one of my immediate red flags is spooling: if we have that, removing it is one of the first candidates for optimization after the trivial stuff (expected scan/seek behavior, “fat pipes” from excessive row counts, residual I/O, etc.).
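
To make the idea concrete, here is a minimal sketch of the pattern (my own toy example against a hypothetical dbo.Users table, not Bert's code): stage the filtered result once, index it, and reuse it instead of hoping the optimizer adds a spool for you.

SELECT u.Id, u.DisplayName, u.Reputation
INTO #HighRepUsers
FROM dbo.Users AS u
WHERE u.Reputation >= 100000;

-- Unlike a spool, a temp table can be indexed to support later reads.
CREATE CLUSTERED INDEX cx_HighRepUsers ON #HighRepUsers (Id);

-- Subsequent statements reuse the staged rows instead of
-- re-reading and re-filtering dbo.Users each time.
SELECT COUNT(*) FROM #HighRepUsers;
SELECT MAX(r.Reputation) FROM #HighRepUsers AS r;

DROP TABLE #HighRepUsers;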

Splitting Arrays with OPENJSON

Dave Mason continues a journey into parsing JSON with T-SQL:

Starting with SQL Server 2016, Microsoft provided a STRING_SPLIT function. It is a table-valued function that splits a string into rows of substrings, based on a specified separator character. It’s been a welcome addition that we waited a long time for. It has one shortcoming, though: the order of the output rows is not guaranteed to match the order of the substrings in the input string.

Microsoft also provided support for parsing JSON data starting with SQL Server 2016. I discovered the OPENJSON function can be used to split strings, and it can also return the ordinal position of each substring from the original input string.

There are some limitations which you’d expect, namely around requirements for valid JSON.
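
As a rough illustration of the technique (a sketch along the lines Dave describes, assuming SQL Server 2016+ and input without embedded quotes or backslashes), you can wrap a delimited string in JSON array syntax and let OPENJSON hand back each element along with its zero-based position:

DECLARE @list nvarchar(max) = N'alpha,bravo,charlie';

-- For strings that might contain quotes or backslashes,
-- run the input through STRING_ESCAPE first.
SELECT
    [key]   AS OrdinalPosition,  -- zero-based position within the array
    [value] AS Element
FROM OPENJSON(N'["' + REPLACE(@list, N',', N'","') + N'"]')
ORDER BY CAST([key] AS int);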

Horizontal Labels with ggplot

Michael Toth shows us how to ensure we use horizontal text labels in ggplot:

There are several things we could do to improve this graph, but in this guide let’s focus on rotating the y-axis label. This simple change will make your graph so much better. That way, people won’t have to tilt their heads like me to understand what’s going on in your graph:

It may not seem like much when you’re creating the visual, but it can make a difference for a viewer.

Repeated Cross-Validation in R

Ludvig Olsen walks us through a couple of nice R packages:

The benefits of using groupdata2 to create the folds are 1) that it allows us to balance the ratios of our output classes (or simply a categorical column, if we are working with linear regression instead of classification), and 2) that it allows us to keep all observations with a specific ID (e.g. participant/user ID) in the same fold to avoid leakage between the folds.

The benefit of cvms is that it trains all the models and outputs a tibble (data frame) with results, predictions, model coefficients, and other sweet stuff, which is easy to add to a report or do further analyses on. It even allows us to cross-validate multiple model formulas at once to quickly compare them and select the best model.

Ludvig also gives us some examples of how both packages can help you out. H/T R-Bloggers

Using Power Query to Expand Out Missing Dates

Matt Allington solves a problem in Power Query:

Suppose you have data in the form of dates (not consecutive) with a value for each of the dates (see the table below left side). You need to expand the rows of the table (create the missing rows) so that you will have all the consecutive dates in the given range and each of the dates has the previous updated value (see the table below right side).

The solution has a pretty large number of steps but is straightforward.
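
Matt's solution is pure Power Query, but for comparison, here is a rough T-SQL sketch of the same gap-filling idea (the sample table and values are mine): generate every date in the range, then carry the most recent value forward.

DECLARE @Sparse table (EventDate date PRIMARY KEY, Amount int);
INSERT @Sparse VALUES ('2019-01-01', 10), ('2019-01-04', 25), ('2019-01-07', 40);

WITH AllDates AS
(
    SELECT MIN(EventDate) AS EventDate FROM @Sparse
    UNION ALL
    SELECT DATEADD(DAY, 1, EventDate)
    FROM AllDates
    WHERE EventDate < (SELECT MAX(EventDate) FROM @Sparse)
)
SELECT
    d.EventDate,
    x.Amount          -- last known value on or before this date
FROM AllDates AS d
OUTER APPLY
(
    SELECT TOP (1) s.Amount
    FROM @Sparse AS s
    WHERE s.EventDate <= d.EventDate
    ORDER BY s.EventDate DESC
) AS x
OPTION (MAXRECURSION 0);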

Rewriting Expensive Updates

Erik Darling takes us through an experiment:

Let’s also say that a bad query is taking part in a modification.

UPDATE u2
SET u2.Reputation *= 2
FROM dbo.Users AS u
JOIN dbo.Users AS u2
    ON CHARINDEX(u.DisplayName, u2.DisplayName) > 0
WHERE u2.Reputation >= 100000
  AND u.Id <> u2.Id;

This query will run for so long that we’ll get sick of waiting for it. It’s really holding up writing this blog post.

Erik rewrites this query a couple of times. Click through to learn what he does and why he does it.
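
For flavor, one generic direction you could take (not necessarily any of Erik's rewrites) is to split the read from the write: stage the matching Ids once in a temp table, then drive the update from that much smaller set, which at least keeps the expensive CHARINDEX join out of the modification itself.

CREATE TABLE #targets (Id int PRIMARY KEY);

INSERT #targets (Id)
SELECT DISTINCT u2.Id
FROM dbo.Users AS u
JOIN dbo.Users AS u2
    ON CHARINDEX(u.DisplayName, u2.DisplayName) > 0
WHERE u2.Reputation >= 100000
  AND u.Id <> u2.Id;

UPDATE u2
SET u2.Reputation *= 2
FROM dbo.Users AS u2
JOIN #targets AS t
    ON t.Id = u2.Id;

DROP TABLE #targets;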

Triggers and Multi-Record Changes

Brent Ozar points out a common problem with trigger design:

When you declare variables and set them using one row from the INSERTED or DELETED virtual table, you have no idea which row you’re going to get. Even worse, sometimes this trigger will update one row, and sometimes it won’t – because it might happen to grab a row with a reputation under 1,000!

It’s an easy mistake to make and one which can have a major impact.
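
Here is a quick illustrative sketch (mine, not Brent's code; the audit table and trigger names are hypothetical) contrasting the single-row anti-pattern with a set-based trigger:

CREATE TABLE dbo.UserAudit (Id int NOT NULL, AuditedAt datetime2 NOT NULL DEFAULT SYSDATETIME());
GO
CREATE TRIGGER dbo.Users_AuditHighRep
ON dbo.Users
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    /* Anti-pattern: grabs ONE arbitrary row from inserted, so a
       multi-row UPDATE may audit the wrong user or none at all.

    DECLARE @Id int, @Reputation int;
    SELECT @Id = Id, @Reputation = Reputation FROM inserted;
    IF @Reputation >= 1000
        INSERT dbo.UserAudit (Id) VALUES (@Id);
    */

    -- Set-based version: correct for 0, 1, or N modified rows.
    INSERT dbo.UserAudit (Id)
    SELECT i.Id
    FROM inserted AS i
    WHERE i.Reputation >= 1000;
END;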

Enabling Large Memory Pages in SQL Server

David Klee talks us through large memory pages:

SQL Server Enterprise Edition can leverage large memory pages to reduce the amount of memory pointers required for larger SQL Server deployments. Reducing the number of pointers makes the database engine more efficient, especially for SQL Servers with greater than 32GB of RAM. A normal memory block is 4KB, and many thousands of pointers are required to manage the memory underneath a larger SQL Server. Large memory pages can change the block size to 2MB, greatly reducing the number of pointers required for memory management.

Read on to see what effect this has, as well as when to use them and—more importantly—when not to use them.
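
If you want a quick check of whether an instance is already using large pages (getting there commonly involves Enterprise Edition, the Lock Pages in Memory privilege, and trace flag 834, though the details and caveats are in David's post), sys.dm_os_process_memory will tell you:

SELECT
    physical_memory_in_use_kb,
    locked_page_allocations_kb,   -- non-zero when Lock Pages in Memory is in effect
    large_page_allocations_kb     -- non-zero when large pages are being used
FROM sys.dm_os_process_memory;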
