Press "Enter" to skip to content

Curated SQL Posts

Automated Cleanup With Query Store

Grant Fritchey discusses Query Store’s automated cleanup and also looks at an interesting question:

Query Store has mechanisms for automatically cleaning your data. It is possible to cause them to break down. While presenting a session about the Query Store recently, I was asked what happened if you set the size of the Query Store below the amount of data currently in the store. I didn’t know the answer, so we tried it. Things got a little weird.

Click through to see how weird.
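
For reference, here is a rough sketch of the settings in play, with a placeholder database name and illustrative values; Grant's experiment amounts to setting MAX_STORAGE_SIZE_MB below what the store already holds:

    -- Placeholder database name and illustrative values
    ALTER DATABASE [YourDatabase]
    SET QUERY_STORE (
        OPERATION_MODE = READ_WRITE,
        MAX_STORAGE_SIZE_MB = 100,                           -- set below current usage to reproduce the scenario
        SIZE_BASED_CLEANUP_MODE = AUTO,                      -- size-based cleanup runs as the store nears its cap
        CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30)   -- time-based cleanup of stale queries
    );

    -- Compare the cap against what the store is actually using
    SELECT actual_state_desc, current_storage_size_mb, max_storage_size_mb
    FROM sys.database_query_store_options;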

Unique Indexes Versus Unique Constraints

Greg Low argues that you should create unique constraints instead of unique indexes whenever possible:

The CREATE INDEX statement is used to do exactly what its name says: it creates an index. But when you say CREATE UNIQUE INDEX, you are doing more than that; you are enforcing a business rule that involves uniqueness.

I have a simple rule on this. Wherever possible, business rules like uniqueness, check values, etc. should be part of the design of the table, and not enforced in an external object like an index.

So, rather than a unique index, I’d rather see a unique constraint on the underlying table.

But that’s where real life steps in. I see two scenarios that lead me to occasionally use CREATE UNIQUE INDEX.

Here’s a third:  creating constraints can cause blocking issues.  If you already have a large table and Enterprise Edition, creating a unique index can be an online operation (unless you have a clustered columnstore index on the table), but a unique constraint is always a blocking activity.
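
A quick sketch of the two forms, using hypothetical table and column names:

    -- Uniqueness declared as a constraint, part of the table's design
    ALTER TABLE dbo.Customers
        ADD CONSTRAINT UQ_Customers_Email UNIQUE (EmailAddress);

    -- The same rule enforced through a unique index; on Enterprise Edition this
    -- can be built online, which is the blocking consideration mentioned above
    CREATE UNIQUE INDEX UX_Customers_Email
        ON dbo.Customers (EmailAddress)
        WITH (ONLINE = ON);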

Backing Up Azure SQL Databases

Arun Sirpal enumerates the options we have for backups of Azure SQL Databases:

If you have a business requirement which has a need to retain database backups for longer than 35 days, then you have an option to use long-term backup retention. This feature utilises the Azure Recovery Services Vault where you can store up to 10 years’ worth of backups for up to 1000 databases per vault and 25 vaults per subscription.

There are some guidelines that you need to follow to successfully set this up:

  • Your vault MUST be in the same region, subscription, and resource group as your logical SQL Server; if not, you will not be able to set this up.

  • Register the vault to the server.

  • Create a protection policy.

  • Apply the above policy to the databases that require long-term backup retention.

Arun also looks at restoration options.

The Data Exploration Process

Stacia Varga takes a step back from analyzing NHL data to explore it a little more:

As I mentioned in my last post, I am currently in an exploratory phase with my data analytics project. Although I would love to dive in and do some cool predictive analytics or machine learning projects, I really need to continue learning as much about my data as possible before diving into more advanced techniques.

My data exploration process has the following four steps:

  1. Assess the data that I have at a high level

  2. Determine how this data is relevant to the analytics project I want to undertake

  3. Get a general overview of the data characteristics by calculating simple statistics

  4. Understand the “middles” and the “ends” of your numeric data points

There’s some good stuff in here.  I particularly appreciate Stacia’s consideration of data exploration as an iterative process.
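
If the data happens to sit in SQL Server, steps 3 and 4 can start with a single query; this is only a sketch against a hypothetical dbo.GameStats table:

    -- SELECT DISTINCT collapses the repeated window results to a single row
    SELECT DISTINCT
           COUNT(*)               OVER () AS Games,
           MIN(GoalsScored)       OVER () AS MinGoals,        -- the "ends"
           MAX(GoalsScored)       OVER () AS MaxGoals,
           AVG(GoalsScored * 1.0) OVER () AS MeanGoals,       -- the "middles"
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY GoalsScored) OVER () AS MedianGoals
    FROM dbo.GameStats;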

The Blame Game

Kenneth Fisher has a board game for us:

It’s Monday morning and your manager Brent has called his usual emergency all-employee meeting. He looks more than a little bit unhappy, and this time it’s not because someone stole his cruller. Over the weekend he was demonstrating the new anatomy program Mr. Body to some investors and frankly the performance was miserable! Now Brent has only one question.

Who killed Mr. Body’s performance?

We all know Andy Mallon did it.

Check Constraints To Block Leading And Trailing Spaces

Louis Davidson shows how to use check constraints to block people from inserting records with leading or trailing spaces:

Then, let’s say the requirements are as follows:

1. No values that are either empty or only spaces
2. No leading spaces
3. No trailing spaces
4. Allow NULL if column allows NULL

Let’s look at how we could implement all of these independently, as there certainly are cases where you may wish to allow any or all of the situations in a column.

Click through for the scripts, as well as a time comparison to see how much overhead you’re adding.
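
As a rough sketch of how the four rules can be expressed in one constraint (not necessarily Louis's exact approach; the table and column names are hypothetical):

    ALTER TABLE dbo.Example
        ADD CONSTRAINT CHK_Example_Value_NoEdgeSpaces
        CHECK (
            LEN(Value) > 0                                      -- rule 1: not empty and not only spaces (LEN ignores trailing blanks)
            AND Value NOT LIKE ' %'                             -- rule 2: no leading space
            AND DATALENGTH(Value) = DATALENGTH(RTRIM(Value))    -- rule 3: no trailing spaces
        );
    -- Rule 4 needs no extra logic: when Value is NULL the predicate evaluates to
    -- UNKNOWN, and check constraints allow rows that evaluate to UNKNOWN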

Continuous Processing Mode With Spark Structured Streaming

Joseph Torres, et al, explain how continuous processing mode works with Apache Spark 2.3’s structured streaming:

Suppose we want to build a real-time pipeline to flag fraudulent credit card transactions. Ideally, we want to identify and deny a fraudulent transaction as soon as the culprit has swiped his/her credit card. However, we don’t want to delay legitimate transactions as that would annoy customers. This leads to a strict upper bound on the end-to-end processing latency of our pipeline. Given that there are other delays in transit, the pipeline must process each transaction within 10-20 ms.

Let’s try to build this pipeline in Structured Streaming. Assume that we have a user-defined function “isPaymentFlagged” that can identify the fraudulent transactions. To minimize the latency, we’ll use a 0 second processing time trigger indicating that Spark should start each micro batch as fast as it can with no delays.

They also explain how this newer model differs from the prior model of collecting events in microbatches.

Lazy Evaluation With Scala

Mahesh Chand demonstrates how Scala can use lazy evaluation to reduce memory requirements:

In this blog, we will talk about lazy evaluation in Scala. How can we add efficiency to our application?

Efficiency is achieved not just by running things faster, but by avoiding things that shouldn’t be done in the first place.

In functional programming, lazy evaluation means efficiency. Laziness lets us separate the description of an expression from the evaluation of that expression. This gives us a powerful ability: we may choose to describe a “larger” expression than we need, and then evaluate only a portion of it. There are many ways to achieve lazy evaluation in Scala, such as the lazy keyword, views, and streams.

The fastest operation is the one that doesn’t have to run at all.

Building SQL Server Database Projects With YAML

Gavin Campbell shows how to use Visual Studio Team Services’s new YAML build pipeline to build a SQL Server Data Tools project:

The documentation for how to specify build steps in YAML is still a work in progress. In summary, the current procedure is to visit the VSTS Tasks repo on GitHub, open the folder for the task you are interested in, and take a look at the task.json.

In our case, the first task we need is MSBuild, to build the database project. Looking inside task.json, we can see that the name of the task we need is MSBuild, and that there are a huge number of available inputs we can use to configure the task: solution to specify the project or solution to build, platform, configuration, and many more. In our case, we’ll just specify the path to our .sqlproj file and let msbuild take care of the rest.

Gavin walks us through the entire process step by step, though as he notes, some of these steps are liable to change as the VSTS team continues implementation.

Uploading Files To Azure Blob Storage With Data Factory V2

Ben Jarvis shows how to use Azure Data Factory V2 to upload files from an on-prem server to Azure Blob Storage:

In ADF V2 the integration runtime is responsible for providing the compute infrastructure that carries out data movement between data stores. A self-hosted integration runtime is an on-premises version of the integration runtime that is able to perform copy activities to and from on-premises data stores.

When we configure a self-hosted integration runtime, the data factory service that sits in Azure will orchestrate the nodes that make up the integration runtime through the use of Azure Service Bus. This means the nodes hosted on-prem perform all of our data movement and connect to our on-premises data sources, while being triggered by our data factory pipelines that are hosted in the cloud. A self-hosted integration runtime can have multiple nodes associated with it, which not only caters for high availability but also gives an additional performance benefit, as ADF will use all of the available nodes to perform processing.

Read on for the scripts and full process.
