Press "Enter" to skip to content

Curated SQL Posts

Avoiding OR Clauses in Joins

Erik Darling wants you to embrace the power of AND:

I had to write some hand-off training about query tuning when I was starting a new job.

As part of the training, I had to explain why writing “complicated logic” could lead to poor plan choices.

So I did what anyone would do: I found a picture of a pirate, named him Captain Or, and told the story of how he got Oared to death for giving confusing ORders.

Click through for a troublesome query and a few ways of rewriting it to be less troublesome. My go-to is typically to rewrite it as two statements with a UNION ALL between them, if I can.
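
As a minimal sketch of that rewrite (the tables and the LegacyCustomerID column here are hypothetical, not Erik's example):

    -- An OR in the join predicate tends to push the optimizer into
    -- scans and poor join choices:
    SELECT o.OrderID, c.CustomerName
    FROM dbo.Orders AS o
    JOIN dbo.Customers AS c
        ON c.CustomerID = o.CustomerID
        OR c.LegacyCustomerID = o.CustomerID;

    -- The same logic as two statements with a UNION ALL between them;
    -- each branch can now get its own index seek. The second branch
    -- filters out rows the first branch already returned, so the
    -- results match the original query.
    SELECT o.OrderID, c.CustomerName
    FROM dbo.Orders AS o
    JOIN dbo.Customers AS c
        ON c.CustomerID = o.CustomerID

    UNION ALL

    SELECT o.OrderID, c.CustomerName
    FROM dbo.Orders AS o
    JOIN dbo.Customers AS c
        ON c.LegacyCustomerID = o.CustomerID
       AND c.CustomerID <> o.CustomerID;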


Importing Performance Analyzer Results into DAX Studio

Marco Russo shows how you can take data from Performance Analyzer and load it into DAX Studio:

DAX Studio helps you navigate through the performance metrics; it improves productivity when executing one or more of the collected queries, reducing the number of copy/paste operations required. In order to use the DAX Studio feature, you must first export the data collected by Performance Analyzer in Power BI Desktop.

Click through to see how it’s done.


Building an ARM Template for Azure Data Factory

Andy Leonard takes the first steps toward building an Azure Data Factory pipeline using Azure Resource Manager templates:

Azure Resource Manager, or ARM, “allows you to provision your applications using a declarative template.” So says the Azure Quickstart Templates page. ARM templates are JSON and allow administrators to import and export Azure resources using varying management patterns. I really like ARM templates for implementing infrastructure as code in Azure. In this post I show a very simple example of how to use ARM templates to export and then import a basic ADF (Azure Data Factory) pipeline.

The sample code doesn’t do that much by itself, but it does open up a new world of automation.
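
For a rough idea of the shape of such a template, here is a minimal sketch declaring an empty Data Factory (this is not Andy's template; the parameter name is a placeholder):

    {
      "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
      "contentVersion": "1.0.0.0",
      "parameters": {
        "factoryName": { "type": "string" }
      },
      "resources": [
        {
          "type": "Microsoft.DataFactory/factories",
          "apiVersion": "2018-06-01",
          "name": "[parameters('factoryName')]",
          "location": "[resourceGroup().location]",
          "identity": { "type": "SystemAssigned" },
          "properties": {}
        }
      ]
    }

Pipelines, linked services, and datasets would appear as child resources under the factory.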


Time Series Modeling with Gluon

Jan Gasthaus, et al., announce a new open source product release:

We are excited to announce the open source release of Gluon Time Series (GluonTS), a Python toolkit developed by Amazon scientists for building, evaluating, and comparing deep learning–based time series models. GluonTS is based on the Gluon interface to Apache MXNet and provides components that make building time series models simple and efficient.

In this post, I describe the key functionality of the toolkit and demonstrate how to apply GluonTS to a time series forecasting problem.

It looks interesting.


Diving Into Index Scans

Hugo Kornelis explains how index scans work in SQL Server:

The logic of the Index Scan operator itself is fairly simple, but the actual actions carried out can vary hugely depending on the type of index being scanned (as defined in the Storage and IndexKind properties). Most of this logic is carried out at the level of the storage engine. Since an understanding of this is important to get a proper understanding of the performance of this operator, the actual actions carried out at the level of the storage engine will be described on this page as well.

The current version of SQL Server (2017) supports four types of index storage. The Storage property distinguishes between RowStore, ColumnStore, and MemoryOptimized; for the latter type only, IndexKind further differentiates this into NonClustered and NonClusteredHash.
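
To make the operator easy to see, here is a minimal sketch (a hypothetical table, not from Hugo's post): with no index covering the predicate column, the engine has no choice but to scan.

    -- Nothing indexes TotalDue, so the plan shows a Clustered Index Scan
    -- (or a Table Scan on a heap): every row is read, and the predicate
    -- is applied as a residual filter.
    SELECT OrderID, OrderDate, TotalDue
    FROM dbo.Orders
    WHERE TotalDue > 1000.00;

The Storage and IndexKind properties Hugo describes show up on the scan operator in the execution plan.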

Scans are an important part of the database engine and knowing how they work helps us understand when they’re the right choice for the job.


An Introduction to Azure Databricks

Brad Llewellyn has an introduction to Azure Databricks:

So, what is Azure Databricks?  To answer this question, let’s start all the way at the bottom of the hole and climb up.  So, what is Hadoop?  Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations.  It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things.  However, Hadoop solutions are extremely complex to manage and develop.  So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems.  Apache Spark is one of these platforms.  You can read more about Apache Hadoop here and here.

It’s Hadoop turtles all the way down.


Choosing Clustered Index Columns

Ed Elliott wades into the clustered index debate:

I have seen this debated in forums spread over the internet for decades, and the advice that we gave ten years ago isn't as valid today as it was then. Ten years ago, memory was considerably smaller, and disks were spinning rust. The advent of SSDs and the ability to get servers with more memory than data, even on large systems, have changed how we should think about designing and maintaining databases.

I generally subscribe to the NUSE philosophy: Narrow, Unique, Static, Ever-Increasing. That generally leads me to selecting identity integers or longs. For junction tables (whose entire purpose is to join two tables together and which never get referenced outside of that), I use the primary key as the clustered index.
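
In schema terms, that philosophy looks something like this hypothetical pair of tables:

    -- Narrow, Unique, Static, Ever-increasing: an identity integer
    CREATE TABLE dbo.Customer
    (
        CustomerID   int IDENTITY(1, 1) NOT NULL,
        CustomerName nvarchar(100)      NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerID)
    );

    -- Junction table: the composite primary key doubles as the
    -- clustered index, since rows are only ever fetched by these keys
    CREATE TABLE dbo.CustomerProduct
    (
        CustomerID int NOT NULL,
        ProductID  int NOT NULL,
        CONSTRAINT PK_CustomerProduct
            PRIMARY KEY CLUSTERED (CustomerID, ProductID)
    );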

In extreme insert scenarios, I can see wanting to maximize fragmentation in order to insert into more pages in the B-tree and avoid hot spot pages.
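
One hedged sketch of that trade-off (a hypothetical logging table): clustering on a random GUID scatters concurrent inserts across many pages, trading fragmentation for less last-page contention.

    CREATE TABLE dbo.EventLog
    (
        -- NEWID() is random, so new rows land all over the B-tree
        -- instead of piling onto a single hot page
        EventID uniqueidentifier NOT NULL
            CONSTRAINT DF_EventLog_EventID DEFAULT NEWID(),
        Payload nvarchar(400) NOT NULL,
        CONSTRAINT PK_EventLog PRIMARY KEY CLUSTERED (EventID)
    );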


Finding High-Resource Queries with Extended Events

Grant Fritchey shows how you can create an extended event which identifies high-CPU queries:

A question that comes up on the forums all the time: which query used the most CPU? You may see variations on memory, I/O, or just resources in general. However, people want to know this information, and it's not readily apparent how to get it.

While you can look at what’s in cache through the DMVs to see the queries there, you don’t get any real history and you don’t get any detail of when the executions occurred. You can certainly take advantage of the Query Store for this kind of information. However, even that data is aggregated by hour. If you really want a detailed analysis of which query used the most CPU, you need to first set up an Extended Events session and then consume that data.
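
As a minimal sketch of such a session (the session name and CPU threshold are placeholders, not Grant's exact definition):

    -- Capture any statement burning more than one second of CPU;
    -- cpu_time is reported in microseconds.
    CREATE EVENT SESSION HighCpuQueries ON SERVER
    ADD EVENT sqlserver.sql_statement_completed
    (
        ACTION (sqlserver.sql_text, sqlserver.database_name)
        WHERE cpu_time > 1000000
    )
    ADD TARGET package0.event_file (SET filename = N'HighCpuQueries')
    WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
    GO

    ALTER EVENT SESSION HighCpuQueries ON SERVER STATE = START;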

Click through for the script.


Power Query List Expansion Problems

Chris Webb goes over an issue with an attempt to expand a column of lists in M:

The approach I took was the one that seemed natural to me at the time:

1. Use the Folder data source to connect to the folder containing the image files
2. Define a function called SplitText that takes a long piece of text and splits it up into a list of text values no longer than 30000 characters
3. Call the function once per row on the table returned by step (1)
4. Use the Expand/Aggregate button to expand the new column created by step (3) and get a table with one row for each of the split-up text values

When I ran this query, though, I caught sight of something that is every Power Query developer’s worst nightmare:

Read on for more. Also, drop by to congratulate Chris on collecting a blue badge.


Securing Power BI

Andy Jones has 10 tips for securing your Power BI infrastructure:

9. Sharing Externally

Power BI offers the ability to share reports outside of your organisation or even publish to the public internet. If this causes you to shudder, turn these features off. Your Power BI admin (remember them from above) should open the admin portal and move a slider – problem solved.

Turn off sharing externally (unless needed)

Click through for the full list.
