
Author: Kevin Feasel

The Shuffling Operator And Azure SQL DW

Arun Sirpal is ready to deal:

For the purposes of this post the T-SQL shown is elementary (don’t be surprised by that); the point is really about SHUFFLE. So, I select the estimated plan for the following code.

SELECT SOD.[SalesOrderID], SOD.[ProductID], SOH.[TotalDue]
FROM [SalesLT].[SalesOrderDetail] SOD
JOIN [SalesLT].[SalesOrderHeader] SOH
    ON SOH.[SalesOrderID] = SOD.[SalesOrderID]
WHERE SOH.[TotalDue] > 1000;

Shuffle me once, why not shuffle me twice. If you REALLY want to see the EXPLAIN command output, then it looks like this snippet below.

The DSQL operation clearly states SHUFFLE_MOVE. Why am I getting this? What does it mean?

Shuffling data isn’t the worst thing in the world, but it is a fairly expensive operation all things considered.  Ideally, your warehouse architecture limits the number of shuffle operations, but considering that you can only hash on one key, sometimes it’s inevitable.
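To make that concrete, here is a rough sketch (the staging schema and table names are illustrative, not Arun's actual setup) of hash-distributing both tables on the join key so the join is distribution-aligned and the SHUFFLE_MOVE step is no longer needed:

-- Illustrative only: hash-distribute both tables on the join column so the join
-- is distribution-aligned and no data movement is required between distributions.
CREATE TABLE dbo.SalesOrderHeader
WITH (DISTRIBUTION = HASH([SalesOrderID]), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM stage.SalesOrderHeader;

CREATE TABLE dbo.SalesOrderDetail
WITH (DISTRIBUTION = HASH([SalesOrderID]), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM stage.SalesOrderDetail;

-- EXPLAIN returns the distributed plan as XML; with aligned distributions,
-- the SHUFFLE_MOVE operation should no longer appear for this join.
EXPLAIN
SELECT SOD.[SalesOrderID], SOD.[ProductID], SOH.[TotalDue]
FROM dbo.SalesOrderDetail SOD
JOIN dbo.SalesOrderHeader SOH
    ON SOH.[SalesOrderID] = SOD.[SalesOrderID]
WHERE SOH.[TotalDue] > 1000;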

Comments closed

The Most Powerful Force In The Universe, In DAX

Matt Allington talks about compound interest and shows how to calculate it in DAX:

Now back to the point of this article.  Compounding growth is very easy to do in Excel because you can write individual cell formulas (to do whatever you want), and each new formula can reference the answer from the previous formula as the starting point for the new formula. There is no such ability in the DAX language.  To solve such problems in DAX, you have to change the way you think and start to think about how to write a single formula that will work over an entire TABLE of data (or columns or multiple tables) – no cell-by-cell individual formulas are possible.

Below I will step you through the process of finding a solution to this problem.  As I often mention in my articles, it is the process that I believe is most important.  I seldom know how to answer a complex DAX problem when I start out (that’s the very definition of “complex”), and instead I follow a process to help me solve the problem.  Take a careful read below.  If you apply the same process when you write your formulas, you will be well on your way to becoming a DAX Superhero.

It’s an interesting problem when your growth rate is not always the same, but Matt has you covered.
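For reference, the underlying math (my notation, not Matt's): with a constant rate, compound growth is just a power, and when the rate changes each period you multiply the individual period factors instead, which is the shape a DAX formula has to reproduce over an entire table of periods:

FV = PV \times (1 + r)^{n}
FV = PV \times \prod_{t=1}^{n} (1 + r_t)   % when the rate r_t varies by period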

Comments closed

Loading CSVs Into Azure Using dbatools

Stuart Moore has a quick PowerShell script which loads CSV data into Azure SQL Database using dbatools:

To get some of this data usable for reporting we’re importing it into Azure SQL Database so people can start working their way through it, and we can fix up errors before we push it through into Azure Data Lake for mining. Being a fan of dbatools it was my first port of call for automating something like this.

Just to make life interesting, I want to add a time of creation field to the data to make tracking trends easier. As this information doesn’t actually exist in the CSV columns, I’m going to use LastWriteTime as a proxy for the creation time.

Click through for the script.

Comments closed

Spatial Workaround In Azure SQL Data Warehouse

Rolf Tesmer has you covered if you want to perform spatial queries against data in Azure SQL Data Warehouse:

Recently we had a requirement to perform SQL Spatial functions on data that was stored in Azure SQL DW.  Seems simple enough as spatial has been in SQL for many years, but unfortunately, SQL Spatial functions are not natively supported in Azure SQL DW (yet)!

If interested – this is the link to the Azure Feedback feature request to make this available in Azure SQL DW – https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10508991-support-for-spatial-data-type

AND SO — to use spatial data in Azure SQL DW we need to look at alternative methods.  Luckily a recent new feature in Azure SQL DB, in the form of Elastic Query to Azure SQL DW, now gives us the ability to perform these SQL Spatial functions on data within Azure SQL DW via a very simple method!

Check out that Azure Feedback item if you’d like to see native spatial support rather than using elastic query.  In the meantime, click through to see Rolf’s workaround.
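As a rough sketch of what that elastic query bridge can look like (server, database, table, and credential names below are all placeholders, and this is one plausible shape rather than Rolf's exact setup): define an external data source in Azure SQL DB that points at the data warehouse, expose the DW table as an external table with the coordinates stored as plain floats, and then run the spatial functions on the SQL DB side.

-- Run in Azure SQL DB (not in the DW). All names are placeholders.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL DwCredential
WITH IDENTITY = '<sql login>', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE DwSource WITH
(
    TYPE = RDBMS,
    LOCATION = 'myserver.database.windows.net',
    DATABASE_NAME = 'MyDataWarehouse',
    CREDENTIAL = DwCredential
);

-- Spatial types are not available in the DW, so the DW table keeps raw coordinates.
CREATE EXTERNAL TABLE dbo.StoreLocations
(
    StoreId   INT,
    Latitude  FLOAT,
    Longitude FLOAT
)
WITH (DATA_SOURCE = DwSource);

-- The geography functions execute in Azure SQL DB over data pulled from the DW.
SELECT StoreId,
       geography::Point(Latitude, Longitude, 4326).STDistance(
           geography::Point(-33.87, 151.21, 4326)) / 1000.0 AS DistanceKm
FROM dbo.StoreLocations;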

Comments closed

Expanded Tables In DAX

Alberto Ferrari has a great explanation of how the concept of expanded tables works in DAX:

If you are coming from an SQL background, or if you are used to relational databases, you probably think that RELATED follows relationships. Thus, to compute the Month column, you would think that the engine followed a relationship between Sales and Date and obtained the value of the month by performing a lookup on the Date table.

DAX is different. Date[Month] belongs to the expanded version of Sales. There is a value for RELATED(Date[Month]) because Sales was expanded to include Date using a relationship.
RELATED requires a row context to be active. If you remove the row context of the calculated column, then RELATED no longer works.

This post cleared up a couple of ideas in my head, so check it out.
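For contrast, the SQL-style lookup that the quote says people expect RELATED to perform would look something like this (a hypothetical Sales/Date star schema, purely for illustration):

-- Hypothetical tables: the relational "lookup" mental model the quote contrasts
-- with DAX table expansion, where Date[Month] becomes part of the expanded Sales table.
SELECT s.SalesKey,
       s.Amount,
       d.[Month]   -- retrieved by explicitly following the relationship to Date
FROM dbo.Sales AS s
JOIN dbo.[Date] AS d
    ON d.DateKey = s.OrderDateKey;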

Comments closed

Matching Order In R

John Mount shows off the match_order function in wrapr:

Often we wish to work with such data aligned so each row in d2 has the same idx value as the same row (by row order) as d1. This is an important data wrangling task, so there are many ways to achieve it in R, such as base::merge(), dplyr::left_join(), or by sorting both tables into the same order and then using base::cbind().

However if you wish to preserve the order of the first table (which may not be sorted), you need one more trick.

Click through to see that one additional trick.

Comments closed

Building A Model: Lumping And Splitting

Anna Schneider and Alex Smolyanskaya explain some of the tradeoffs between lumping groups together and splitting them out when it comes to algorithm selection:

At Stitch Fix, individual personalization is in our DNA: every client is unique and every piece of clothing we send is chosen to be just right. When we buy merchandise, we could choose to lump clients together; algorithms trained on lumped data would steer us toward that little black dress or those comfy leggings that delight a core, modal group of clients. Yet when we split clients into narrower segments and focus on the tails of the distribution, the algorithms have the chance to also tease out that sleek pinstripe blazer or that pair of distressed teal jeans that aren’t right for everyone, but just right for someone. As long as we don’t split our clients so finely that we’re in danger of overfitting, and as long as humans can still understand the algorithm’s recommendations, splitting is the way to go.

In other cases lumping can provide action-oriented clarity for human decision-makers. For example, we might lump clients into larger groups when reporting on business growth for a crisp understanding of holistic business health, even if our models forecast that growth at the level of finer client splits.

Read on and check out their useful chart for figuring out whether lumping or splitting is the better idea for you.

Comments closed

Closure Tables: Graph Data In Relational Form

Phil Factor shows how to use the concept of closure tables to represent graph-style data in a relational database:

Closure tables are plain ordinary relational tables that are designed to work easily with relational operations. It is true that useful extensions are provided for SQL Server to deal with hierarchies. The HIERARCHYID data type and the common language runtime (CLR) SqlHierarchyId class are provided to support the Path Enumeration method of representing hierarchies and are intended to make tree structures represented by self-referencing tables more efficient, but they are likely to be appropriate for some but not all the practical real-life hierarchies or directories. As well as path enumerations, there are also the well-known design patterns of Nested Sets and Adjacency Lists. In this article, we’ll concentrate on closure tables.

A directed acyclic graph (DAG) is a more general version of a closure table. You can use a closure table for a tree structure where there is only one trunk, because a branch or leaf can only have one trunk. We just have a table that has the nodes (e.g. staff member or directory ‘folder’) and edges (the relationships). We are representing an acyclic (no loops allowed) connected graph where the edges must all be unique, and where there is reflexive closure (each node has an edge pointing to itself).

Take the time to read this one carefully, as I think this model is applicable much more often than it’d appear at first blush.
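As a purely illustrative sketch of the shape Phil describes (the table and column names here are mine, not his), a closure table stores one row per ancestor/descendant pair, including the reflexive row for each node, which turns subtree queries into simple joins rather than recursive CTEs:

-- Illustrative schema: one row per (ancestor, descendant) pair,
-- including the reflexive row (Depth = 0) for every node.
CREATE TABLE dbo.Node
(
    NodeId   INT           NOT NULL PRIMARY KEY,
    NodeName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.NodeClosure
(
    AncestorId   INT NOT NULL REFERENCES dbo.Node (NodeId),
    DescendantId INT NOT NULL REFERENCES dbo.Node (NodeId),
    Depth        INT NOT NULL,
    PRIMARY KEY (AncestorId, DescendantId)
);

-- Every descendant of node 1, however deep, with no recursion required.
SELECT n.NodeId, n.NodeName, c.Depth
FROM dbo.NodeClosure AS c
JOIN dbo.Node AS n
    ON n.NodeId = c.DescendantId
WHERE c.AncestorId = 1
  AND c.Depth > 0;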

Comments closed

Reverse Engineering The Stream Aggregate Algorithm

Itzik Ben-Gan has started a series of articles on optimizing queries which use grouping and aggregating with a reverse-engineering of the stream aggregate algorithm:

As you may already know, when SQL Server optimizes a query, it evaluates multiple candidate plans, and eventually picks the one with the lowest estimated cost. The estimated plan cost is the sum of all the operators’ estimated costs. In turn, each operator’s estimated cost is the sum of the estimated I/O cost and estimated CPU cost. The cost unit is meaningless in its own right. Its relevance is in the comparison that the optimizer makes between candidate plans. That is, the costing formulas were designed with the goal that, between candidate plans, the one with the lowest cost will (hopefully) represent the one that will finish more quickly. A terribly complex task to do accurately!

The more the costing formulas adequately take into account the factors that truly affect the algorithm’s performance and scaling, the more accurate they are, and the more likely that given accurate cardinality estimates, the optimizer will choose the optimal plan. At any rate, if you want to understand why the optimizer chooses one algorithm versus another you need to understand two main things: one is how the algorithms work and scale, and another is SQL Server’s costing model.

So back to the plan in Figure 1; let’s try and understand how the costs are computed. As a policy, Microsoft will not reveal the internal costing formulas that they use. When I was a kid I was fascinated with taking things apart. Watches, radios, cassette tapes (yes, I’m that old), you name it. I wanted to know how things were made. Similarly, I see value in reverse engineering the formulas since if I manage to predict the cost reasonably accurately, it probably means that I understand the algorithm well. During the process you get to learn a lot.

Our query ingests 1,000,000 rows. Even with this number of rows, the I/O cost seems to be negligible compared to the CPU cost, so it is probably safe to ignore it.

As for the CPU cost, you want to try and figure out which factors affect it and in what way.

I give this my highest recommendation.
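If you want a minimal query shape to experiment with yourself (my own toy example, not Itzik's test setup), a GROUP BY over a column backed by an index that delivers rows in grouping order is the classic case where the optimizer chooses a Stream Aggregate, though it may still prefer a Hash Aggregate depending on the estimates:

-- Illustrative only: with an index ordered by CustomerId, the rows arrive
-- pre-sorted on the grouping key, so a Stream Aggregate needs no extra sort.
CREATE INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    INCLUDE (TotalDue);

SELECT CustomerId,
       COUNT(*)      AS OrderCount,
       SUM(TotalDue) AS TotalSpend
FROM dbo.Orders
GROUP BY CustomerId;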

Comments closed

SQL Server 2017 On Linux: The Azure VM Method

Prashanth Jayaram shows how to spin up an Azure VM running SQL Server 2017 on Linux:

To create the VMs, you need to go through these four steps:

  1. Basics to configure the basic settings of the VM

  2. Size to choose the VM machine size

  3. Settings to configure the features. In this case, the default values are used. You just need to click the Next button to proceed further

  4. Purchase

Once it’s running, Prashanth shows how to connect via PuTTY and configure the service.
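Once connected (whether via PuTTY and sqlcmd or from SSMS on a Windows machine), a quick sanity check along these lines confirms you are indeed on the Linux build:

-- On SQL Server 2017 on Linux, the @@VERSION output includes the Linux distribution.
SELECT @@VERSION AS SqlServerVersion;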

Comments closed