2019-10-29 – Curated SQL

Plotting Three-Dimensional Linear Models

Linear models are a standard way of predicting or explaining some data. Visualizing data is not only of didactic value but provides heuristic value too, as demonstrated by Anscombe’s Quartet.
Visualizing linear models in 2D is straightforward, but visualizing linear models with more than one predictor is much less so. The aim of this post is to demonstrate some ways to visualize linear models with more than one predictor, using popular R packages. We will focus on 3D examples, that is, two predictors.

I have a strong bias against 3D visuals because they tend to be so difficult to see clearly. There are times when they’re necessary, though.

Joining RDDs in Spark

Published 2019-10-29 by Kevin Feasel

Brad Llewellyn takes us through more Spark RDD and DataFrame exercises, including joins:

We can make use of the built-in .join() function for RDDs. Similar to the .aggregateByKey() function we saw in the previous post, the .join() function for RDDs requires a 2-element tuple, with the first element being the key and the second element being the value. So, we need to use the .map() function to restructure our RDDs to store the keys in the first element and the original array/tuple in the second element. After the join, we end up with an awkward nested structure of arrays and tuples that we need to restructure using another .map() function, leading to a lengthy code snippet.
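
To make the shape of that pattern concrete, here is a minimal sketch in plain Python rather than Spark itself. It mimics the three steps described above (key each row, join, flatten the nested result). The sample rows and the `join_pairs` helper are hypothetical stand-ins, not part of any Spark API.

```python
# Plain-Python stand-in for the RDD pattern: key each row, join, then flatten.
# The rows and column layout below are hypothetical sample data.

people = [(1, "Ann", 10), (2, "Bob", 20), (3, "Cy", 10)]  # (id, name, dept_id)
depts = [(10, "Sales"), (20, "Ops")]                       # (dept_id, dept_name)

# Step 1: the map step, reshaping each row into (key, original_row).
people_kv = [(row[2], row) for row in people]
depts_kv = [(row[0], row) for row in depts]


def join_pairs(left, right):
    """Inner join of (key, value) pairs, giving (key, (left_value, right_value))."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]


# Step 2: the join leaves a nested (key, (row, row)) structure.
joined = join_pairs(people_kv, depts_kv)

# Step 3: a second map flattens the nesting into one tuple per result row.
flat = [(p[0], p[1], d[1]) for _key, (p, d) in joined]
print(flat)
```

The nested `(key, (row, row))` tuples in `joined` are the awkward intermediate structure the excerpt mentions, and the final comprehension plays the role of the second map.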

This is a place where DataFrames make so much more sense.

Cardinality Estimation of Table Variables with Nullable Columns

Published 2019-10-29 by Kevin Feasel

Milos Radivojevic takes us through a quick demonstration of a change in SQL Server 2019:

By using the same formula, the estimated number of rows is:
SELECT 0.001992032*1000000 --1992.032000000
This is exactly what we see in the execution plan. OK, that was CL 140, let’s see how SQL Server 2019 handles this simple case.
When we switch to CL 150, the plan and estimations for the c1 column (non-nullable) are the same. However, the estimation for the nullable column is changed!

Read the whole thing.

Simulating ON DELETE CASCADE

Published 2019-10-29 by Kevin Feasel

Aaron Bertrand has put together a procedure which simulates what a cascading delete would look like given your existing foreign keys:

A user recently asked about an interesting scenario involving tables with foreign key relationships. They wanted to generate the DELETE statements that would allow them to manually delete from child tables first (perhaps in stages), based on criteria they define, and tell them – before performing the deletes – how many rows the operation would affect in each table. They wanted output like this:
DELETE dbo.ChildTable1 WHERE ParentID < <some constant>; -- This would delete 47 row(s).
DELETE dbo.ChildTable2 WHERE ParentID < <some constant>; -- This would delete 14 row(s).
...
DELETE dbo.ParentTable WHERE ID < <some constant>; -- This would delete 11 row(s).
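
As a rough illustration of the idea (not the procedure from the linked post), the sketch below uses Python's built-in `sqlite3` module against a throwaway in-memory database. It looks up the direct children of a parent table from the foreign key metadata, counts the rows each delete would touch, and emits child-first statements. The table contents, the `preview_deletes` helper, and the cutoff value are all assumptions made for the example, and it only handles single-column keys one level deep.

```python
import sqlite3

# Throwaway in-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ParentTable (ID INTEGER PRIMARY KEY);
    CREATE TABLE ChildTable1 (ID INTEGER PRIMARY KEY,
                              ParentID INTEGER REFERENCES ParentTable(ID));
    CREATE TABLE ChildTable2 (ID INTEGER PRIMARY KEY,
                              ParentID INTEGER REFERENCES ParentTable(ID));
""")
conn.executemany("INSERT INTO ParentTable VALUES (?)", [(i,) for i in range(1, 6)])
conn.executemany("INSERT INTO ChildTable1 (ParentID) VALUES (?)",
                 [(1,), (1,), (2,), (4,)])
conn.executemany("INSERT INTO ChildTable2 (ParentID) VALUES (?)", [(2,), (5,)])


def preview_deletes(conn, parent, parent_col, cutoff):
    """Return child-first DELETE statements annotated with affected row counts."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    targets = []
    for table in tables:
        # Each foreign_key_list row is (id, seq, referenced_table, from_col, to_col, ...).
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall():
            if fk[2] == parent:
                targets.append((table, fk[3]))
    targets.append((parent, parent_col))  # the parent is deleted last
    statements = []
    for table, column in targets:
        (count,) = conn.execute(
            f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" < ?',
            (cutoff,)).fetchone()
        statements.append(
            f"DELETE FROM {table} WHERE {column} < {cutoff}; "
            f"-- This would delete {count} row(s).")
    return statements


statements = preview_deletes(conn, "ParentTable", "ID", 3)
for statement in statements:
    print(statement)
```

Nothing is deleted here; the function only counts and reports, which matches the "tell me before performing the deletes" requirement in the excerpt.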

Click through for the solution as well as several caveats.

Improving Post-Join Aggregation Performance in Power Query

Published 2019-10-29 by Kevin Feasel

Imke Feldmann finds some nice performance improvements with aggregating data after a join using Power Query:

When you join a table to another table in Power Query, the UI gives you the option to either expand the columns (default) or aggregate the contents of the joined tables. That’s useful if multiple rows are returned for the rows of the table that has been joined to (left table).
But this method is extremely slow. Compared to “simply” expanding all values to new rows (which took around 5 seconds), the aggregation took around 50 seconds. The automatically generated code uses the “Table.AggregateTableColumn”-function.

Read on to see two separate attempts to speed things up.

Scripting Out Linked Servers with Actual Passwords

Published 2019-10-29 by Kevin Feasel

Ajay Dwivedi shows how you can script out a linked server creation statement which includes actual passwords:

For moving Logins/Users, Microsoft provided the revlogin script, which made it easy to migrate logins without needing to know their passwords. But there is no easy approach for migrating linked servers with the actual password. This is where the dbatools cmdlet Copy-DbaLinkedServer becomes very handy. But what about the situation where we have to script out a linked server beforehand?
For this reason, based on the blog post of Antti Rantasaari, and using his code as the base script, I have created a cmdlet Get-LinkedServer with SQLDBATools module which accepts SqlInstance name as a parameter along with -ScriptOut switch, and gives Drop/Create statements for linked servers present on that local/remote SqlInstance.

As a quick note, SQLDBATools is not the same as dbatools.

Incremental Data Migration to Blob Storage

Published 2019-10-29 by Kevin Feasel

Ginger Daniel has started a series on data migration into Azure Blob Storage:

Part 1 of this article demonstrates how to upload multiple tables from an on-premises SQL Server to an Azure Blob Storage account as CSV files. I covered these basic steps to get data from one place to the other using Azure Data Factory; however, there are many alternative ways to accomplish this, and many details in these steps that were not covered. For a deep dive into the details, you can start at https://docs.microsoft.com/en-us/azure/data-factory/introduction and https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-portal#create-a-pipeline.

Part 1 was chock full of information, and it looks like Part 2 will be as well.

Analyzing DAX DISTINCTCOUNT Performance

Published 2019-10-29 by Kevin Feasel

Marco Russo takes a look at how fast DISTINCTCOUNT is in DAX:

This article shows how to implement the same DISTINCTCOUNT calculation in two alternative ways, measuring and comparing the performance in different reports. You will see that while DISTINCTCOUNT can be implemented using SUMX / DISTINCT, the DISTINCTCOUNT version is usually better. That is, unless the density of the reports is high and the calculation applies filters to the measures that do not correspond to the grouping granularity of the visualization – as is always the case using time intelligence functions. There are cases where SUMX / DISTINCT can offer better performance, but you have to clarify whether optimizing one report might slow down many others. Measuring performance using DAX Studio is the only way to know what to expect for your model and reports.
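
For readers who do not work in DAX, the two formulations can be sketched in plain Python. This is only an analogy for the logic, assuming hypothetical sales rows and helper names; it says nothing about how either version performs in the DAX engine, which is the actual subject of the article.

```python
# Hypothetical sales rows: (month, customer_id).
sales = [("Jan", "A"), ("Jan", "B"), ("Jan", "A"), ("Feb", "B"), ("Feb", "C")]


def distinct_count(rows):
    """Count distinct customers directly, in the spirit of DISTINCTCOUNT."""
    return len({customer for _month, customer in rows})


def sumx_distinct(rows):
    """Iterate the distinct customers and add 1 for each, in the spirit of
    summing a constant over DISTINCT values."""
    distinct_customers = {customer for _month, customer in rows}
    return sum(1 for _ in distinct_customers)


# Evaluate both versions per month (a grouping) and over all rows (the total).
months = ["Jan", "Feb"]
per_month = {
    month: (
        distinct_count([r for r in sales if r[0] == month]),
        sumx_distinct([r for r in sales if r[0] == month]),
    )
    for month in months
}
total = (distinct_count(sales), sumx_distinct(sales))
print(per_month, total)
```

Both versions always agree on the result. Note also that the total is not the sum of the monthly values, because a customer who buys in two months is counted once overall; this is why a distinct count has to be recomputed at every grouping level.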

Read on for the full test.

Day: October 29, 2019

Plotting Three-Dimensional Linear Models

Joining RDDs in Spark

Cardinality Estimation of Table Variables with Nullable Columns

Simulating ON DELETE CASCADE

Improving Post-Join Aggregation Performance in Power Query

Scripting Out Linked Servers with Actual Passwords

Incremental Data Migration to Blob Storage

Analyzing DAX DISTINCTCOUNT Performance