# Day: October 29, 2019

Linear models are a standard way of predicting or explaining some data. Visualizing data is not only of didactical value but provides heuristical value too, as demonstrated by Anscombe’s Quartet.

Visualizing linear models in 2D is straightforward, but visualizing linear models with more than one predictor is much less so. The aim of this post is to demonstrate some ways do visualize linear models with more than one predictor, using popular R packages. We will focus on 3D examples, that is, two predictors.

I have a strong bias against 3D visuals because they tend to be so difficult to see clearly. There are times when they’re necessary, though.

We can make use of the built-in .join() function for RDDs.  Similar to the .aggregateByKey() function we saw in the previous post, the .join() function for RDDs requires a 2-element tuple, with the first element being the key and the second element being the value.  So, we need to use the .map() function to restructure our RDDs to store the keys in the first element and the original array/tuple in the second element.  After the join, we end up with an awkward nested structure of arrays and tuples that we need to restructure using another .map() function, leading to a lengthy code snippet.

This is a place where DataFrames make so much more sense.

By using the same formula, the estimated number of rows is:

`SELECT 0.001992032*1000000--1992.032000000`
This is exactly what we see in the execution plan. OK, that was CL 140, let’s see how SQL Server 2019 handles this simple case.

When we switch to CL 150, the plan and estimations for the c1 column (non-nullable) are the same. However, the estimation for the nullable column is changed!

A user recently asked about an interesting scenario involving tables with foreign key relationships. They wanted to generate the DELETE statements that would allow them to manually delete from child tables first (perhaps in stages), based on criteria they define, and tell them – before performing the deletes – how many rows the operation would affect in each table. They wanted output like this:

`DELETE dbo.ChildTable1 WHERE ParentID < <some constant>; -- This would delete 47 row(s).`

`DELETE dbo.ChildTable2 WHERE ParentID < <some constant>; -- This would delete 14 row(s). ... DELETE dbo.ParentTable WHERE ID < <some constant>; -- This would delete 11 row(s). `

Click through for the solution as well as several caveats.

When you join a table to another table in Power Query, the UI gives you the option to either expand the columns (default) or aggregate the contents of the joint tables. That’s useful if multiple rows are returned for the rows of the table that has been joined to (left table):

But this method is extremely slow. Compared to “simply” expanding all values to new rows (which took around 5 seconds), the aggregation took around 50 seconds. The automatically generated code uses the “Table.AggregateTableColumn”-function.

Read on to see two separate attempts to speed things up.

For this reason, based on the blog post of Antti Rantasaari, and using his code as the base script, I have created a cmdlet Get-LinkedServer with SQLDBATools module which accepts SqlInstance name as a parameter along with -ScriptOut switch, and gives Drop/Create statements for linked servers present on that local/remote SqlInstance.

As a quick note, SQLDBATools is not the same as dbatools.

Part 1 of this article demonstrates how to upload multiple tables from an on-premise SQL Server to an Azure Blob Storage account as csv files.  I covered these basic steps to get data from one place to the other using Azure Data Factory, however there are many other alternative ways to accomplish this, and many details in these steps that were not covered.  For a deep-dive into the details you can start here https://docs.microsoft.com/en-us/azure/data-factory/introduction, and https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-portal#create-a-pipeline

Part 1 was chock full of information, and it looks like Part 2 will be as well.

Marco Russo takes a look at how fast `DISTINCTCOUNT` is in DAX: