Press "Enter" to skip to content

Day: May 22, 2017

SQL On Docker Environment Variables

James Anderson lists the current environment variables that you can set for a SQL Server on Linux Docker container:

As of CTP 2.1 for SQL Server 2017 a set of new environment variables are available. These variables allow us to configure each SQL Server container as we spin them up. The first version of SQL Server on Linux came with:

ACCEPT_EULA
SA_PASSWORD

These had to be set for the container to start, and SA_PASSWORD has to be a complex password or the container will not start. CTP 2.1 introduced:

Read on for the new variables and an example on how to use them.
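
A minimal docker run using just the two original variables looks something like this (the password and container name below are made up; the image name is the one Microsoft published for the CTP releases):

docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Str0ng!Passw0rd' \
    -p 1433:1433 --name sql2017 \
    -d microsoft/mssql-server-linux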


Three-Way Variance Analysis

Bogdan Anastasiei shows how to perform a three-way variance analysis when the third-order and second-order effects are both statistically significant:

In the formula above the interaction effect is, of course, dose:gender:type. The ANOVA results can be seen below (we have only kept the line presenting the third-order interaction effect).

                  Df Sum Sq Mean Sq F value   Pr(>F)
dose:gender:type   2    187    93.4  22.367 3.81e-10

The interaction effect is statistically significant: F(2)=22.367, p<0.01. In other words, we do have a third-order interaction effect. In this situation, it is not advisable to report and interpret the second-order interaction effects (they could be misleading). Therefore, we are going to compute the simple second-order interaction effects.

This is definitely not a trivial article, but there are useful techniques in it.
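
For context, the model behind that output is a standard three-way factorial ANOVA; in R it would look roughly like this (the data frame and response variable names are placeholders, not Bogdan's):

# fit all main effects plus every two- and three-way interaction, including dose:gender:type
model <- aov(score ~ dose * gender * type, data = experiment)
summary(model)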


Securing Kafka-To-Spark

Mark Grover explains how to secure communications between Apache Kafka and Apache Spark:

However, to read data from secure Kafka in a distributed fashion, we need Hadoop-style delegation tokens in Kafka (KAFKA-1696), support for which doesn’t exist at the time of this writing (Spring 2017).

We considered various ways to solve this problem but ultimately decided that the recommended solution to read data securely from Kafka (at least until Kafka delegation tokens support is introduced) would be for the Spark application to distribute the user’s keytab so it’s accessible to the executors. The executors will then use the user’s keytab shared with them, to authenticate with the Kerberos Key Distribution Center (KDC) and read from Kafka brokers. YARN distributed cache is used for shipping and sharing the keytab to the driver and executors, from the client (that is, the gateway node). The figure below shows an overview of the current solution.

This turns out to be a bit more difficult than I would have anticipated.
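
The general shape of the workaround is to ship a JAAS configuration and the keytab alongside the job and point both the driver and executor JVMs at them. A hedged sketch of the spark-submit side, with file and class names invented:

spark-submit \
  --files kafka_client_jaas.conf,kafka.service.keytab \
  --driver-java-options "-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
  --class com.example.SecureKafkaStream \
  my-streaming-app.jar

The kafka_client_jaas.conf shipped with the job would point its keyTab entry at ./kafka.service.keytab, which is where the YARN distributed cache drops the file in each container's working directory.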


Basic Q&A Maker With Non-Default Error Message

Steph Locke shows how to take a Q&A bot and change the default message when there is no relevant match:

Last week, I blogged about building a no-code Q&A bot for your website. One little niggle I had with the bot was the response when it couldn’t match a user input to a Q&A. I wondered how to change “No match found!”.

I looked around the qnamaker.ai site and couldn’t find a place I could change this. I submitted some feedback and the great people at the other end of the Q&A site responded super quickly. I’ve raised a number of feedback points with them and I must say they’re absolutely amazing at responding. But I digress.

Read on for the answer.


Reusing Power BI Datasets

Melissa Coates shows a few ways to reduce the number of redundant Power BI datasets in an environment:

To reuse an imported dataset, there are three options I’m aware of:

  1. Report in the Power BI Service. This refers to using the web interface for creating a new report separate from the original PBIX file.

  2. Analyze In Excel. This refers to creating an Excel report and can currently be used by anyone with read or edit access to the dataset. Hence, very useful for self-service BI purposes.

  3. Power BI Service Live Connection. This refers to creating a Power BI Desktop report. This option can currently only be used by people with edit permissions on the dataset (so not appropriate for broad self-service BI reporting at this time).

Click through for a detailed demonstration of each.


Gathering Detailed System Information On A Set Of Servers

Amy Herold has a quick PowerShell script to retrieve detailed system info (via msinfo32.exe) for a set of servers:

With this script you can generate system information files and save them to a specified location. It makes sure a connection can be made to the server first, and then outputs the file. The files are created one at a time, so if you pass in a longer list of servers, you shouldn’t crash your machine. From my testing, this will take some time to run as these files don’t output quickly. Despite that, the output is worth it. This can be modified to pull your list of servers from a file or from a Central Management Server (CMS) instance.

This is a useful script, with the next step being to turn it into a cmdlet that accepts the set of servers from the pipeline.
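
This is not Amy's script, but the approach she describes boils down to something like the following sketch (server names and output path are placeholders):

$servers   = 'SQLPROD01', 'SQLPROD02'      # or pull these from a file or a CMS query
$outputDir = 'C:\Temp\SystemInfo'

foreach ($server in $servers) {
    # only attempt the export if the server responds
    if (Test-Connection -ComputerName $server -Count 1 -Quiet) {
        $reportPath = Join-Path $outputDir "$server.txt"
        # msinfo32 /computer targets a remote machine; /report writes a plain-text summary
        Start-Process -FilePath 'msinfo32.exe' -ArgumentList "/computer $server /report `"$reportPath`"" -Wait
    }
    else {
        Write-Warning "Could not reach $server, skipping it"
    }
}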


Azure SQL DW Reference Architectures

James Serra shows how Azure SQL Data Warehouse can fit into various warehousing architectures:

Do staging, data refinement and reporting all from SQL DW. You can scale compute power up when needed (i.e. during staging, data refinement, or a large number of users doing reporting) or down to save costs (i.e. nights and weekends when user reporting is low). The pros of this option are that, by reducing the number of technologies, you are building a simpler solution and reducing the number of copies of the data. The cons are that, since everything is done on SQL DW, you can have performance issues (i.e. doing data refinement while users are reporting), can hit the SQL DW concurrent query limit, and can have a higher cost since SQL DW is the highest-cost product, especially if you are not able to pause it. Pausing it reduces your compute cost to zero, leaving only storage to pay for (see Azure SQL Data Warehouse pricing), but no one can use SQL DW while it is paused.

Click through for three other architecture ideas.
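
The scale-up/scale-down lever James mentions is a one-line change in T-SQL; the database name and service objectives below are only examples:

-- scale up ahead of staging, refinement, or a heavy reporting window
ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW600');

-- scale back down (or pause compute entirely from the portal or PowerShell) when demand drops
ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW100');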


Loading Azure SQL DW: Embrace The CTAS

John Hoang, Joe Sack, and Martin Lee explain different loading patterns for loading Azure SQL Data Warehouse:

SQL Data Warehouse supports many loading methods, including SSIS, BCP, the SQLBulkCopy API, and Azure Data Factory (ADF). These methods all share a common pattern for data ingestion. By comparison, the PolyBase technology uses a different approach that provides better performance.

PolyBase is by far the fastest and most scalable SQL Data Warehouse loading method to date, so we recommend it as your default loading mechanism. PolyBase is a scalable, query processing framework compatible with Transact-SQL that can be used to combine and bridge data across relational database management systems, Azure Blob Storage, Azure Data Lake Store and Hadoop database platform ecosystems (APS only).

If you’re curious about Azure SQL Data Warehouse, this is a must-read.
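
The "embrace the CTAS" pattern in the title amounts to defining a PolyBase external table over the source files and then materializing it with CREATE TABLE AS SELECT; here is a minimal sketch with invented object names:

-- load from an external (PolyBase) table into a distributed, columnstore-backed table
CREATE TABLE dbo.FactSales
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM ext.FactSales;   -- external table pointing at files in Blob Storage or Data Lake Store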


More On Adaptive Joins

Erik Darling has a couple more posts on adaptive joins in SQL Server 2017.  First, he wonders what happens when you add scalar functions to the mix:

See, this isn’t SARGable either (and no, SCHEMABINDING doesn’t change this). When a predicate isn’t SARGable, you take away an index seek as an access choice. You don’t see too many Nested Loops with an index scan on the other end, do you?

No.

So there you go. It’s not the function itself that bops our Adaptive Join on the head, but the lack of SARGability.
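
Erik's SARGability point is the familiar one: wrap the column in a function and the optimizer can no longer seek on it. A minimal illustration, with the table and scalar function invented for the example:

-- SARGable: the column is untouched, so an index on CreationDate can be sought
SELECT Id FROM dbo.Posts WHERE CreationDate >= '2017-01-01';

-- not SARGable: the scalar function hides the column, forcing a scan
SELECT Id FROM dbo.Posts WHERE dbo.CleanDate(CreationDate) >= '2017-01-01';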

Erik also looks at using APPLY and EXISTS to make sure you can use adaptive joins with that dynamic duo:

I do have to point out that Cross Apply used to only be implemented as a Nested Loops Join. I learned that many years ago from one of the best articles written about Cross Apply by Paul White. That changed recently — it’s possible to see it implemented with a Hash Join in at least 2016. I’ve seen it crop up in Cross Apply queries without a TOP operator.

The latter results are a bit surprising.
