
Author: Kevin Feasel

Network And Sankey Diagrams In Python And R

Tony Hirst has a roundup of various R and Python packages which build network charts or Sankey diagrams:

Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use) looks like it could be useful here, if I can work out how to do the set-up correctly!

Sankey diagrams are on my list of dangerous visuals: done right, they are informative, but it’s easy to try to put too much into the diagram and thereby confuse everybody.
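If sankeyview’s setup proves stubborn, plotly’s Sankey trace is one readily available alternative for sketching the same idea. A minimal example (module names and flow values invented for illustration; note the last link flows backwards, i.e. a feedback loop):

    import plotly.graph_objects as go

    # Hypothetical flows between course modules; source/target are
    # indexes into the labels list, value is the flow magnitude.
    labels = ["Module A", "Module B", "Module C"]
    fig = go.Figure(go.Sankey(
        node=dict(label=labels, pad=15, thickness=20),
        link=dict(
            source=[0, 1, 2],
            target=[1, 2, 0],   # the 2 -> 0 link is a feedback loop
            value=[8, 6, 2]
        )
    ))
    fig.show()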


Predicting Restaurant Reservations With A Neural Net

Kevin Jacobs builds a simple neural net using Pandas and sklearn:

The first thing to notice is that our values are not normalized. The number of visitors is a number and gets larger and larger. To normalize it, we simply divide it by 100, so that all numbers are below 1. The same holds for the lag. Most of the lags are lower than 30. Therefore, I will divide the lag size by 30.

Notice that there are many more approaches for normalizing the data! This is just a quick normalization of the data, but feel free to use your own normalization method. My normalization process is closely related to the MinMaxScaler normalization which can be found in sklearn (scikit-learn).

With just a few lines of Python code we can create a Multi-Layer Perceptron (MLP):

Click through for the code.
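As a rough sketch of the approach, assuming a Pandas DataFrame with visitors and lag columns (the column names and data here are made up), the ad hoc normalization plus an sklearn MLP might look like:

    import pandas as pd
    from sklearn.neural_network import MLPRegressor

    # Hypothetical reservation data; column names invented for illustration
    df = pd.DataFrame({"visitors": [25, 40, 13, 55],
                       "lag": [1, 7, 14, 30],
                       "reservations": [20, 35, 10, 50]})

    # The quick-and-dirty normalization from the post: divide visitors
    # by 100 and lag by 30 so both features land roughly in [0, 1]
    X = df[["visitors", "lag"]].copy()
    X["visitors"] /= 100
    X["lag"] /= 30
    y = df["reservations"]

    # A small Multi-Layer Perceptron; layer sizes are arbitrary here
    mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                       random_state=42)
    mlp.fit(X, y)
    print(mlp.predict(X))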


“Caveman” Graphs In SQL

Denis Gobo puts together some basic Management Studio data visualization:

I found this technique on Rich Benner’s SQL Server Blog: Visualising the Marvel Cinematic Universe in T-SQL and decided to play around with it after someone asked me to give him the sizes of all databases on a development instance of SQL Server.

The way it works is that you take the size of the database and then divide that number by the total size of all databases. You then use the replicate function with the | (pipe) character to generate the ‘graph’, so 8% will look like this: ||||||||

You can use this for tables with the most rows, a count per state, etc. Looking at the output, the graph column adds a nice visual effect IMHO.

It does the job and doesn’t require you to go out to a different product, so it works pretty well for occasional administrative queries.
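A minimal sketch of the technique against sys.master_files (the size math is simplified; size is reported in 8 KB pages):

    -- One row per database: size in MB plus a pipe-character bar sized
    -- by that database's share of the total
    SELECT
        d.name,
        SUM(mf.size) * 8 / 1024 AS size_mb,
        REPLICATE('|', CAST(100.0 * SUM(mf.size)
            / SUM(SUM(mf.size)) OVER () AS int)) AS graph
    FROM sys.master_files mf
        INNER JOIN sys.databases d
            ON d.database_id = mf.database_id
    GROUP BY d.name
    ORDER BY size_mb DESC;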


Locks And Partitioning

Erik Darling looks at the confusing mess that is SQL Server partitioning:

In the Chicago perf class last month, we had a student ask if partition level locks would ever escalate to a table level lock. I wrote up a demo and everything, but we ran out of time before I could go over it.

Not that I’m complaining — partitioning, and especially partition level locking, can be pretty confusing to look at.

If you really wanna learn about it, you should talk to Kendra — after all, this post is where I usually send folks who don’t believe me about the performance stuff.

Click through for that demo and explanation.
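For reference, the behavior hinges on the table’s LOCK_ESCALATION setting: the default (TABLE) escalates straight to a table-level lock, while AUTO lets escalation target individual partitions on a partitioned table. A quick sketch (the table name is hypothetical):

    -- Allow escalation to the partition level instead of the whole table
    ALTER TABLE dbo.PartitionedTable SET (LOCK_ESCALATION = AUTO);

    -- Verify the current setting
    SELECT name, lock_escalation_desc
    FROM sys.tables
    WHERE name = 'PartitionedTable';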


Getting Started With Always Encrypted

Monica Rathbun kicks off a series on Always Encrypted:

There are two possibilities: Deterministic and Randomized.

MSDN defines Deterministic encryption as encryption which always generates the same encrypted value for any given plain text value. This means that if you have a birthdate of 01/03/1958, it will always be encrypted to the same value each time, such as ABCACBACB. This allows you to index it and use it in WHERE clauses, GROUP BY clauses, and JOINs.

Randomized encryption, per MSDN, uses a method that encrypts data in a less predictable manner. This makes Randomized encryption more secure because, using the example above, each encrypted value of 01/03/1958 will be different. It could be ABCACBACB, BBBCCAA, or CCCAAABBB. All three encrypted values are subsequently decrypted to the same value. Since the encrypted value is random, you cannot perform search operations on it as you can with Deterministic.

Part 1 is about building the certificates and keys needed to encrypt data.
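To make the two types concrete, here is a sketch of how they appear in a column definition once the keys are in place (the column encryption key name is hypothetical; the algorithm string is the one Always Encrypted supports):

    CREATE TABLE dbo.Person
    (
        -- Deterministic: same plaintext, same ciphertext, so equality
        -- searches, joins, and grouping still work
        BirthDate date
            ENCRYPTED WITH (
                COLUMN_ENCRYPTION_KEY = CEK_Demo,
                ENCRYPTION_TYPE = DETERMINISTIC,
                ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
            ) NOT NULL,
        -- Randomized: same plaintext, different ciphertext every time,
        -- more secure but not searchable
        SocialSecurityNumber char(11) COLLATE Latin1_General_BIN2
            ENCRYPTED WITH (
                COLUMN_ENCRYPTION_KEY = CEK_Demo,
                ENCRYPTION_TYPE = RANDOMIZED,
                ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
            ) NOT NULL
    );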


Upgrading That Expired Evaluation Copy Of SQL Server

Cody Konior finds a way to extricate the poor souls who need to upgrade expired evaluation copies of SQL Server from their mess:

Common advice here is to set the clock backwards. My problem with that is that you’re probably doing this on an unsupported unknown black-box flaming garbage can of a system set up by someone who wasn’t meant to do it – because otherwise they wouldn’t be using the evaluation edition. So what are the repercussions of setting the clock backwards? Perhaps their application spawning silently in the background and trashing this or other databases with bad date information? Perhaps you’ll lose your RDP connection and then be unable to connect back in because of the SSPI error generated by a clock mismatch?

No thanks. Instead you need to do some detective work.

Read the whole thing.


Using Pester For Configuration Checks

Andrew Pruski shows how to use Pester to audit SQL Server configuration settings:

One Pester test running!

What I like about this is that it can be easily dropped into a job scheduler (e.g., Jenkins) and then you’ve got a way to routinely check (and correct) all the configuration settings of the SQL instances that you monitor.

Pester would not have been my first thought for configuration checking, but it does serve as another useful option.
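For flavor, here is a minimal sketch of such a test, using Pester along with the SqlServer module’s Invoke-Sqlcmd (the instance name and expected value are made up):

    Describe "SQL Server configuration" {
        It "has the expected max degree of parallelism" {
            $maxdop = (Invoke-Sqlcmd -ServerInstance "SQL01" -Query "
                SELECT value_in_use
                FROM sys.configurations
                WHERE name = 'max degree of parallelism';").value_in_use
            $maxdop | Should -Be 8
        }
    }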


Bridging The R-Python Gap

Siddarth Ramesh argues that revoscalepy helps R developers acquaint themselves with Python:

I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.

Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. Unfortunately, many times, this is more challenging because the production system might demand that the entire end-to-end workflow be built exclusively in Python. That’s sometimes tough because there are aspects of statistical model building in R which are more intuitive than in Python.

Python has many strengths, such as its robust data structures (e.g., dictionaries), its compatibility with Deep Learning and Spark, and its ability to be a multipurpose language. However, many scenarios in enterprise analytics require people to go back to basic statistics and Machine Learning, for which the classic Data Science packages in Python are not as intuitive as R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft, revoscalepy.

Having worked with both, I find my loyalties tend to lie with R for a couple of reasons. But this might help some people bridge the gap.
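The draw of revoscalepy is that it keeps RevoScaleR’s R-style formula interface in Python. A minimal sketch, assuming a SQL Server Machine Learning Services or Machine Learning Server install where the package is available (the data and column names are invented):

    import pandas as pd
    from revoscalepy import rx_lin_mod, rx_predict

    # Hypothetical data; the point is the R-style formula below
    df = pd.DataFrame({"sales": [10, 20, 30, 40],
                       "ads": [1, 2, 3, 4],
                       "price": [5, 4, 3, 2]})

    # R users will recognize this from lm() / rxLinMod()
    model = rx_lin_mod(formula="sales ~ ads + price", data=df)
    print(rx_predict(model, data=df))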


Using Keras To Predict Customer Churn

Matt Dancho has an example of building a neural net using Keras to predict customer churn:

Pro Tip: A quick test is to see if the log transformation increases the magnitude of the correlation between “TotalCharges” and “Churn”. We’ll use a few dplyr operations along with the corrr package to perform a quick correlation.

  • correlate(): Performs tidy correlations on numeric data

  • focus(): Similar to select(). Takes columns and focuses on only the rows/columns of importance.

  • fashion(): Makes the formatting aesthetically easier to read.
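Put together, the check might look something like this sketch (churn_tbl is a stand-in for the post’s prepared data, with Churn recoded to numeric first):

    library(dplyr)
    library(corrr)

    # Does log-transforming TotalCharges strengthen its correlation
    # with Churn?
    churn_tbl %>%
      select(Churn, TotalCharges) %>%
      mutate(
        Churn = ifelse(Churn == "Yes", 1, 0),
        LogTotalCharges = log(TotalCharges)
      ) %>%
      correlate() %>%   # tidy correlation matrix on numeric columns
      focus(Churn) %>%  # keep only correlations with Churn
      fashion()         # prettier formatting of the output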

This is a very useful tutorial.
