2018-05-10 – Curated SQL

An RDS Postmortem

Published 2018-05-10 by Kevin Feasel

Andy Isaacson covers a performance issue Honeycomb experienced with RDS:

In retrospect, the failure chain had just 4 links:

The RDS MySQL database instance backing our production environment experienced a sudden and massive performance degradation; P95(query_time) went from 11 milliseconds to >1000 milliseconds, while write-operation throughput dropped from 780/second to 5/second, in just 20 seconds.
RDS did not identify a failure, so the Multi-AZ feature did not activate to fail over to the hot spare.
As a result of the delays due to increased query_time, the Go MySQL client’s connection pool filled up with connections waiting for slow query results to come back and opened more connections to compensate.
This exceeded the max_connections setting on the MySQL server, leading to cron jobs and newly-started daemons being unable to connect to the database and triggering many “Error 1040: Too many connections” log messages.

This was very interesting to read, and I applaud companies making public these kinds of post-mortems, especially because the idea of publicizing the reasons for failures is so scary.

Comments closed

Machine Learning From Kafka

Published 2018-05-10 by Kevin Feasel

Kai Waehner has a post covering a recent talk he did on using Kafka as a data source for neural networks:

This talk shows how to build Machine Learning models at extreme scale and how to productionize the built models in mission-critical real time applications by leveraging open source components in the public cloud. The session discusses the relation between TensorFlow and the Apache Kafka ecosystem – and why this is a great fit for machine learning at extreme scale.

The Machine Learning architecture includes: Kafka Connect for continuous high volume data ingestion into the public cloud, TensorFlow leveraging Deep Learning algorithms to build an analytic model on powerful GPUs, Kafka Streams for model deployment and inference in real time, and KSQL for real time analytics of predictions, alerts and model accuracy.

Sensor analytics for predictive alerting in real time is used as real world example from Internet of Things scenarios. A live demo shows the out-of-the-box integration and dynamic scalability of these components on Google Cloud.

Check out the slide deck as well for more details.

Comments closed

Overfitting With Polynomial Regression

Published 2018-05-10 by Kevin Feasel

Vincent Granville shows us a few problems with polynomial regression:

Even if the function to be estimated is very smooth, due to machine precision, only the first three or four coefficients can be accurately computed. With infinite precision, all coefficients would be correctly computed without over-fitting. We first explore this problem from a mathematical point of view in the next section, then provide recommendations for practical model implementations in the last section.

This is also a good read for professionals with a math background interested in learning more about data science, as we start with some simple math, then discuss how it relates to data science. Also, this is an original article, not something you will learn in college classes or data camps, and it even features the solution to a linear regression involving an infinite number of variables.

Granville’s point that overfitting is a relatively small concern is rather interesting. But the advice to avoid polynomial regression is generally pretty solid.

Comments closed

Building Flow Charts In R

Published 2018-05-10 by Kevin Feasel

Alan Haynes shows how to build flow charts in R using the grid Gmisc packages:

Flow charts are an important part of a clinical trial report. Making them can be a pain though. One good way to do it seems to be with the grid and Gmisc packages in R. X and Y coordinates can be designated based on the center of the boxes in normalized device coordinates (proportions of the device space – 0.5 is this middle) which saves a lot of messing around with corners of boxes and arrows.

A very basic flow chart, based very roughly on the CONSORT version, can be generated as follows…

Click through for sample code and a resulting image. H/T R-bloggers

Comments closed

Building Palettes From Pictures In R

Published 2018-05-10 by Kevin Feasel

Andrea Cirillo takes inspiration from the great works to build palettes:

If you see this painting you will find a profound of colours with a great equilibrium between different hues, the hardy usage of complementary colours and the ability expressed in the “chiaroscuro” technique. While I was looking at the painting I started, wondering how we moved from this wisdom to the ugly charts you can easily find within today’s corporate reports ( find a great sample on the WTF visualization website)

This is where Paletter comes from: bring the Renaissance wisdom and beauty within the plots we produce every day.

Introducing paletter

PaletteR is a lean R package which lets you draw from any custom image an optimized palette of colours. The package extracts a custom number of representative colours from the image. Let’s try to apply it on the “Vergine con il Bambino, angeli e Santi” before looking into its functional specification.

It’s an interesting package. I’ll have to play around with it.

Comments closed

SSMS 17.7 Released

Published 2018-05-10 by Kevin Feasel

Alan Yu announces SQL Server Management Studio 17.7:

In addition to enhancements and bug fixes, SSMS 17.7 comes with several exciting new features:

Support package scheduling in Azure-SSIS integration runtime.
Support for SSIS package scheduling in SQL Agent on SQL Managed instance. It is now possible to create SQL Agent jobs to execute SSIS packages on the managed instance.
Replication monitor now supports registering a listener for scenarios where publisher database and/or distributor database is part of Availability Group. So with this release of SSMS, you can monitor replication environments where publisher database and/or distribution database is part of Always On.

There are also several bugfixes that they call out.

Comments closed

Not-So-Trivial Plans And Filtered Indexes

Published 2018-05-10 by Kevin Feasel

Erik Darling continues his research into trivial plans and parameterization:

No. Really?

The plan is still showing us a warning, even though we see a literal in the cache.

This is obviously wrong. And very confusing.

Read on for the answer.

Comments closed

Saving A SQL Server Image In The Azure Container Repository

Published 2018-05-10 by Kevin Feasel

Andrew Pruski shows us how to store Docker images in Azure:

I’m going to push my custom dbafromthecold/sqlserverlinuxagent image. It’s a public image so if you want to use it, just run: –
docker pull dbafromthecold/sqlserverlinuxagent:latest
So similar to pushing to the Docker hub, we need to tag the image with the login server name that we retrieved a couple of commands ago and the name of the image: –
docker tag dbafromthecold/sqlserverlinuxagent apcontainerregistry01.azurecr.io/sqlserverlinuxagent:latest

Read on for a full example.

Comments closed

Getting Per-Table Space Utilization With Powershell

Published 2018-05-10 by Kevin Feasel

Drew Furgiuele provides us a script and a homework assignment:

Of course, PowerShell excels at this. By using the SQL Server module, it’s really easy to:

Connect to an instance and collect every user database, and

From each database, collect every table, and

For each table, collect row counts and space used, and

If there are any indexes, group them, and sum their usage and report that as well

Here’s the script. Note that I have the server name hard-coded in there as localhost (more on that in a coming paragraph). Go ahead and take a look before we break it down.

Click through for the script, and homework is due next Tuesday on his desk.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Day: May 10, 2018

Introducing paletter