
Curated SQL Posts

Scientific Debt

David Robinson gives us the data science analog to technical debt:

In my new job as Chief Data Scientist at DataCamp, I’ve been thinking about the role of data science within a business, and discussing this with other professionals in the field. On a panel earlier this year, I realized that data scientists have a rough equivalent to this concept: “scientific debt.”

Scientific debt is when a team takes shortcuts in data analysis, experimental practices, and monitoring that could have long-term negative consequences. When you hear a statement like:

  • “We don’t have enough time to run a randomized test, let’s launch it”
  • “To a first approximation this effect is probably linear”
  • “This could be a confounding factor, but we’ll look into that later”
  • “It’s directionally accurate at least”

you’re hearing a little scientific debt being “borrowed”.

Read the whole thing.  I strongly agree with the premise.

Is Query Folding Taking Place?

Chris Webb has a quick way of seeing if query folding is taking place when importing data from Analysis Services into Power Query (either Power BI or Excel):

As a quick follow-on from last week’s post on how to detect whether query folding is taking place when importing from OData data sources, if you’re importing data from Analysis Services you have a similar problem: how do you know whether query folding is taking place? Ensuring that query folding takes place for as many of the steps in your query – especially those that filter or otherwise reduce the amount of data returned – is very important for data refresh performance.

Although the Power Query engine generates MDX queries when importing from Analysis Services in the same way it generates SQL queries when it imports from a relational database, the View Native Query option doesn’t work for Analysis Services data sources. You can of course use a Profiler trace or xEvents to see the MDX, but for most users that will not be an option for security reasons.

Read on for a better alternative.
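
As a quick illustration of what folding buys you against a relational source, a filter step that folds ends up inside the generated native query instead of being applied after the rows arrive. The table and column names below are invented purely for illustration:

    -- Roughly what a folded Power Query filter step translates to against a
    -- relational source; only rows matching the predicate cross the wire.
    -- [dbo].[Sales] and its columns are hypothetical.
    SELECT [OrderID], [OrderDate], [Amount]
    FROM [dbo].[Sales]
    WHERE [OrderDate] >= '2018-01-01';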

Resetting Slicers In Power BI With Bookmarks

Reza Rad shows how to use a bookmark to set the default state of a Power BI dashboard and return to it with a single button click:

Using bookmarks for clearing all slicers in Power BI is not a new function; I have been using it for many months and advising many people to do it that way. However, I still get a lot of questions in my presentations about how to do that, which is why I ended up writing this post. This post shows you a very quick trick for having a button to clear all slicers, and the magic is all happening with bookmarks. Bookmarks store the state of a Power BI page and can be used in many scenarios; in this post, I only show you the ability to clear all slicers on a page. To learn more about Power BI, read the Power BI book from Rookie to Rock Star.

Click through for an example of how to set this up.

Constraints On Temp Tables

Kenneth Fisher argues that you should use default naming for temp table constraints:

You should be able to create a #temp in every session. That’s the idea, right? It’s one of the things that differentiates a global temp table from a local temp table. But there can be some difficulties with that.

If you are working with reusable code that uses temp tables (a stored procedure, for example), sometimes you need to create a constraint. The thing about constraints is that their names are just as unique as those of tables, stored procedures, etc.; in other words, a constraint name can only be used once. You can’t have two tables with the same constraint name. In fact, you can’t even have a constraint name that matches the name of a table, stored procedure, or other object.

There’s some solid advice in this post.
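
As a minimal sketch of the problem (table and constraint names invented), the first definition below fails in the second of two concurrent sessions because the explicit constraint name, unlike the #temp table name, is not made unique per session inside tempdb. The second definition lets SQL Server generate a unique name and is safe to run concurrently:

    -- Explicitly named constraint: collides across concurrent sessions
    CREATE TABLE #Orders
    (
        OrderID int NOT NULL,
        CONSTRAINT PK_Orders_Temp PRIMARY KEY (OrderID)
    );
    DROP TABLE #Orders;

    -- System-named constraint: each session gets its own unique name
    CREATE TABLE #Orders
    (
        OrderID int NOT NULL PRIMARY KEY
    );
    DROP TABLE #Orders;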

Docker Containers For Multiple Versions Of SQL Server

Rob Sewell shows us how to create Docker containers for multiple versions of SQL Server running on the same machine:

I want to show the two modules running against a number of SQL versions, so I have installed

  • 2 Domain Controllers
  • 2 SQL 2017 instances on Windows 2016 with an Availability Group and WideWorldImporters database
  • 1 Windows 2016 jump box with all the programmes I need
  • 1 Windows 2016 with containers

using a VSTS build and this set of ARM templates and scripts.

I wanted to create containers running SQL2017, SQL2016, SQL2014 and SQL2012 and restore versions of the AdventureWorks database onto each one.

Rob shows how to do this all via PowerShell so you can automate the process.
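
Since the excerpt mentions restoring AdventureWorks onto each container, it may help to see what that per-container piece amounts to in T-SQL. This is a hedged sketch: the backup path, target data path, and logical file names are assumptions (check them with RESTORE FILELISTONLY), not details from Rob’s post:

    -- Inspect the logical file names in the backup (paths are hypothetical)
    RESTORE FILELISTONLY
    FROM DISK = N'C:\Backups\AdventureWorks2014.bak';

    -- Restore into the container, relocating files to the instance's data path
    RESTORE DATABASE AdventureWorks2014
    FROM DISK = N'C:\Backups\AdventureWorks2014.bak'
    WITH MOVE N'AdventureWorks2014_Data' TO N'C:\SQLData\AdventureWorks2014.mdf',
         MOVE N'AdventureWorks2014_Log'  TO N'C:\SQLData\AdventureWorks2014_log.ldf',
         REPLACE;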

When Stream Aggregates Require Sorting

Itzik Ben-Gan continues his series on grouping and aggregation:

Let’s try and figure out the costing formula for the Sort operator. Remember, our focus is the estimated cost and scaling because our ultimate goal is to figure out optimization thresholds where the optimizer changes its choices from one strategy to another.

The I/O cost estimate seems to be fixed: 0.0112613. I get the same I/O cost irrespective of factors like number of rows, number of sort columns, data type, and so on. This is probably to account for some anticipated I/O work.

As for the CPU cost, alas, Microsoft doesn’t publicly expose the exact algorithms that they use for sorting. However, among the common algorithms used for sorting by database engines in general are different implementations of merge sort and quicksort. Thanks to efforts made by Paul White, who’s fond of looking at Windows debugger stack traces (not all of us have the stomach for this), we have a bit more insight into the topic, published in his series “Internals of the Seven SQL Server Sorts.” According to Paul’s findings, the general sort class (used in the above plan) uses merge sort (first internal, then transitioning to external). On average, this algorithm requires n log n comparisons to sort n items. With this in mind, it’s probably a safe bet as a starting point to assume that the CPU part of the operator’s cost is based on a formula such as:

Operator CPU cost = <startup cost> + @numrows * LOG(@numrows) * <comparison cost>

Of course, this could be an oversimplification of the actual costing formula that Microsoft uses, but absent any documentation on the matter, this is an initial best guess.

It’s interesting to see how the calculation changes in form with larger numbers of rows.
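
To get a feel for that shape, you can plug a few row counts into the assumed formula. The constants below are placeholders I made up, not Microsoft’s actual values; the point is only how the n * LOG(n) term comes to dominate as the row count grows:

    -- Placeholder constants, purely to illustrate the n * LOG(n) scaling
    DECLARE @startupCost    float = 0.000038;
    DECLARE @comparisonCost float = 0.0000022;

    SELECT n                                           AS numrows,
           @startupCost + n * LOG(n) * @comparisonCost AS estimated_cpu_cost
    FROM (VALUES (1e3), (1e4), (1e5), (1e6)) AS v(n);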

Running SSMS As A Different User

Greg Low shows how to run SQL Server Management Studio as a different Windows user:

Now if all you want to do is to use a SQL Server login, then that’s easy. When you connect to a server in Object Explorer, or when you start a new Database Engine query, you can just choose SQL authentication instead.

But three other scenarios commonly occur.

If you need to run SSMS as an administrator on a machine with UAC, you can do this:

You right-click the link to SSMS and choose Run as administrator.

I will most commonly use option number three: when somebody gives me AD credentials for a separate server but my laptop is not a member of that domain.
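
Whichever route you use, it is easy to lose track of which credentials SSMS actually connected with. A quick sanity-check query against a standard DMV (not something from Greg’s post) confirms it:

    -- Confirm the login and authentication scheme for the current session
    SELECT SUSER_SNAME()    AS connected_as,
           ORIGINAL_LOGIN() AS original_login,
           auth_scheme      AS authentication
    FROM sys.dm_exec_connections
    WHERE session_id = @@SPID;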

An RDS Postmortem

Andy Isaacson covers a performance issue Honeycomb experienced with RDS:

In retrospect, the failure chain had just 4 links:

  1. The RDS MySQL database instance backing our production environment experienced a sudden and massive performance degradation; P95(query_time) went from 11 milliseconds to >1000 milliseconds, while write-operation throughput dropped from 780/second to 5/second, in just 20 seconds.

  2. RDS did not identify a failure, so the Multi-AZ feature did not activate to fail over to the hot spare.

  3. As a result of the delays due to increased query_time, the Go MySQL client’s connection pool filled up with connections waiting for slow query results to come back and opened more connections to compensate.

  4. This exceeded the max_connections setting on the MySQL server, leading to cron jobs and newly-started daemons being unable to connect to the database and triggering many “Error 1040: Too many connections” log messages.

This was very interesting to read, and I applaud companies that make these kinds of post-mortems public, especially because the idea of publicizing the reasons for failures is so scary.
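
If you want to watch for this particular failure mode on a MySQL instance of your own, the ceiling and the pressure against it are exposed through standard server variables and status counters; nothing below is specific to Honeycomb’s setup:

    -- Configured connection ceiling
    SHOW VARIABLES LIKE 'max_connections';

    -- Current connections and the high-water mark since the last restart
    SHOW GLOBAL STATUS LIKE 'Threads_connected';
    SHOW GLOBAL STATUS LIKE 'Max_used_connections';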

Machine Learning From Kafka

Kai Waehner has a post covering a recent talk he did on using Kafka as a data source for neural networks:

This talk shows how to build Machine Learning models at extreme scale and how to productionize the built models in mission-critical real time applications by leveraging open source components in the public cloud. The session discusses the relation between TensorFlow and the Apache Kafka ecosystem – and why this is a great fit for machine learning at extreme scale.

The Machine Learning architecture includes: Kafka Connect for continuous high volume data ingestion into the public cloud, TensorFlow leveraging Deep Learning algorithms to build an analytic model on powerful GPUs, Kafka Streams for model deployment and inference in real time, and KSQL for real time analytics of predictions, alerts and model accuracy.

Sensor analytics for predictive alerting in real time is used as real world example from Internet of Things scenarios. A live demo shows the out-of-the-box integration and dynamic scalability of these components on Google Cloud.

Check out the slide deck as well for more details.
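
As a taste of the KSQL piece of that architecture, here is a minimal sketch of watching prediction quality on a stream; the topic name, columns, and window size are all invented for illustration:

    -- Hypothetical stream over a Kafka topic carrying model predictions
    CREATE STREAM predictions (sensor_id VARCHAR, predicted DOUBLE, observed DOUBLE)
      WITH (KAFKA_TOPIC = 'model-predictions', VALUE_FORMAT = 'JSON');

    -- Per-sensor prediction error over one-minute tumbling windows
    SELECT sensor_id,
           COUNT(*)                       AS events,
           MAX(ABS(predicted - observed)) AS worst_abs_error
    FROM predictions
    WINDOW TUMBLING (SIZE 1 MINUTE)
    GROUP BY sensor_id;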

Overfitting With Polynomial Regression

Vincent Granville shows us a few problems with polynomial regression:

Even if the function to be estimated is very smooth, due to machine precision, only the first three or four coefficients can be accurately computed. With infinite precision, all coefficients would be correctly computed without over-fitting. We first explore this problem from a mathematical point of view in the next section, then provide recommendations for practical model implementations in the last section.

This is also a good read for professionals with a math background interested in learning more about data science, as we start with some simple math, then discuss how it relates to data science. Also, this is an original article, not something you will learn in college classes or data camps, and it even features the solution to a linear regression involving an infinite number of variables.

Granville’s point that overfitting is a relatively small concern is rather interesting.  But the advice to avoid polynomial regression is generally pretty solid.
