
Author: Kevin Feasel

Reverse Engineering The Stream Aggregate Algorithm

Itzik Ben-Gan has started a series of articles on optimizing queries that group and aggregate data, beginning with a reverse engineering of the stream aggregate algorithm:

As you may already know, when SQL Server optimizes a query, it evaluates multiple candidate plans, and eventually picks the one with the lowest estimated cost. The estimated plan cost is the sum of all the operators’ estimated costs. In turn, each operator’s estimated cost is the sum of the estimated I/O cost and estimated CPU cost. The cost unit is meaningless in its own right. Its relevance is in the comparison that the optimizer makes between candidate plans. That is, the costing formulas were designed with the goal that, between candidate plans, the one with the lowest cost will (hopefully) represent the one that will finish more quickly. A terribly complex task to do accurately!

The more the costing formulas adequately take into account the factors that truly affect the algorithm’s performance and scaling, the more accurate they are, and the more likely that given accurate cardinality estimates, the optimizer will choose the optimal plan. At any rate, if you want to understand why the optimizer chooses one algorithm versus another you need to understand two main things: one is how the algorithms work and scale, and another is SQL Server’s costing model.

So back to the plan in Figure 1; let’s try and understand how the costs are computed. As a policy, Microsoft will not reveal the internal costing formulas that they use. When I was a kid I was fascinated with taking things apart. Watches, radios, cassette tapes (yes, I’m that old), you name it. I wanted to know how things were made. Similarly, I see value in reverse engineering the formulas since if I manage to predict the cost reasonably accurately, it probably means that I understand the algorithm well. During the process you get to learn a lot.

Our query ingests 1,000,000 rows. Even with this number of rows, the I/O cost seems to be negligible compared to the CPU cost, so it is probably safe to ignore it.

As for the CPU cost, you want to try and figure out which factors affect it and in what way.
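If you want to follow along with this kind of reverse engineering, the starting point is a grouped query whose plan you can inspect. A minimal sketch, with hypothetical table and column names rather than Itzik's actual setup:

```sql
-- Hypothetical setup: aggregate rows grouped by a column that an index
-- already keeps sorted, which encourages a Stream Aggregate operator.
SELECT custid, COUNT(*) AS numorders
FROM dbo.Orders   -- assumes a supporting index keyed on custid
GROUP BY custid;
```

Display the estimated execution plan (or use SET SHOWPLAN_XML ON) and compare the Stream Aggregate operator's Estimated CPU Cost as you vary the row count and the number of groups; that is the raw material for fitting a costing formula.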

I give this my highest recommendation.


SQL Server 2017 On Linux: The Azure VM Method

Prashanth Jayaram shows how to spin up an Azure VM running SQL Server 2017 on Linux:

To create the VMs, you need to go through these four steps:

  1. Basics, to configure the basic settings of the VM

  2. Size, to choose the VM size

  3. Settings, to configure the features. In this case, the default values are used; you just need to click the Next button to proceed

  4. Purchase

Once it’s running, Prashanth shows how to connect via PuTTY and configure the service.


Retrieving Server And Database Permissions

Kenneth Fisher wants you to know about your SQL Server’s permission setup:

Our host for T-SQL Tuesday this month is Jens Vestergaard (b/t) and he has asked about our favorite SSMS tool. My initial thought was to talk about using solutions in SSMS, but I’d already written about that. My next thought was to write about sp_DBPermissions and sp_SrvPermissions, which of course I’ve written about several times. No big surprise, I wrote them after all.

So what tool am I going to write about? Well, sp_DBPermissions and sp_SrvPermissions of course. I mean I did write them after all, and Jens did say we could brag about something we wrote :).
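If you haven't tried them, each is a single call. A quick sketch (parameter names here are from memory, so check the procedures' headers for the authoritative list):

```sql
-- Server-level principals, permissions, and role memberships.
EXEC dbo.sp_SrvPermissions;

-- Database-level permissions; @DBName scopes the output to one database.
EXEC dbo.sp_DBPermissions @DBName = N'AdventureWorks';
```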

Read on for more information about these useful tools.


Tracking PowerShell Command Execution Time

Constantine Kokkinos shows how to track time spent on the last command in PowerShell:

You can select any property from the output and get just the TotalSeconds, but I like this simple output for when I have to leave some work in progress and I need to come back and check some time in the future.

If you are confused by this code and want further explanations, keep reading!

That’s a lot simpler than the “classic” .NET way of setting up a Stopwatch and tracking changes.
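For comparison, here is a sketch of both approaches; the history-based version is my reconstruction of the idea, not necessarily Constantine's exact code:

```powershell
# Pull the duration of the most recent command from the session history.
$last = Get-History -Count 1
New-TimeSpan -Start $last.StartExecutionTime -End $last.EndExecutionTime

# The "classic" .NET alternative: wrap the work in a Stopwatch.
$sw = [System.Diagnostics.Stopwatch]::StartNew()
Start-Sleep -Seconds 2   # stand-in for the real work
$sw.Stop()
$sw.Elapsed.TotalSeconds
```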


dbatools: The Swiss Army Knife For DBAs

Jess Pomfret uses this T-SQL Tuesday to cover some of her favorite cmdlets in dbatools:

Test-DbaSqlBuild

When I found this command I couldn’t have been more excited. My day-to-day job requires the care and watering of over 100 SQL Server instances of varying versions. Using this command you can get the current build of all your instances and then compare that to the most recent available. There are also parameters for how far you want to be from the latest version. Setting the -latest switch means just that: your server will only be seen as compliant if it’s on the latest release. Passing in -1CU means that it can be no more than one cumulative update behind.
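Based on that description, usage looks something like the following sketch; check Get-Help Test-DbaSqlBuild for the authoritative parameter list:

```powershell
# Flag the instance unless it is on the latest available build.
Test-DbaSqlBuild -SqlInstance sql01 -Latest

# Allow up to one cumulative update of drift before flagging.
Test-DbaSqlBuild -SqlInstance sql01, sql02 -MaxBehind '1CU'
```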

Read on for a few additional useful cmdlets, out of a very large number of them.


Precision And Recall

Brian Lee Yung Rowe makes the important point that model accuracy is not always the ultimate measure:

Now, AI companies are obliged to tell you how great their model is. They may say something like “our model is 95% accurate”. Zowee! But what does this mean exactly? In terms of binary classification it means that the model chose the correct class 95% of the time. This seems pretty good, so what’s the problem?

Suppose I create an AI that guesses the gender of a technical employee at Facebook. As of 2017, 19% of STEM roles are held by women. Behind the scenes, my model is really simple: it just chooses male every time (bonus question: is this AI?). Because of the data, my model will be 81% accurate. Now 95% doesn’t seem all that impressive. This dataset is known to be unbalanced, because the classes are not proportional. A better dataset would have about 50% women and 50% men. So asking if a dataset is balanced helps to identify some tricks that make models appear smarter than they are.

With wildly unbalanced data (like diagnosing rare diseases), measures like positive predictive value are far more important than overall accuracy.
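To make that concrete, here is the confusion-matrix arithmetic for the always-male model as a quick T-SQL scratch calculation, treating "woman" as the positive class:

```sql
-- 100 employees, 19 women; the model predicts male every time, so:
--   TP = 0, FP = 0, FN = 19, TN = 81
SELECT
    (0 + 81) / 100.0 AS accuracy,       -- (TP + TN) / total = 0.81
    0 / 19.0         AS recall,         -- TP / (TP + FN) = 0.0
    NULL             AS precision_ppv;  -- TP / (TP + FP) = 0/0, undefined
```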


Online Tools For Data Professionals

Cathrine Wilhelmsen lists five interesting online tools:

When I need to quickly create smaller sets of test data or dummy data, I use Mockaroo. It is highly configurable with over 140 built-in field types for locations, personal information, product information, technical information and much more. Every field type can be customized, and you can also use your own regular expression to generate data. The data can then be exported to CSV, JSON, SQL, and Excel formats. The interface is simple to use and understand, and you can save your schemas and data sets for later reuse.

I’m fond of Coblis and was aware of the last two, but the first two were new to me.


Faceting With R And SQL Server ML Services

Marlon Ribunal has a quick example showing how to build faceted plots with SQL Server ML Services and ggplot2:

In my previous post, I demonstrated how easy it is to create a bar graph in SQL Server 2017 In-Database Machine Learning using R.

We’re going to build upon that basic graph.

Sometimes data analysis requires us to look at an overview of our data across specific partitions, say a year. For example, we want to see how our product groups fare on a month-to-month basis across the last 4 years.

From a data analytics perspective, there are quite a handful of data points in this requirement – data aggregates (quantity), monthly periods, and year partitions.

One of the approaches to handle such a requirement is by using a facet. Faceting is a way of plotting subsets of data into a matrix of panels based on one or more variables – or facets.

Click through for the example and code. Facets are quite useful, but they run the risk of misleading viewers if you squeeze too many onto the screen. The same line can look quite different with a “tall” facet versus a “wide” facet, and that can change how people interpret your visual.
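For a flavor of what the code looks like, here is a minimal sketch; the table, column, and file names are placeholders, not Marlon's actual code:

```sql
-- Run ggplot2 inside SQL Server ML Services and facet the chart by year.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        library(ggplot2)
        p <- ggplot(InputDataSet, aes(x = OrderMonth, y = Quantity)) +
            geom_col() +
            facet_wrap(~ OrderYear)   # one panel per year
        ggsave("C:/temp/facets.png", p)
    ',
    @input_data_1 = N'SELECT OrderMonth, Quantity, OrderYear FROM dbo.MonthlySales;';
```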


Tools For Various SQL Server Stacks

Warren Estes breaks out some tooling recommendations by stacks—that is, common use cases:

The Admin stack is probably the most important stack here. Are you still running maintenance tasks via SSMS? Stop doing that. Rebuilding indexes every night? Maybe rethink that.

How do you keep track of, monitor, and do basic DBA tasks?

CMS server
OK, so this can involve SSMS, but it’s a feature a lot of people may not use. We use it to keep track of all of our instances and push things…oh baby baaaby! It also allows me to combine it with PoSh to do work against instances, gather data (historical, DMVs, etc.), and do a boatload of admin stuff without pointing and clicking. Heck, I don’t even have to open SSMS to use my CMS server at all.
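As a sketch of that PoSh angle (dbatools cmdlet names have shifted between versions, so treat these as assumptions and verify against your installed module):

```powershell
# Enumerate every instance registered in the CMS, then run a query
# against each one without opening SSMS.
$instances = Get-DbaRegisteredServer -SqlInstance cms01
foreach ($instance in $instances) {
    Invoke-DbaQuery -SqlInstance $instance.ServerName `
        -Query 'SELECT @@SERVERNAME AS ServerName, @@VERSION AS Version;'
}
```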

SentryOne SQLSentry
SQLSentry can automatically defrag indexes and update stats for you. You could use it instead of the choices below for this aspect, if desired. Although not free, it’s an option we have in our environment, and I love me some options.

Ola Hallengren/Minionware
Both are amazing options for backup, reindexing, and CHECKDB. Most places I’ve worked use Ola’s scripts by default. HOWEVER… Minion has some pretty nice options that are FAR more configurable than Ola’s. We have mitigated some large DB issues by rolling our own code on top of Ola’s scripts. We could avoid this by simply using Minionware!

I’m a huge fan of the Minionware suite, as well as several other things Warren mentions.


Restoration And That CHECKDB Message

Mike Fal investigates an interesting message in the SQL Server error log after a database restoration:

Recently I was doing some work with a friend around some database restores. It was pretty routine stuff. However, after one restore my friend came across something in the SQL Error Log that caught him by surprise. As part of the restore, there was a CHECKDB message for the restored database.

My friend’s first reaction was “why is SQL Server doing a DBCC CHECKDB as part of the restore?” He was concerned, because CHECKDB is a pretty hefty operation and this could really impact the restore time if he had to wait on a CHECKDB to complete. But the other confusing thing was that the date for the CHECKDB didn’t match up with the restore timing.

Click through to learn the answer.
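Relatedly, if you ever want to see what date SQL Server has on record for a database's last clean CHECKDB, one common (though undocumented) option is DBCC DBINFO:

```sql
-- Undocumented: dump the boot page and look for the dbi_dbccLastKnownGood
-- field, which records when DBCC CHECKDB last completed cleanly.
DBCC DBINFO ('YourDatabase') WITH TABLERESULTS;
```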
