Press "Enter" to skip to content

Curated SQL Posts

Scraping SQL Saturday Statistics

Tomaz Kastrun shows how to use rvest to read the SQL Saturday website and parse schedule details:

I wanted to check a simple query: How many times has a particular topic been presented, and by how many different presenters?

Sounds interesting, and tackling it should not be a problem; just be aware that the end numbers may vary, since there will be some text analysis included.
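
Once the schedule data has been scraped into a table, the counting question itself reduces to a simple aggregation.  A minimal T-SQL sketch of that last step, against a hypothetical dbo.SqlSaturdaySessions table (the scraping itself is R and rvest, as the post shows):

  -- Hypothetical schema; the real work of populating it is the rvest scraping.
  SELECT
      session_topic,
      COUNT(*) AS times_presented,
      COUNT(DISTINCT presenter_name) AS distinct_presenters
  FROM dbo.SqlSaturdaySessions
  GROUP BY session_topic
  ORDER BY times_presented DESC;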

Read on for the code and some analysis.

ROWGROUP_FLUSH Deadlocks Inserting Into Clustered Columnstore Indexes

Joe Obbish ran into a strange deadlock when performing concurrent insertions into a clustered columnstore index:

We’ve only observed this deadlock when multiple concurrent sessions insert into the delta store for the same target CCI, due to server memory pressure or very low cardinality estimates (fewer than 251 rows). The correct mitigation depends on why you’re seeing the issue in the first place. If you’re seeing it due to low cardinality estimates, then fix your estimates, or at the very least get them above 250 rows. If you’re seeing it because the memory grant for the CCI build times out after 25 seconds, then lower concurrency or increase server memory.

The problem can also be avoided by not doing concurrent inserts in the first place. In some cases a parallel insert may be a reasonable alternative. There’s also some evidence that the deadlock is only seen when the number of rows for insert is very close to 1048576, but we weren’t able to make any definitive conclusions around that.
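
One way to check whether concurrent inserts are landing in the delta store at all (the precondition for this deadlock) is to look at the rowgroup physical stats; a quick sketch with a placeholder table name:

  -- OPEN rowgroups are delta store rowgroups; COMPRESSED ones bypassed it.
  SELECT state_desc, COUNT(*) AS row_group_count, SUM(total_rows) AS total_rows
  FROM sys.dm_db_column_store_row_group_physical_stats
  WHERE object_id = OBJECT_ID(N'dbo.MyCCI')  -- placeholder table name
  GROUP BY state_desc;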

Read the whole thing.  Also check out his Connect item.

Making CXPACKET The Bad Guy

Pedro Lopes goes into detail regarding the new CXCONSUMER wait stat:

It’s important to mention a couple of notes before getting into details:

  1. Parallelism use is intrinsically a benefit for costly queries, bound by the Cost Threshold for Parallelism and Max Degree of Parallelism server configurations, among other considerations. See the Query Processing Guide for more in-depth information on how SQL Server implements parallelism. Also refer to Craig Freedman’s blog series on the topic.
  2. Waits are a normal part of the waits and queues model, allowing SQL Server to concurrently execute many more requests than there are schedulers available. See more about waits and queues here.

It’s also important to understand that parallelism is implemented as if it were two operators. There are producers that push data to consumers, and consumers that may have to wait for data from producers.

And so it becomes easier to understand how producer waits are the ones that may require attention, while consumer waits are inevitable as a passive consequence of longer running producers.

I actually would have preferred the rename go the other way and create a CXPRODUCER wait, which is bad, rather than a CXCONSUMER wait, which is benign.  The problem is that if you’re running in a heterogeneous environment, you have to remember that CXPACKET on Server A might indicate a problem, whereas it’s much fuzzier on Server B because you haven’t upgraded Server B to the latest CU yet.  Regardless, I’m happy that we’re going to be able to have a better idea of when parallelism is actually a problem.
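
On an instance with the new wait type, you can see how the old catch-all splits into producer and consumer sides with a quick check of the wait stats DMV; a minimal sketch:

  -- CXPACKET now reflects the (potentially actionable) producer-side waits;
  -- CXCONSUMER reflects the benign consumer-side waits.
  SELECT wait_type, waiting_tasks_count, wait_time_ms
  FROM sys.dm_os_wait_stats
  WHERE wait_type IN (N'CXPACKET', N'CXCONSUMER')
  ORDER BY wait_time_ms DESC;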

JSON Parsing Performance

Jovan Popovic has a couple of posts showing JSON parsing performance in SQL Server 2017.  First up, he makes the claim that JSON parsing can be 10x faster than XML parsing:

I’m running 10,000 iterations and getting the average time spent.

Average time to parse JSON and get the value using the JSON_VALUE() function is around 3 microseconds, while the equivalent action with an XML typed variable and the value() method takes between 30 and 40 microseconds.

This shows that parsing JSON “plain text” is 10x faster than parsing a strongly typed XML variable.
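
The shape of the comparison looks something like the following; the sample document and values are my own invention for illustration, not Jovan’s benchmark harness:

  DECLARE @json NVARCHAR(MAX) = N'{"name": "John", "age": 45}';
  DECLARE @xml XML = N'<person><name>John</name><age>45</age></person>';

  -- JSON: parse the plain-text document on each call
  SELECT JSON_VALUE(@json, '$.age');

  -- XML: navigate the typed variable with an XQuery expression
  SELECT @xml.value('(/person/age)[1]', 'int');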

He also compares JSON parsing to spatial type parsing:

I’m running 100,000 iterations and getting the average time spent in both cases. The results are:

  1. Spatial type needs around 11-14 microseconds to get the X coordinate
  2. JSON parser needs around 1-2 microseconds to get the value from X property (including CAST to float)

We can see that parsing JSON text is much faster than the equivalent operation in Spatial type.

JSON uses the NVARCHAR type, which is the most optimized type in SQL Server because it is used in the majority of queries. The JSON parser is built on the T-SQL parser, which is also highly optimized. Spatial has its own type, but it is still a binary serialized object that must be deserialized whenever we need to access its fields. In this case, scanning NVARCHAR is much faster than deserializing binary data into the spatial type.
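
The two access patterns being timed look roughly like this (coordinates invented for illustration):

  DECLARE @g GEOMETRY = geometry::STGeomFromText('POINT (28.2743 15.9962)', 0);
  DECLARE @j NVARCHAR(100) = N'{"x": 28.2743, "y": 15.9962}';

  -- Spatial: the binary object is deserialized to reach the coordinate
  SELECT @g.STX;

  -- JSON: scan the NVARCHAR text, then cast the extracted string to float
  SELECT CAST(JSON_VALUE(@j, '$.x') AS FLOAT);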

That said, Jovan does note that there are benefits to using spatial types, like the ability to perform spatial calculations.

Upgrading A Cluster To Windows Server 2016

Ryan Adams shows how to upgrade a failover cluster running Windows Server 2012 R2 to Windows Server 2016 without having to start from scratch:

Starting in Windows Server 2012 R2 you now have a way to upgrade a cluster to Windows 2016.  The best part is it’s not an OS upgrade, but a rebuild.  The magic is that you can join a Windows 2016 server to a Windows 2012 R2 cluster.  You can upgrade your cluster with as little as one failover and thus very little down time.  Everything stays in compatibility mode until all nodes are upgraded to Windows 2016 and then you upgrade the cluster functional level.  This is great news for those of us running FCIs or AGs.
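
While you rotate nodes through the rebuild, SQL Server itself can confirm which nodes are in the cluster and which one currently owns the instance; a small sanity-check query, nothing more:

  -- Run from the FCI to confirm node membership and current owner mid-upgrade
  SELECT NodeName, status_description, is_current_owner
  FROM sys.dm_os_cluster_nodes;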

Click through for a listing of steps and a video.

The Magic Of Sampling

Nathan LeClaire reminds us of an important story that statisticians have been telling us for a couple of centuries:

It starts slowly. Maybe your home-grown centralized logging cluster becomes more difficult to operate, demanding unholy amounts of engineer time every week. Maybe engineers start to find that making a query about production is a “go get a coffee and come back later” activity. Or maybe monitoring vendors offer you a quote that elicits a response ranging anywhere from curses under the breath to blood-curdling screams of terror.

The multi-headed beast we know as Scale has reared its ugly visage.

As some of you may have already guessed from the title, I’m going to discuss one way to solve this problem, and why it might not be as bad as you might think.

Take some of your precious information and throw it in the garbage. In lots of cases, you can just drop those writes on the floor as long as your observability stack is equipped to handle it.

In other words, sample.

Read on for a couple of methods.  One thing I’ve taken a fancy to is collecting the first N of a particular type of message and keeping track of how often that message appears.  If you get the same error for every row in a file, then you might really only need to see it once, along with the number of times it happened.  Or maybe you want to see a few of them to ensure that they’re really the same error and not two separate errors which are getting reported together due to insufficient error separation.
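
That first-N idea translates naturally to window functions if your log messages land in a table.  A rough sketch, with an invented schema:

  -- Keep the first five examples of each distinct message, plus a count
  -- of how often that message occurred overall.
  WITH numbered AS (
      SELECT message, logged_at,
             ROW_NUMBER() OVER (PARTITION BY message ORDER BY logged_at) AS rn,
             COUNT(*) OVER (PARTITION BY message) AS occurrences
      FROM dbo.ErrorLog  -- invented table
  )
  SELECT message, logged_at, occurrences
  FROM numbered
  WHERE rn <= 5;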

Quickly Computing Area Under The Curve

Jean-Francois Puget has a fast method for computing Area Under the Curve in Python:

When the target only takes two values, we have a binary classification problem at hand.  Examples of binary classification are very common. For instance, fraud detection, where examples are credit card transactions, features are time, location, amount, merchant id, etc., and the target is fraud or not fraud.  Spam detection is also a binary classification, where examples are emails, features are the email content as a string of words, and the target is spam or not spam.  Without loss of generality we can assume that the target values are 0 and 1; for instance, 0 means no fraud or no spam, while 1 means fraud or spam.

For binary classification, predictions are also binary.  Therefore, a prediction is either equal to the target, or it is off the mark.  A simple way to evaluate model performance is accuracy: how many predictions are right? For instance, if our test set has 100 examples in it, how many times is the prediction correct?  Accuracy seems a logical way to evaluate performance: a higher accuracy obviously means a better model.  At least this is what people think when they are exposed for the first time to binary classification problems.  The issue is that accuracy can be extremely misleading.
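
To make that last point concrete (my example, not Jean-Francois’): if a test set has 990 legitimate transactions and 10 fraudulent ones, a model that always predicts “not fraud” scores 990/1000 = 99% accuracy while catching zero fraud; its AUC would be 0.5, no better than random guessing.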

Read Jean-Francois’ explanation and scroll down for the Python sample.

Spotting SQL Agent Job Differences

Rob Sewell shows us how to compare SQL Agent jobs across Availability Group replicas:

On the plane home from PASS Summit I was sat next to someone who had also attended, and when he saw on my laptop that I was part of the SQL Community, we struck up a conversation. He asked me how he could compare SQL Agent Jobs across availability group replicas to ensure that they were the same.

He already knew that he could use Copy-DbaAgentJob from dbatools to copy the jobs between replicas, and we discussed how to set up an Agent job to accomplish this. The best way to run an Agent Job with a PowerShell script is described here.
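
For a quick eyeball comparison before reaching for dbatools, something like this run against each replica (and diffed) will surface drift:

  -- Run on each replica; differences in name, enabled, or date_modified
  -- indicate jobs that are out of sync.
  SELECT name, enabled, date_modified
  FROM msdb.dbo.sysjobs
  ORDER BY name;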

Maybe I’m out in la-la land, but wouldn’t this be a perfect use case for MSX/TSX?

Offline Installation Of SQL Server 2017 ML Services

Jan Mulkens shows how to install SQL Server 2017 Machine Learning Services when the server hosting SQL Server doesn’t have outbound internet access:

That’s when you remember it… Your server isn’t connected to the internet!
Pretty normal, but in your enthusiasm you completely forgot that SQL Server needs to download some binaries for the R and Python components you so desperately want on your precious machine!

Luckily, the installer comes to your rescue and shows you where to download those binaries it needs.
Turns out, however… this link is only for one R component, and the installer won’t let you pass to the next screen!
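
For what it’s worth, once the CAB files are downloaded, my understanding is that setup can be pointed at a local folder from the command line; a sketch, assuming the /MRCACHEDIRECTORY parameter works as documented for offline installs of the ML components (paths and feature list are illustrative, not a complete install command):

  setup.exe /ACTION=Install /FEATURES=SQLEngine,ADVANCEDANALYTICS /MRCACHEDIRECTORY=C:\MLCabs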

Read on for the answer.

Actual Execution Plan Enhancements

Pedro Lopes points out some additional data available in the properties section when you generate an actual execution plan:

Looking at the actual execution plan is one of the most used performance troubleshooting techniques. Having information on elapsed CPU time and overall execution time, together with session wait information in an actual execution plan allows a DBA to use showplan to troubleshoot issues away from the server, and be able to correlate and compare different types of waits that result from query or schema changes.

A few months ago we introduced in SSMS some of the per-operator statistics, such as CPU and elapsed time per thread. More recently, we have introduced overall query CPU and elapsed time tracking for statistics showplan xml (both in ms). These can be found in the root node of an actual plan, and are available using the latest versions of SSMS v17 when used with SQL Server 2012 SP4, SQL Server 2016 SP1, or SQL Server 2017. For SQL Server 2014 it will become available in a future Service Pack.
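
To see this yourself, capture an actual plan and inspect the root node’s properties; a minimal sketch:

  -- Returns the actual plan XML alongside the results; in the plan, the
  -- root node carries QueryTimeStats (CpuTime, ElapsedTime in ms) and,
  -- on supported versions, a WaitStats element.
  SET STATISTICS XML ON;
  SELECT COUNT(*) FROM sys.objects;
  SET STATISTICS XML OFF;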

Also be sure to check out Geoff Patterson’s Connect item asking that the execution plan results show the top ten waits in descending order rather than ascending order.  That’s the appropriate ordering in my mind:  show me the most important things first.
