Press "Enter" to skip to content

Curated SQL Posts

Pretty R Plots

Simon Jackson has a couple posts on how to use ggplot2 to make graphs prettier.  First, histograms:

Time to jazz it up with colour! The method I’ll present was motivated by my answer to this StackOverflow question.

We can add colour by exploiting the way that ggplot2 stacks colour for different groups. Specifically, we fill the bars with the same variable (x) but cut into multiple categories:

Then he follows up with scatter plots:

Shape and size

There are many ways to tweak the shape and size of the points. Here’s the combination I settled on for this post:

There are some nice tricks here around transparency, color scheme, and gradients, making it a great series.  As a quick note, this color scheme in the histogram headliner photo does not work at all for people with red-green color-blindness.  Using a URL color filter like Toptal’s is quite helpful in discovering these sorts of issues.

Comments closed

Designing A Data Warehouse Test Plan

Koos van Strien walks through some of the high-level concepts when automating data warehouse tests:

In my current project, I’ve got a database containing everything to perform these tests:

  • Tables with identical structure to the ones in the staging area (plus two columns “TestSuiteName” and “TestName”)
  • A table containing the mapping from test-input table to target database, schema and table
  • A stored procedure to purge the DWH (all layers) in the test environment
  • A stored procedure to insert the data for a specific testsuite / name

When preparing a specific test case (the “insert rows for test case” step from the diagram above), the rows needed for that case are copied into the DWH:

Testing warehouses is certainly not a trivial exercise but given how complex warehouse ETL tends to be, having good tests reduces the number of 3 AM pages.

Comments closed

Seeing Statistics In Execution Plans

Pedro Lopes announces that the statistics used to compile a plan are now available as part of the execution plan details:

OptimizerStatsUsage is available in cached plans, so getting the “estimated execution plan” and the “actual execution plan” will have this information.

In the above example, I see the ModificationCount is very high (almost as much as the table cardinality itself) which after closer observation, the statistic had been updated with NORECOMPUTE.

And looking and the Seek itself, there is a large skew between estimated and actual rows. In this case, I now know a good course of action is to update statistics. Doing so produces this new result: ModificationCounter is back to zero and estimations are now correct.

This will be a good addition to SQL Server 2017.

Comments closed

Joining Availability Groups

Chris Lumnah troubleshoots an error in automatic seeding of an Availability Group:

In my lab, I decided to play around with the automatic seeding functionality that is part of Availability Groups. This was sparked by my last post about putting SSISDB into an AG. I wanted to see how it would work for a regular database. When I attempted to do so, I received the following error:

Cannot alter the availability group ‘Group1’, because it does not exist or you do not have permission. (Microsoft SQL Server, Error: 15151)

Read on for the answer; it turns out automatic seeding itself was not the culprit.

Comments closed

Disabling Nested Loop Join Optimization

Dmitry Pilugin explains the differences between trace flag 2340 and the DISABLE_OPTIMIZED_NESTED_LOOP query hint:

This optimization provides a great boost with a sufficient number of rows. You can read more about its test results in the blog OPTIMIZED Nested Loops Joins, created by Craig Freedman, an optimizer developer.

However, if the actual number of rows is less than the expected one, then CPU additional costs to build this sort may hide its benefits, increase CPU consumption and reduce its performance.

Read the whole thing.  I think the likelihood of using either this hint or the trace flag is near nil, but crazy things do come up.

Comments closed

Multi-Statement Functions

Erik Darling has started looking at interleaved execution of multi-statement table-valued functions in SQL Server 2017.  First, he gives an intro:

In the first plan, the optimizer chooses the ColumnStore index over the nonclustered index that it chose in compat level 130.

This plan is back to where it was before, and I’m totally cool with that. Avoiding bad choices is just as good as making good choices.

I think. I never took an ethics class, so whatever.

In part deux, Erik compares interleaved multi-statement functions to in-line table-valued functions:

In this case, the inline table valued function wiped the floor with the MSTVF, even with Interleaved Execution.

Obviously there’s overhead dumping that many rows into a table variable prior to performing the join, but hey, if you’re dumping enough rows in a MSTVF to care about enhanced cardinality estimation…

Just like Global Thermonuclear War, I believe the best way to win mutli-statement versus inline TVFs is not to play at all.

Comments closed

Apache Solr Backup And Recovery

Hrishikesh Gadre shows how to back up indexes in Apache Solr:

The backup mechanism allows an administrator to create a physically separate copy of index files and configuration metadata for a Solr collection. Any subsequent change to a Solr collection state (e.g. removing documents, deleting index files or changing collection configuration) has no impact on the state of this backup. As part of disaster recovery, the restore operation creates a new Solr collection and initializes it to the state represented by a Solr collection backup.

It’s probably safest to treat data in Solr as secondary data, in the sense that you should be able to rebuild the entire data set from scratch instead of Solr being a primary data store.  I’m not a big fan of the author using the term “disaster recovery” instead of just “recovery” or “backup restoration” (as they’re different concepts), but it’s worth the read.

Comments closed

Getting Get-Help Help

Shane O’Neill troubleshoots a problem and explains how helpful Get-Help can be in the process:

Why does help exist?

When you think about it, why is there even a function called help?
As far as I’m aware it’s basically the same as Get-Help except it automatically pipes the output to | more so we get pages rather than a wall of text.

Is there more that we can do with Get-Help though? Is there a way that we can return the examples only? Syntax only? Parameters only?

Is there not a way that we can do such things?!

Read on to find out if there is.

Comments closed

SQL Server Updates For MacOS

Meet Bhagdev has a couple MacOS-related announcements for SQL Server.  First, the SQL Server team has released command line tools:

We are delighted to share the production-ready release of the SQL Server Command Line Tools (sqlcmd and bcp) on macOS El Capitan and Sierra.

The sqlcmd utility is a command-line tool that lets you submit T-SQL statements or batches to local and remote instances of SQL Server. The utility is extremely useful for repetitive database tasks such as batch processing or unit testing.

The bulk copy program utility (bcp) bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format. The bcp utility can be used to import large numbers of new rows into SQL Server tables or to export data out of tables into data files.

Second, there’s a new ODBC driver available:

  • Azure AD support – You can now use Azure AD authentication (username/password) to centrally manage identities of database users and as an alternative to SQL Server authentication.

  • Always Encrypted support – You can now use Always Encrypted. Always Encrypted lets you transparently encrypt the data in the application, so that SQL Server will only handle the encrypted data and not plaintext values. Even if the SQL instance or the host machine is compromised, an attacker gets ciphertext of the sensitive data.

  • Table Valued Parameters (TVP) support – TVP support allows a client application to send parameterized data to the server more efficiently by sending multiple rows to the server with a single call. You can use the ODBC Driver 13.1 to encapsulate rows of data in a client application and send the data to the server in a single parameterized command.

Multi-platform is the catchword of the day.  If you’re a MacOS user, this might be a portent of things to come.

Comments closed

Pester For Presentations

Rob Sewell takes Pester to the edge:

If you have PowerShell version 5 then you will have Pester already installed although you should update it to the latest version. If not you can get Pester from the PowerShell Gallery follow the instructions on that page to install it. This is a good post to start learning about Pester

What can you test? Everything. Well, specifically everything that you can write a PowerShell command to check. So when I am setting up for my presentation I check the following things. I add new things to my tests as I think of them or as I observe things that may break my presentations. Most recently that was ensuring that my Visual Studio Code session was running under the correct user. I did that like this

Rob’s scenario is around giving presentations, but while reading this, think about those services which should be running on your SQL Server instance—the same concept applies.

Comments closed