Press "Enter" to skip to content

Curated SQL Posts

SQL Server Query Options

John Morehouse shows how to set your query options to be the same as that problematic application which is generating nasty queries:

Have you every executed a query in SQL Server Management Studio, looked at the execution plan, and noticed that it was a different plan than what was generated on the server?

A potential reason for this could be a different option settings.  The options represent the SET values of the current session.  SET options can affect how the query is execute thus having a different execution plan.   You can find these options in two places within SSMS under Tools -> Options -> Query Execution -> SQL Server -> Advanced.

Click through for lots of information including a script John provided to see which options are currently on.

Comments closed

R Is Bad For You?

Bill Vorhies lays out a controversial argument:

I have been a practicing data scientist with an emphasis on predictive modeling for about 16 years.  I know enough R to be dangerous but when I want to build a model I reach for my SAS Enterprise Miner (could just as easily be SPSS, Rapid Miner or one of the other complete platforms).

The key issue is that I can clean, prep, transform, engineer features, select features, and run 10 or more model types simultaneously in less than 60 minutes (sometimes a lot less) and get back a nice display of the most accurate and robust model along with exportable code in my selection of languages.

The reason I can do that is because these advanced platforms now all have drag-and-drop visual workspaces into which I deploy and rapidly adjust each major element of the modeling process without ever touching a line of code.

I have almost exactly the opposite thought on the matter:  that drag-and-drop development is intolerably slow; I can drag and drop and connect and click and click and click for a while, or I can write a few lines of code.  Nevertheless, I think Bill’s post is well worth reading.

Comments closed

Pretty R Plots

Simon Jackson has a couple posts on how to use ggplot2 to make graphs prettier.  First, histograms:

Time to jazz it up with colour! The method I’ll present was motivated by my answer to this StackOverflow question.

We can add colour by exploiting the way that ggplot2 stacks colour for different groups. Specifically, we fill the bars with the same variable (x) but cut into multiple categories:

Then he follows up with scatter plots:

Shape and size

There are many ways to tweak the shape and size of the points. Here’s the combination I settled on for this post:

There are some nice tricks here around transparency, color scheme, and gradients, making it a great series.  As a quick note, this color scheme in the histogram headliner photo does not work at all for people with red-green color-blindness.  Using a URL color filter like Toptal’s is quite helpful in discovering these sorts of issues.

Comments closed

Designing A Data Warehouse Test Plan

Koos van Strien walks through some of the high-level concepts when automating data warehouse tests:

In my current project, I’ve got a database containing everything to perform these tests:

  • Tables with identical structure to the ones in the staging area (plus two columns “TestSuiteName” and “TestName”)
  • A table containing the mapping from test-input table to target database, schema and table
  • A stored procedure to purge the DWH (all layers) in the test environment
  • A stored procedure to insert the data for a specific testsuite / name

When preparing a specific test case (the “insert rows for test case” step from the diagram above), the rows needed for that case are copied into the DWH:

Testing warehouses is certainly not a trivial exercise but given how complex warehouse ETL tends to be, having good tests reduces the number of 3 AM pages.

Comments closed

Seeing Statistics In Execution Plans

Pedro Lopes announces that the statistics used to compile a plan are now available as part of the execution plan details:

OptimizerStatsUsage is available in cached plans, so getting the “estimated execution plan” and the “actual execution plan” will have this information.

In the above example, I see the ModificationCount is very high (almost as much as the table cardinality itself) which after closer observation, the statistic had been updated with NORECOMPUTE.

And looking and the Seek itself, there is a large skew between estimated and actual rows. In this case, I now know a good course of action is to update statistics. Doing so produces this new result: ModificationCounter is back to zero and estimations are now correct.

This will be a good addition to SQL Server 2017.

Comments closed

Joining Availability Groups

Chris Lumnah troubleshoots an error in automatic seeding of an Availability Group:

In my lab, I decided to play around with the automatic seeding functionality that is part of Availability Groups. This was sparked by my last post about putting SSISDB into an AG. I wanted to see how it would work for a regular database. When I attempted to do so, I received the following error:

Cannot alter the availability group ‘Group1’, because it does not exist or you do not have permission. (Microsoft SQL Server, Error: 15151)

Read on for the answer; it turns out automatic seeding itself was not the culprit.

Comments closed

Disabling Nested Loop Join Optimization

Dmitry Pilugin explains the differences between trace flag 2340 and the DISABLE_OPTIMIZED_NESTED_LOOP query hint:

This optimization provides a great boost with a sufficient number of rows. You can read more about its test results in the blog OPTIMIZED Nested Loops Joins, created by Craig Freedman, an optimizer developer.

However, if the actual number of rows is less than the expected one, then CPU additional costs to build this sort may hide its benefits, increase CPU consumption and reduce its performance.

Read the whole thing.  I think the likelihood of using either this hint or the trace flag is near nil, but crazy things do come up.

Comments closed

Multi-Statement Functions

Erik Darling has started looking at interleaved execution of multi-statement table-valued functions in SQL Server 2017.  First, he gives an intro:

In the first plan, the optimizer chooses the ColumnStore index over the nonclustered index that it chose in compat level 130.

This plan is back to where it was before, and I’m totally cool with that. Avoiding bad choices is just as good as making good choices.

I think. I never took an ethics class, so whatever.

In part deux, Erik compares interleaved multi-statement functions to in-line table-valued functions:

In this case, the inline table valued function wiped the floor with the MSTVF, even with Interleaved Execution.

Obviously there’s overhead dumping that many rows into a table variable prior to performing the join, but hey, if you’re dumping enough rows in a MSTVF to care about enhanced cardinality estimation…

Just like Global Thermonuclear War, I believe the best way to win mutli-statement versus inline TVFs is not to play at all.

Comments closed

Apache Solr Backup And Recovery

Hrishikesh Gadre shows how to back up indexes in Apache Solr:

The backup mechanism allows an administrator to create a physically separate copy of index files and configuration metadata for a Solr collection. Any subsequent change to a Solr collection state (e.g. removing documents, deleting index files or changing collection configuration) has no impact on the state of this backup. As part of disaster recovery, the restore operation creates a new Solr collection and initializes it to the state represented by a Solr collection backup.

It’s probably safest to treat data in Solr as secondary data, in the sense that you should be able to rebuild the entire data set from scratch instead of Solr being a primary data store.  I’m not a big fan of the author using the term “disaster recovery” instead of just “recovery” or “backup restoration” (as they’re different concepts), but it’s worth the read.

Comments closed

Getting Get-Help Help

Shane O’Neill troubleshoots a problem and explains how helpful Get-Help can be in the process:

Why does help exist?

When you think about it, why is there even a function called help?
As far as I’m aware it’s basically the same as Get-Help except it automatically pipes the output to | more so we get pages rather than a wall of text.

Is there more that we can do with Get-Help though? Is there a way that we can return the examples only? Syntax only? Parameters only?

Is there not a way that we can do such things?!

Read on to find out if there is.

Comments closed