Default Schemas in SQL Server

Max Vernon takes us through the order in which SQL Server searches for tables given a single-part name:

Default schemas in SQL Server can be a blessing, since they reduce the need to specify the schema when creating DDL statements in T-SQL. However, relying on the default schema when creating DML statements can be problematic. A recent question on asked “Does T-SQL have a Schema search path?”, similar to PostgreSQL implements the search_pathparameter. This post shows how schemas are implemented in SQL Server. We’ll also see why it’s important to always specify the schema when using SQL Server.

A lot of this behavior goes back to the pre-2005 era. 2005 introduced schemas as logical separators, whereas pre-2005 they were more of a security measure (and dbo was the database owner’s schema). I completely agree that you should specify two-part names in-database. It’s a tiny bit faster (which adds up when you’re doing thousands of transactions per second) and reduces ambiguity.

Extended Event Filters Outlive Sessions

Dave Bland ran into an interesting problem during a demo:

Recently during a demo at a SQL Saturday the query to pull the Extended Event session data, didn’t return the expected results. The session I used for the demo was the create database statement.

Prior to the session, I deleted the Create Database session, however did not delete the target files because they are part of the demo.  Then I recreated the session, just as I had done before.  However, this time was there was a difference when I attempted to read the target data.  The entry for the newly created database was not showing up when I used the GUI, however was showing up when I read the XML.  During the session, I was not able to figure out why that was the case.

Click through to see the root cause and how Dave fixed the problem.

Offloading Code Review Burdens with Automation

Ed Elliott argues that automation and testing can make code reviews easier:

OK so if we break this down into what a DBA should be doing as part of a code review:

– Ensure formatting is correct and any standards followed
– Have they introduces a SQL injection vulnerability?
– Consider any side effects of the actual change, for instance altering a clustered key on a 1 billion row table will take time – is this possible on a live system?
– Consider any performance effects – is this more prone to tempdb spills? How about deadlocks? Is the plan going to be terrible?
– Is the code going to do what the developer wants? Do they have the update statement correct in the merge statement?

That’s a lot, how can we help developers understand enough so that they can review their own code and cause fewer issues in production?

I believe this is a bit aspirational. Nevertheless, if you do get there, life gets easier.

Tracking xp_cmdshell Executions

Jason Brimhall shows how you can see when someone calls xp_cmdshell, including the call details:

What was the wait_type? Well, the obscure wait_type was called PREEMPTIVE_OS_PIPEOPS. What causes this wait? As it turns out, this is a generic wait that is caused by SQL pipe related activities such as xp_cmdshell.

Knowing this much information however does not get us to the root cause of this particular problem for this client. How do we get there? This is another case for Extended Events (XEvents).

Read on for two ways to approach this, both using Extended Events.

An Example of p-Hacking

Vincent Granville explains why using p-values for model-worthiness can lead you to a bad outcome:

Recently, p-values have been criticized and even banned by some journals, because they are used by researchers, who cherry-pick observations and repeat experiments until they obtain a p-value worth publishing to obtain grant money, get tenure, or for political reasons.  Even the American Statistical Association wrote a long article about why to avoid p-values, and what you should do instead: see here.  For data scientists, obvious alternatives include re-sampling techniques: see here and here. One advantage is that they are model-independent, data-driven, and easy to understand. 

Here we explain how the manipulation and treachery works, using a simple simulated data set consisting of purely random, non-correlated observations. Using p-values, you can tell anything you want about the data, even the fact that the features are highly correlated, when they are not. The data set consists of 16 variables and 30 observations, generated using the RAND function in Excel. You can download the spreadsheet here.

And for a more academic treatment of the problem, I love this paper by Andrew Gelman and Eric Loken, particularly because it points out that you don’t have to have malicious intent to end up doing the wrong thing.

Predicting Intermittent Demand

Bruno Rodrigues shows one technique for forecasting intermittent data:

Now, it is clear that this will be tricky to forecast. There is no discernible pattern, no trend, no seasonality… nothing that would make it “easy” for a model to learn how to forecast such data.

This is typical intermittent demand data. Specific methods have been developed to forecast such data, the most well-known being Croston, as detailed in this paper. A function to estimate such models is available in the {tsintermittent} package, written by Nikolaos Kourentzes who also wrote another package, {nnfor}, which uses Neural Networks to forecast time series data. I am going to use both to try to forecast the intermittent demand for the {RDieHarder} package for the year 2019.

Read the whole thing. H/T R-Bloggers

TensorFrames: Spark Plus TensorFlow

Adi Polak gives us an introduction to TensorFrames:

In all TensorFrames functionality, the DataFrame is sent together with the computations graph. The DataFrame represents the distributed data, meaning in every machine there is a chunk of the data that will go through the graph operations/ transformations. This will happen in every machine with the relevant data. Tungsten binary format is the actual binary in-memory data that goes through the transformation, first to Apache Spark Java object and from there it is sent to TensorFlow Jave API for graph calculations. This all happens in the Spark Worker process, the Spark worker process can spin many tasks which mean various calculation at the same time over the in-memory data.

An interesting bit of turnabout here is that the Scala API is the underdeveloped one; normally for Spark, the Python API is the Johnny-Come-Lately version.

Linear Regression With Python In Power BI

Emanuele Meazzo builds a linear regression in Power BI using a Python visual:

As a prerequisite, of course, you’ll need to have python installed in your machine, I recommend having an external IDE like Visual Studio Code to write your Python code as the PowerBI window offers zero assistance to coding.

You can follow this article in order to configure Python Correctly for PowerBI.

Step 2 is to add a Python Visual to the page, and let the magic happen.

Click through for the step-by-step instructions, including quite a bit of Python code and a few warnings and limitations.

T-SQL Tuesday 115 Roundup

Mohammad Darab has a roundup for this month’s T-SQL Tuesday:

It was an absolute honor to host this month’s TSQL Tuesday. I decided on doing the “Dear 20 year old self” as a way for us to reflect on life. It seemed like this topic hit home with a lot of people. I enjoyed reading each one of the posts.

If you don’t find your post in this Round Up, please email me your link and I will update this post!

There were 14 responses this month; click through for the full set.

Testing Inline Scalar UDF Performance

Erik Darling whips up a performance test covering scalar UDF changes in SQL Server 2019:

This is a slightly different take on yesterday’s post, which is also a common problem I see in queries today.

Someone wrote a function to figure out if a user is trusted, or has the right permissions, and sticks it in a predicate — it could be a join or where clause.

If you do need to use scalar UDFs, SQL Server 2019 is a big step forward.


June 2019
« May