Press "Enter" to skip to content

Curated SQL Posts

Biml Global Directives

Ben Weissman discusses a new directive in Biml:

One of the awesome new features in Biml is a new directive called “global”! It does exactly what you would expect it to do: it allows you to add code to all or some of your Biml files at once.

Here is an example: This file with only 2 lines will make VB your default language across your entire solution!

I’d prefer an F# global directive, myself…  But this looks like a very useful addition to Biml.


Polybase In Azure SQL Data Warehouse

Simon Whiteley loves Polybase as much as I do:

“Polybase is by far the fastest way to import/export data from SQLDW. If you’re loading any reasonably sized data volume, you should be using Polybase”

That’s not a quote – I just thought it would be more impactful looking like one!

For those of a traditional “Big Data” background, Polybase is essentially the same as an external Hive table, embedded into a traditional relational database.

For everyone else – Polybase is a special kind of SQL table. It looks like a table, it has columns and you can run normal T-SQL on it. However, instead of holding any data locally, your query is converted into map-reduce jobs against flat files – even those in sub-folders. For me, this is mind-bogglingly cool – you can query thousands of files at once by typing “select * from dbo.myexternaltable”.
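
To make the idea concrete, here is a minimal T-SQL sketch of standing up and querying an external table in Azure SQL Data Warehouse (my example, not Simon’s; the storage account, credential, and paths are hypothetical placeholders):

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential  -- assumes a database-scoped credential already exists
);

CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- The table points at a folder, not a file: every file under /sales/
-- (including sub-folders) is read when the table is queried.
CREATE EXTERNAL TABLE dbo.MyExternalTable (
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = PipeDelimitedText
);

SELECT * FROM dbo.MyExternalTable;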

Simon also covers limitations in Polybase:

Push-down predicates

This one is a biggie – if you’re querying over a whole range of flat files that are organised into [YEAR]/[MONTH] folders, for example, you should be able to write a query like the following:

SELECT * FROM dbo.MyExternalTable WHERE [YEAR] > 2016

This filter would ideally be pushed down to the Polybase engine, telling it to ignore any files that have been vertically partitioned outside of our chosen range. That push-down doesn’t happen, though: instead, all files are read and returned to the SQL Server engine, and the filtering is done in-memory on the returned dataset. This is obviously hugely inefficient in some cases – especially when you’re using Polybase as a data loading mechanism. This feature is available in Hive tables and you can do it in U-SQL – hopefully it’s only a matter of time before a similar feature is implemented here.

It’s an interesting look at Polybase, focusing on Azure SQL Data Warehouse.


Fun With Temp Tables

Kenneth Fisher answers a Brent Ozar pop quiz regarding temp tables:

Go ahead and give it a shot… I’ll wait.

So? What do you think? Did you get it right? I did, but I wasn’t 100% certain, nor did I initially think through all of the implications. The question actually has more depth to it than you might think on the surface. So I thought it would be fun to go through what I was thinking before I made my decision, what actually happens, and what I realized afterward.

Read on for the answers.  For bonus fun, check out Brent’s comment on the post.


Ditching IsNumeric()

Phil Factor on the IsNumeric() function:

IsNumeric() is actually answering the question ‘Can this string be converted or ‘cast’ to a numeric datatype by SQL Server?’ It is pretty good at getting this answer right but it doesn’t tell you which datatype. It also isn’t interested in such things as overflow.

This problem of IsNumeric() was solved in SQL Server 2012. There was no way of morphing IsNumeric() into something more valuable, so Try_Cast(), Try_Parse(), and Try_Convert() were introduced. The introduction of these system functions really solves the problem, unless you are still on versions earlier than SQL Server 2012. I’ll show how to do the same thing in previous versions of SQL Server, and demonstrate one or two tricks you can do with these functions.

I’ll stick with the post-2012 version, please.
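
As a quick illustration of the difference (my sketch, not Phil’s code):

-- IsNumeric() only says 'this can be cast to SOME numeric type', not which one:
SELECT ISNUMERIC('$12'), ISNUMERIC('1e4'), ISNUMERIC(',');  -- all three return 1

-- The 2012+ functions answer the question for a specific target type,
-- returning NULL instead of raising an error when the conversion fails:
SELECT TRY_CAST('1e4' AS FLOAT);            -- 10000
SELECT TRY_CAST('1e4' AS INT);              -- NULL
SELECT TRY_CONVERT(DECIMAL(10, 2), '$12');  -- NULL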


CSV Import Speeds With H2O

WenSui Liu benchmarks three CSV loading methods in R:

The importFile() function in H2O is extremely efficient due to the parallel reading. The benchmark comparison below shows that it is comparable to the read.df() in SparkR and significantly faster than the generic read.csv().

I’d wonder if there are cases where this would vary significantly; regardless, for reading a large data file, parallel processing does tend to be faster.


Linear Prediction Confidence Region Flare-Out

John Cook explains why the confidence region of a tracked object flares out instead of looking conical (or some other shape):

Suppose you’re tracking some object based on its initial position x0 and initial velocity v0. The initial position and initial velocity are estimated from normal distributions with standard deviations σx and σv. (To keep things simple, let’s assume our object is moving in only one dimension and that the distributions around initial position and velocity are independent.)

The confidence region for the object flares out over time, something like the bell of a trumpet.
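
The shape falls out of one line of algebra (a sketch of the setup Cook describes, in his notation):

x(t) = x_0 + t \, v_0, \qquad \operatorname{Var}\bigl(x(t)\bigr) = \sigma_x^2 + t^2 \sigma_v^2

A confidence band of \pm k \sqrt{\sigma_x^2 + t^2 \sigma_v^2} therefore traces a hyperbola: nearly constant width near t = 0, asymptotically linear in t further out. That is the trumpet-bell flare; a cone would require the width to be proportional to t all the way back to the start.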

Read on for the explanation.


Building Random Number Ranges

David Fowler shows how to generate a random number for each record in a result set:

Hmmmmm… It looks like we’ve got the same number for every person; that wasn’t what we wanted.  And that’s my issue with RAND(): it’ll give you a different random number every time it runs, but when run as part of a query it’ll return the same number for every row.

So what else can we do?  Well, there is something that gives us a ‘random’ value for every row in the query, and that’s our good old friend NEWID().  Let’s try the same query, but this time we’ll swap RAND() for NEWID().
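
To make that concrete, here is a minimal sketch of the NEWID() approach (my code, not David’s; dbo.Person and its column are hypothetical):

-- NEWID() is evaluated once per row, so checksumming it gives a per-row pseudo-random value.
-- ABS(...) % 10 yields 0-9; adding 1 shifts it to the 1-10 range used below.
SELECT PersonName,
       ABS(CHECKSUM(NEWID())) % 10 + 1 AS RandomBucket
FROM dbo.Person;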

One major use case for this is sampling data sets for model training and testing:  if you pull from a range of 1-10, you could perhaps train against 1-5, cross-validate against 6-7, and test against 8-10.  Doing this instead of TOP X% reduces the likelihood of sampling bias.


Conditional Job Retry

Chris Bell has a procedure which conditionally retries a failed SQL Agent job from a pre-determined step:

When the job fails and the alert message is compiled, this procedure gets called, and the job name, step name, and a delay value are passed to it. There is also a retry flag that comes back from this procedure.

The first thing this procedure does is go and find the last failed step for the particular job. It then counts existing retry jobs and, based on the @retry value, verifies whether a retry job has already been created. This is in case some other process tries to do the same thing, and should help prevent too many retries from firing off.
If a retry job does not exist, this process creates a new disposable job that will rerun the original either from the beginning or from the step that failed, based on checking for “Level 1” or “Level 2” in the job name. The job is prefixed with ‘Retry -’ so it can be found easily in your server’s job list.
If a delay is specified, 2 minutes in this example, it calculates a new run time for the retry job and finally creates the job.
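
This isn’t Chris’s procedure, but here is a rough sketch of its first step, finding the last failed step for a named job (the msdb tables and columns are the real SQL Agent catalog; the job name is a placeholder):

-- Most recent failed step for a given job; run_status = 0 means 'failed'.
SELECT TOP (1)
       j.name AS job_name,
       h.step_id,
       h.step_name
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = h.job_id
WHERE j.name = 'My Job - Level 2'  -- placeholder job name
  AND h.run_status = 0             -- 0 = failed
  AND h.step_id > 0                -- step 0 is the job-outcome row, not a real step
ORDER BY h.instance_id DESC;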

This helps make SQL Agent jobs a little more robust.


Bundling Measures Together

Philip Seamark shows how to bundle measures together in Power BI so they all appear at the top of the Fields section:

I’m going to share in this blog a technique I’ve found useful in Power BI for collecting measures together in one place AND placing them at the top of the field list.

The good news is, calculated measures do not have to exist on the table that stores the underlying data specific to that measure.  Measures can be placed on any table in the model and they will still work as expected.  This may not be immediately obvious but it’s handy to know.

So far, I’ve kept measures on their logical best-fit tables, but Philip’s hint looks quite useful once the set of measures grows, or if there are a number of cross-table measures.


Separating Data And Log Files

Brent Ozar looks at an old chestnut:

So it’s time for a quiz:

  1. If you put all of a SQL Server’s data files & logs on a single volume, how many failures will that server experience per year?
    • Bonus question: what kinds of data loss and downtime will each of those failure(s) have?
  2. If you split a SQL Server’s data files onto one volume, and log files onto another volume, how many failures will that server experience per year?
    • Bonus question: what kinds of data loss and downtime will each of those failures have?

Think carefully about the answers – or read the comments to see someone else’s homework, hahaha – before you move on.

This advice is not even that good on the performance side with SANs—especially modern ones, which don’t let you dedicate spindles.  It definitely doesn’t fly on the reliability side.
