Thinking About The Data Lake

James Serra explains at a high level what the data lake metaphor is and how it works:

The data lake introduces a new data analysis paradigm shift:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data.  With the old way, you have to know the questions to ask.  The new way supports situations when you don’t know the questions to ask.

This solves the two biggest reasons why many EDW projects fail:

  • Too much time spent modeling when you don’t know all of the questions your data needs to answer

  • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

There are some good details here.  My addition would be to reiterate the importance of a good data governance policy.

The Biml Interrogator

Kevin Feasel



Shannon Lowder has an imposing-sounding project which does some very cool things:

After building connections, you need to build the file formats.  01_FileFormats.biml calls the interrogator class I built and tries to guess the structure of the files.  The script takes that information and then builds the Biml structure representing the layout of any csv or txt files it finds in SourceFolder. You can run this like the connections, Ctrl-click 00_GetOutput and 01_FileFormats, then right-click on the highlighted area and choose Generate SSIS Packages. You can then see the FileFormat nodes in output.biml.

After you have a biml representation of FileFormats, you can then generate the connections for the FlatFiles.  FlatFile connections have a required attribute for FileFormat. To see the code for the connections, you have to select 00_GetOutput, 01_FileFormats, and 02_Connections-FlatFiles. Then output will contain both the FileFormats and Connections to the flat files!

Read on for more details.

Renaming Default Constraints

Bill Fellows has a script to standardize default constraint names:

This week I’m dealing with synchronizing tables between environments and it seems that regardless of what tool I’m using for schema compare, it still gets hung up on the differences in default names for constraints. Rather than fight that battle, I figured it’d greatly simplify my life to systematically rename all my constraints to non default names. The naming convention I went with is DF__schema name_table name_column name. I know that my schemas/tables/columns don’t have spaces or “weird” characters in them so this works for me. Use this as your own risk and if you are using pre-2012 the CONCAT call will need to be adjusted to classic string concatenation, a.k.a. +

Click through for the script and be sure to check out Robert’s comment.  You can also revise this script to “rename” (i.e., drop and re-create) foreign key or check constraints, though those might take a while depending upon how much data you’ve got.

What’s Wrong With CRISP-DM

Jen Stirrup explains the issues with CRISP-DM, a model for data mining:

The model no longer seems to be actively maintained. At the time of writing, the official site,, is no longer being maintained. Further, the framework itself has not been updated on issues on working with new technologies, such as Big Data.

Jen then contrasts with with Microsoft’s Team Data Science Process framework; click through for that.

Using Availability Groups For Upgrades

Adrian Buckman has a fun post on upgrading to SQL Server 2017 (CTP) using Availability Groups to minimize downtime:

Don’t panic, this is still going as planned as this is totally expected and this is the reason why:

We are now in a situation where we have the Primary server running 2017 but one (for us) or possibly more than one for you running on 2016 , its not possible for the 2016 server to synchronize as its databases have not been upgraded yet, they will therefore be stuck in recovery but we are about to fix that very soon.

This is a viable upgrade option:  we did it when upgrading from 2014 to 2016.  There are a lot of steps, but in the end, it worked fine.

Tracking Database Restorations

Erik Darling points out that figuring out when a database restoration occurs is much more difficult than you’d hope:

Astute SQL-ers may attempt to add a trigger to the restorehistory table over in msdb. It’s in the dbo schema, which might make you hopeful. We all know triggers in that pesky sys schema don’t do a darn thing.

You guessed it, restores get tracked there. So there’s, like, something inside SQL telling it when a restore happens.

Guess what, though? A trigger on that table won’t fire. Not FOR INSERT, not AFTER INSERT, and not nothin’ in between.

Read on for more things that don’t work…  Also check out the comments; I think Dave Mason has the best answer there.

Mann-Whitney U Test in SQL

Phil Factor continues his Statistics in SQL series with the Mann-Whitney U test:

There are several ways that you can test this, but nobody is going to argue with you if you use a Mann–Whitney U test to test whether two samples come from the same distribution. It doesn’t require that the data has any particular type of distribution. It just requires that each observation is done by a different member of the population so that all the observations from both groups are independent of each other. It is really just a test of differences in mean-rank between two populations’ pooled ranking. To test this difference It has to be possible to compare any of the observations with any of the others and say which of the two are greater. Your objective is to disprove the assumption that The distributions of both populations are equal. Calculating a measure of the difference is simple, and was designed to be done easily by hand before computers. The probability that the observed difference occurred by chance is easily calculated for large samples because U then approximates to the normal distribution, but it is complex for small samples. Here, we have a small sample and are just interested in whether the two-tailed test is signifcant at the five percent level so we dodge the bullet by using a significance lookup table for the critical value of U.

Read on for Phil’s implementation of the test.

Comparing Memory-Optimized Versus On-Disk Performance

Erin Stellato has a performance comparison between disk-based and memory-optimized tables:

I developed the following test cases:

  1. A disk-based table with traditional stored procedures for DML.
  2. An In-Memory table with traditional stored procedures for DML.
  3. An In-Memory table with natively compiled procedures for DML.

I was interested in comparing performance of traditional stored procedures and natively compiled procedures, because one restriction of a natively compiled procedure is that any tables referenced must be In-Memory. While single-row, solitary modifications may be common in some systems, I often see modifications occurring within a larger stored procedure with multiple statements (SELECT and DML) accessing one or more tables. The In-Memory OLTP documentation strongly recommends using natively compiled procedures to get the most benefit in terms of performance. I wanted to understand how much it improved performance.

Read on for the results.

Batched Data Archival

Kevin Feasel


Data, T-SQL

Anders Pedersen has a script to archive data in smaller batches:

Seeing as the data had to be retrievable for any date, I could not simply delete the very old data.  These tables also had constant inserts and updates into them, so making sure the tables remained available became important, i.e. needed to have acceptable time that the table was being locked, with time for waiting transactions to finish.

The solution I came up with does this with variable size batches.  Now, with modern versions of SQL, there are other ways to do this, but the good thing about this method it works regardless of version of SQL, as well as edition.  Azure SQL DB would need some modification to make it work to archive to a separate database.

Click through for the script.


July 2017
« Jun