Press "Enter" to skip to content

Curated SQL Posts

HASSP

Drew Furgiuele wants to put SQL Server into space.  I’ve linked to the entire series thus far, which has been fun to follow.  Here’s an excerpt from his latest post:

That’s from the “Hardware and Software Requirements for Installing SQL Server” product page.  I’ve had people ask if I was using a Raspberry Pi, or some other Micro ATX PC. The answer is neither; the problem with a Pi is that it’s not a 64-bit processor. Pis use ARM architecture, and SQL Server doesn’t (yet) support ARM. Also, most Pi 3’s run at 1.2Ghz and only support 1GB of RAM. As for MicroATX form factor PCs, they’re closer to what we’d need, but they’re still heavy. Plus, you’d need a pretty substantial power supply that we couldn’t (safely) support that high up in the sky. Even if you stripped it down to bare components, it’d be pushing it.

There are companies that make small SoC solutions, but after evaluating them we determined that they were either pretty flaky or got so hot they risked bursting into flames even just booting into Windows. Instead, we found a really unique piece of hardware: the Intel Joule.

Comments closed

No Curation Today

Happy 4th of July.  Because today is a day for eating hot dogs and blowing stuff up, our normally scheduled curation is on hold.  We’ll pick up once more tomorrow.

In the meantime, stand by for a couple larger links.

Comments closed

Thinking About The Data Lake

James Serra explains at a high level what the data lake metaphor is and how it works:

The data lake introduces a new data analysis paradigm shift:

OLD WAY: Structure -> Ingest -> Analyze

NEW WAY: Ingest -> Analyze -> Structure

This allows you to avoid a lot of up-front work before you are able to analyze data.  With the old way, you have to know the questions to ask.  The new way supports situations when you don’t know the questions to ask.

This solves the two biggest reasons why many EDW projects fail:

  • Too much time spent modeling when you don’t know all of the questions your data needs to answer

  • Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value

There are some good details here.  My addition would be to reiterate the importance of a good data governance policy.

Comments closed

The Biml Interrogator

Shannon Lowder has an imposing-sounding project which does some very cool things:

After building connections, you need to build the file formats.  01_FileFormats.biml calls the interrogator class I built and tries to guess the structure of the files.  The script takes that information and then builds the Biml structure representing the layout of any csv or txt files it finds in SourceFolder. You can run this like the connections, Ctrl-click 00_GetOutput and 01_FileFormats, then right-click on the highlighted area and choose Generate SSIS Packages. You can then see the FileFormat nodes in output.biml.

After you have a biml representation of FileFormats, you can then generate the connections for the FlatFiles.  FlatFile connections have a required attribute for FileFormat. To see the code for the connections, you have to select 00_GetOutput, 01_FileFormats, and 02_Connections-FlatFiles. Then output will contain both the FileFormats and Connections to the flat files!

Read on for more details.

Comments closed

Renaming Default Constraints

Bill Fellows has a script to standardize default constraint names:

This week I’m dealing with synchronizing tables between environments and it seems that regardless of what tool I’m using for schema compare, it still gets hung up on the differences in default names for constraints. Rather than fight that battle, I figured it’d greatly simplify my life to systematically rename all my constraints to non default names. The naming convention I went with is DF__schema name_table name_column name. I know that my schemas/tables/columns don’t have spaces or “weird” characters in them so this works for me. Use this as your own risk and if you are using pre-2012 the CONCAT call will need to be adjusted to classic string concatenation, a.k.a. +

Click through for the script and be sure to check out Robert’s comment.  You can also revise this script to “rename” (i.e., drop and re-create) foreign key or check constraints, though those might take a while depending upon how much data you’ve got.

Comments closed

What’s Wrong With CRISP-DM

Jen Stirrup explains the issues with CRISP-DM, a model for data mining:

The model no longer seems to be actively maintained. At the time of writing, the official site, CRISP-DM.org, is no longer being maintained. Further, the framework itself has not been updated on issues on working with new technologies, such as Big Data.

Jen then contrasts with with Microsoft’s Team Data Science Process framework; click through for that.

Comments closed

Using Availability Groups For Upgrades

Adrian Buckman has a fun post on upgrading to SQL Server 2017 (CTP) using Availability Groups to minimize downtime:

Don’t panic, this is still going as planned as this is totally expected and this is the reason why:

We are now in a situation where we have the Primary server running 2017 but one (for us) or possibly more than one for you running on 2016 , its not possible for the 2016 server to synchronize as its databases have not been upgraded yet, they will therefore be stuck in recovery but we are about to fix that very soon.

This is a viable upgrade option:  we did it when upgrading from 2014 to 2016.  There are a lot of steps, but in the end, it worked fine.

Comments closed

Tracking Database Restorations

Erik Darling points out that figuring out when a database restoration occurs is much more difficult than you’d hope:

Astute SQL-ers may attempt to add a trigger to the restorehistory table over in msdb. It’s in the dbo schema, which might make you hopeful. We all know triggers in that pesky sys schema don’t do a darn thing.

You guessed it, restores get tracked there. So there’s, like, something inside SQL telling it when a restore happens.

Guess what, though? A trigger on that table won’t fire. Not FOR INSERT, not AFTER INSERT, and not nothin’ in between.

Read on for more things that don’t work…  Also check out the comments; I think Dave Mason has the best answer there.

Comments closed

Mann-Whitney U Test in SQL

Phil Factor continues his Statistics in SQL series with the Mann-Whitney U test:

There are several ways that you can test this, but nobody is going to argue with you if you use a Mann–Whitney U test to test whether two samples come from the same distribution. It doesn’t require that the data has any particular type of distribution. It just requires that each observation is done by a different member of the population so that all the observations from both groups are independent of each other. It is really just a test of differences in mean-rank between two populations’ pooled ranking. To test this difference It has to be possible to compare any of the observations with any of the others and say which of the two are greater. Your objective is to disprove the assumption that The distributions of both populations are equal. Calculating a measure of the difference is simple, and was designed to be done easily by hand before computers. The probability that the observed difference occurred by chance is easily calculated for large samples because U then approximates to the normal distribution, but it is complex for small samples. Here, we have a small sample and are just interested in whether the two-tailed test is signifcant at the five percent level so we dodge the bullet by using a significance lookup table for the critical value of U.

Read on for Phil’s implementation of the test.

Comments closed

Comparing Memory-Optimized Versus On-Disk Performance

Erin Stellato has a performance comparison between disk-based and memory-optimized tables:

I developed the following test cases:

  1. A disk-based table with traditional stored procedures for DML.
  2. An In-Memory table with traditional stored procedures for DML.
  3. An In-Memory table with natively compiled procedures for DML.

I was interested in comparing performance of traditional stored procedures and natively compiled procedures, because one restriction of a natively compiled procedure is that any tables referenced must be In-Memory. While single-row, solitary modifications may be common in some systems, I often see modifications occurring within a larger stored procedure with multiple statements (SELECT and DML) accessing one or more tables. The In-Memory OLTP documentation strongly recommends using natively compiled procedures to get the most benefit in terms of performance. I wanted to understand how much it improved performance.

Read on for the results.

Comments closed