What’s Wrong With CRISP-DM

Jen Stirrup explains the issues with CRISP-DM, a model for data mining:

The model no longer seems to be actively maintained: at the time of writing, the official site, CRISP-DM.org, is no longer available. Further, the framework itself has not been updated to address working with new technologies, such as Big Data.

Jen then contrasts this with Microsoft’s Team Data Science Process framework; click through for that comparison.

Using Availability Groups For Upgrades

Adrian Buckman has a fun post on upgrading to SQL Server 2017 (CTP) using Availability Groups to minimize downtime:

Don’t panic: this is still going as planned, as this is totally expected, and this is the reason why:

We are now in a situation where we have the primary server running 2017 but one (for us), or possibly more than one for you, running on 2016. It’s not possible for the 2016 server to synchronize, as its databases have not been upgraded yet; they will therefore be stuck in recovery, but we are about to fix that very soon.

This is a viable upgrade option:  we did it when upgrading from 2014 to 2016.  There are a lot of steps, but in the end, it worked fine.
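If you want to watch the upgrade’s progress, the standard Availability Group DMVs show each database’s synchronization state from the primary. Here is a minimal sketch (not from Adrian’s post):

-- Check the synchronization state of every database on every replica.
SELECT ag.name AS availability_group,
       ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.database_state_desc
FROM sys.dm_hadr_database_replica_states AS drs
    INNER JOIN sys.availability_replicas AS ar
        ON drs.replica_id = ar.replica_id
    INNER JOIN sys.availability_groups AS ag
        ON drs.group_id = ag.group_id
ORDER BY ag.name, ar.replica_server_name;

Databases on a not-yet-upgraded secondary will show NOT SYNCHRONIZING until that replica is upgraded and resumes data movement.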

Tracking Database Restorations

Erik Darling points out that figuring out when a database restoration occurs is much more difficult than you’d hope:

Astute SQL-ers may attempt to add a trigger to the restorehistory table over in msdb. It’s in the dbo schema, which might make you hopeful. We all know triggers in that pesky sys schema don’t do a darn thing.

You guessed it, restores get tracked there. So there’s, like, something inside SQL telling it when a restore happens.

Guess what, though? A trigger on that table won’t fire. Not FOR INSERT, not AFTER INSERT, and not nothin’ in between.

Read on for more things that don’t work…  Also check out the comments; I think Dave Mason has the best answer there.
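If you only need to find restores after the fact, polling msdb directly still works even though triggers on it won’t fire. A minimal sketch (this is not Dave Mason’s answer from the comments):

-- List the most recent restores recorded in msdb.
SELECT rh.destination_database_name,
       rh.restore_date,
       rh.restore_type, -- D = database, I = differential, L = log, F = file
       rh.user_name
FROM msdb.dbo.restorehistory AS rh
ORDER BY rh.restore_date DESC;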

Mann-Whitney U Test in SQL

Phil Factor continues his Statistics in SQL series with the Mann-Whitney U test:

There are several ways that you can test this, but nobody is going to argue with you if you use a Mann–Whitney U test to test whether two samples come from the same distribution. It doesn’t require that the data have any particular type of distribution. It just requires that each observation is made by a different member of the population, so that all the observations from both groups are independent of each other. It is really just a test of differences in mean rank between two populations’ pooled ranking. To test this difference, it has to be possible to compare any of the observations with any of the others and say which of the two is greater. Your objective is to disprove the assumption that the distributions of both populations are equal. Calculating a measure of the difference is simple, and was designed to be done easily by hand before computers. The probability that the observed difference occurred by chance is easily calculated for large samples, because U then approximates the normal distribution, but it is complex for small samples. Here, we have a small sample and are just interested in whether the two-tailed test is significant at the five percent level, so we dodge the bullet by using a significance lookup table for the critical value of U.

Read on for Phil’s implementation of the test.
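As a taste of what rank-based tests look like in T-SQL, here is a minimal sketch of computing the U statistic with window functions. This is not Phil’s implementation, the table and column names are hypothetical, and you would still compare the smaller U against a critical-value lookup table:

-- Hypothetical sample data: one row per observation, tagged by group.
CREATE TABLE #Samples
(
    GroupName char(1) NOT NULL, -- 'A' or 'B'
    Observation float NOT NULL
);

WITH NumberedRows AS
(
    SELECT GroupName, Observation,
           ROW_NUMBER() OVER (ORDER BY Observation) AS rn
    FROM #Samples
),
MidRanks AS
(
    -- Tied observations share the average of their row numbers (mid-ranks).
    SELECT GroupName,
           AVG(rn * 1.0) OVER (PARTITION BY Observation) AS rk
    FROM NumberedRows
),
RankSums AS
(
    SELECT GroupName, SUM(rk) AS rank_sum, COUNT(*) AS n
    FROM MidRanks
    GROUP BY GroupName
)
-- U = R - n(n+1)/2 for each group; the smaller U is the test statistic.
SELECT GroupName,
       rank_sum - (n * (n + 1.0)) / 2.0 AS U
FROM RankSums;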

Comparing Memory-Optimized Versus On-Disk Performance

Erin Stellato has a performance comparison between disk-based and memory-optimized tables:

I developed the following test cases:

  1. A disk-based table with traditional stored procedures for DML.
  2. An In-Memory table with traditional stored procedures for DML.
  3. An In-Memory table with natively compiled procedures for DML.

I was interested in comparing performance of traditional stored procedures and natively compiled procedures, because one restriction of a natively compiled procedure is that any tables referenced must be In-Memory. While single-row, solitary modifications may be common in some systems, I often see modifications occurring within a larger stored procedure with multiple statements (SELECT and DML) accessing one or more tables. The In-Memory OLTP documentation strongly recommends using natively compiled procedures to get the most benefit in terms of performance. I wanted to understand how much it improved performance.

Read on for the results.
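For context on what case 3 looks like, here is a minimal sketch of a memory-optimized table paired with a natively compiled procedure. The names are hypothetical, and this is not Erin’s test harness:

-- Memory-optimized table (the database needs a MEMORY_OPTIMIZED_DATA filegroup).
CREATE TABLE dbo.OrdersInMem
(
    OrderID int IDENTITY(1, 1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1048576),
    CustomerID int NOT NULL,
    OrderDate datetime2 NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO

-- Natively compiled procedure: compiled to machine code; it must be
-- schema-bound and may only reference memory-optimized tables.
CREATE PROCEDURE dbo.InsertOrder
    @CustomerID int,
    @OrderDate datetime2
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    INSERT INTO dbo.OrdersInMem (CustomerID, OrderDate)
    VALUES (@CustomerID, @OrderDate);
END;
GO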

Batched Data Archival

Anders Pedersen has a script to archive data in smaller batches:

Seeing as the data had to be retrievable for any date, I could not simply delete the very old data.  These tables also had constant inserts and updates into them, so making sure the tables remained available became important, i.e., keeping the time the table was locked acceptable and leaving time for waiting transactions to finish.

The solution I came up with does this with variable-size batches.  Now, with modern versions of SQL, there are other ways to do this, but the good thing about this method is that it works regardless of the version or edition of SQL Server.  Azure SQL DB would need some modification to archive to a separate database.

Click through for the script.
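The core pattern, independent of Anders’s exact script, is a loop that moves a capped number of rows per transaction so locks stay short-lived. A minimal sketch with hypothetical table names (it assumes dbo.OrdersArchive has the same columns as dbo.Orders):

DECLARE @BatchSize int = 5000,
        @CutoffDate datetime2 = DATEADD(YEAR, -2, SYSDATETIME());

WHILE 1 = 1
BEGIN
    BEGIN TRANSACTION;

    -- Move one batch: DELETE ... OUTPUT inserts the removed rows
    -- into the archive table within the same statement.
    DELETE TOP (@BatchSize) o
    OUTPUT deleted.* INTO dbo.OrdersArchive
    FROM dbo.Orders AS o
    WHERE o.OrderDate < @CutoffDate;

    IF @@ROWCOUNT = 0
    BEGIN
        COMMIT TRANSACTION;
        BREAK;
    END;

    COMMIT TRANSACTION;

    -- A brief pause between batches lets waiting transactions through.
    WAITFOR DELAY '00:00:01';
END;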

