Month: December 2015

Goodness Of Fit

Steve Bolton is one of my favorite long-form analytics bloggers, and his ongoing goodness-of-fit series is a testament to why.

Part 1:

Goodness-of-fit tests are also sometimes applicable to regression models, which I introduced in posts like A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression and A Rickety Stairway to SQL Server Data Mining, Algorithm 4: Logistic Regression. I won’t rehash the explanations here for the sake of brevity; suffice it to say that regressions can be differentiated from probability distributions by looking at them as line charts which point towards the predicted values of one or more variables, whereas distributions are more often represented as histograms representing the full range of a variable’s actual or potential values. I will deal with methods more applicable to regression later in this series, but in this article I’ll explain some simple methods for implementing the more difficult concept of a probability distribution.

Part 2:

As I found out the hard way, the difficult part with implementing these visual aids is not in representing the data in Reporting Services, but in calculating the deceptively short formulas in T-SQL. For P-P Plots, we need to compare two cumulative distribution functions (CDFs). That may be a mouthful, but one that is not particularly difficult to swallow once we understand how to calculate probability distribution functions. PDFs[2] are easily depicted in histograms, where we can plot the probability of the occurrence of each particular value in a distribution from left to right to derive such familiar shapes as the bell curve. Since probabilities in stochastic theory always start at 0 and sum to 1, we can plot them a different way, by summing them in succession for each associated value until we reach that ceiling. Q-Q Plots are a tad more difficult because they involve comparing the inverse of the CDFs, using what is alternately known as quantile or percent point functions[3], but not terribly so. Apparently the raison d’etre for these operations is to distill distributions like the Gaussian down to the uniform distribution, i.e. a flat line in which all outcomes are equally likely, for easier comparison.[4]
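
If you want to play along before reading Steve’s T-SQL, the empirical CDF half of this is nearly a one-liner in SQL Server 2012 and later thanks to the CUME_DIST window function.  A minimal sketch, assuming a hypothetical dbo.MeasurementData table with a numeric Value column:

-- Empirical cumulative distribution function: for each value, the fraction
-- of rows with a value less than or equal to it.
SELECT
    Value,
    EmpiricalCDF = CUME_DIST() OVER (ORDER BY Value)
FROM dbo.MeasurementData
ORDER BY Value;

Comparing that column against the theoretical CDF of your candidate distribution gives you the raw material for a P-P Plot.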

Part 3:

The most well-known extension of these somewhat forgotten stats is the Jarque-Bera Test, which only dates back to the 1970s despite being one of the earliest examples of normality testing. All of these measures have fallen out of favor with statisticians to some extent, for reasons that will be apparent shortly, but one of the side effects of this is that it is a little more difficult to find variations on them that are more suited to the unique needs of the SQL Server community. One of the strengths of data mining on database servers like SQL Server is that you typically have such an enormous number of records to draw from that you can actually perform calculations on the full population, or a proportion close to it. In ordinary statistics, however, you’re often limited to making inferences based on small samples of just a few dozen or a few hundred rows, out of a much larger population that is often of unknown size; the results can still be logically valid, but often only if other preconditions are met on the data (including normality tests, which are often not performed). For that reason, I usually prefer to leverage SQL Server’s fast set-based retrieval methods to quickly calculate statistics on full populations whenever possible, especially when there are simpler versions of the mathematical formulas available for the full dataset.
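
To make the full-population point concrete, here is a minimal sketch of population skewness, kurtosis, and the Jarque-Bera statistic in T-SQL.  The table and column are hypothetical, and these are the straight population formulas (JB = (n/6) * (S^2 + (K - 3)^2 / 4)) rather than any sample-corrected variant:

DECLARE @n float, @mean float, @sigma float;

-- Population count, mean, and standard deviation in a single pass.
SELECT @n     = COUNT_BIG(*),
       @mean  = AVG(CAST(Value AS float)),
       @sigma = STDEVP(CAST(Value AS float))
FROM dbo.MeasurementData;

-- Standardized third and fourth moments, then the Jarque-Bera statistic.
SELECT Skewness   = SUM(POWER((Value - @mean) / @sigma, 3)) / @n,
       Kurtosis   = SUM(POWER((Value - @mean) / @sigma, 4)) / @n,
       JarqueBera = (@n / 6.0) *
                    ( POWER(SUM(POWER((Value - @mean) / @sigma, 3)) / @n, 2)
                    + POWER(SUM(POWER((Value - @mean) / @sigma, 4)) / @n - 3.0, 2) / 4.0 )
FROM dbo.MeasurementData;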

Steve doesn’t post very frequently, but if you have a few hours on a lazy Friday, check him out.

Dropping Unique Key Constraints

Kenneth Fisher doesn’t like the way SQL Server implements unique key constraints:

So now that we all know the basics, what could possibly have gone wrong? Well, I was handed an error.

Msg 3723, Level 16, State 5, Line 21
An explicit DROP INDEX is not allowed on index ‘TblUniqueConstraint.uni_TblUniqueConstraint’. It is being used for UNIQUE KEY constraint enforcement.

Someone had created a process several years ago that dropped and re-created indexes (I’m not going to go into why right now). Well, this particular index is used to enforce a unique constraint and so it can’t be dropped. If you want to follow along, here is some quick code to duplicate the problem.

The appropriate way to drop a unique key constraint is ALTER TABLE [TableName] DROP CONSTRAINT [ConstraintName].

I disagree with Kenneth that there’s no value in unique key constraints (I’m guessing the implicit comparison is to the CREATE UNIQUE NONCLUSTERED INDEX syntax).  There’s a semantic difference between an index which happens to be unique and a unique key constraint.  They’re implemented very similarly, but the point of the latter is to tell anybody using the data model that this set of attributes must be unique.
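
For anyone who wants to see both behaviors side by side, here is a quick sketch with made-up object names; the DROP INDEX attempt throws Msg 3723, while the ALTER TABLE form succeeds and removes the backing index along with the constraint.

CREATE TABLE dbo.TblUniqueDemo
(
    ID   int         NOT NULL,
    Code varchar(10) NOT NULL,
    CONSTRAINT uq_TblUniqueDemo UNIQUE (Code)
);

-- Fails with Msg 3723: the index exists only to enforce the unique key constraint.
DROP INDEX uq_TblUniqueDemo ON dbo.TblUniqueDemo;

-- Works: dropping the constraint drops its backing index as well.
ALTER TABLE dbo.TblUniqueDemo DROP CONSTRAINT uq_TblUniqueDemo;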

Moving The Master Database

Mike Fal shows us how to move the master database:

Last week I blogged about how you can use the Smo.ManagedComputer class to update service accounts for your SQL Services. My friend Shawn Melton (@wsmelton) decided to explore the class a little more and discovered how you can change your startup parameters using this class. Shawn’s examples are focused around adding common trace flags to your instance, which is a typical practice. However, let us look at using this to move master.

A word of warning:  don’t move the master database unless you really need to.

Migrating A VM To Azure

I have a post up on how I migrated my presentation VM to Azure:

Once sysprep was done, I needed to find a way to get the VMDK files converted to VHDs.  A blog post turned me on to StarWind Software’s V2V Converter.  It’s a free tool which allows you to convert virtual hard drive files from one format to another.  Installing this tool let me turn my set of VMDKs into one 45GB VHD.  One note is that, at least on my machine, I needed to run the V2V Converter from a command prompt; executing the app directly from the Start menu would cause the app to appear for a moment and then disappear, as though some error killed the program.  The tool installs by default in “%programfiles(x86)%\StarWind Software\StarWind V2V Image Converter\StarV2V.exe”.  From there, I just needed to get that big image into Azure.

This VM is really a Plan C or Plan D for me, but it’s good to have layers of redundancy.

Filestream Garbage Collection Bug

Andy Mallon found a bug with Filestream:

I tracked the problem back to this specific scenario: When you have a database that is part of an Availability Group, and you drop a table that contains filestream data, the filestream garbage collection does not clean up the data container subdirectory that corresponds to that table. Garbage collection will continue to clean up other items (eg, deleted rows), but the dropped table never gets cleaned up.

I’ve logged this as a connect item for your up-voting pleasure.

Go forth and upvote.
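
If you want to watch the behavior while reproducing Andy’s scenario, SQL Server 2012 and later let you trigger filestream garbage collection on demand rather than waiting for the background task.  A sketch, with a hypothetical database name:

-- Force a round of filestream garbage collection for one database.
-- Per Andy's repro, deleted rows get cleaned up, but the dropped table's
-- data container subdirectory is left behind while the database is in an Availability Group.
EXEC sp_filestream_force_garbage_collection @dbname = N'FilestreamDemo';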

In-Memory Analytics

Sunil Agarwal introduces us to In-Memory Analytics, forthcoming in SQL Server 2016:

SQL Server 2016 has significant advancements over SQL Server 2014 for In-Memory analytics. Some highlights are functionality (e.g. ability to create traditional nonclustered index to enforce PK/FK), performance (e.g. addition of new BatchMode operators, Aggregate pushdown), Online index defragmentation, and supportability (e.g. new DMVs, Perfmon counters and XEvents).

His post talks a little bit about in-memory, but focuses more on clustered columnstore indexes.  I like that columnstore indexes are getting V3 improvements, and I think they’ll be even more useful.  Whether the “in-memory” part becomes useful is a different question; I personally have seen a very limited adoption of In-Memory OLTP (and a few huge bugs for the people brave enough to try it).
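
To make the PK/FK point from the excerpt concrete, here is a sketch of the SQL Server 2016 pattern with hypothetical table and column names: a nonclustered primary key enforcing uniqueness on top of a clustered columnstore index, something SQL Server 2014 would not allow.

CREATE TABLE dbo.FactSales
(
    SaleID   bigint         NOT NULL,
    SaleDate date           NOT NULL,
    Quantity int            NOT NULL,
    Amount   decimal(18, 2) NOT NULL,
    -- New for 2016: a B-tree constraint can coexist with a clustered columnstore index.
    CONSTRAINT PK_FactSales PRIMARY KEY NONCLUSTERED (SaleID)
);

CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;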

Run DBCC CHECKDB

Andy Galbraith reminds us to check database consistency:

At the end of the day you can’t afford *not* to run CHECKDB – you just can’t.  Your database objects are only as reliable as your last clean CHECKDB (meaning your last CHECKALLOC + CHECKCATALOG + individual CHECKTABLEs on all of your tables – but more on that in a moment).

If you are not running regular CHECKDB, you may have endless unknown corruption across your databases, and you won’t find out until someone tries to access the data.

Even if your production instance is running at 100% all the time (which is a good indicator that you need more resources), you can run CHECKDB against backups restored to a different server.
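
A minimal sketch of that offloading pattern, with hypothetical file paths, logical file names, and database names:

-- Restore the latest full backup to a non-production server...
RESTORE DATABASE Sales_CheckCopy
FROM DISK = N'\\backupserver\backups\Sales_Full.bak'
WITH MOVE N'Sales_Data' TO N'D:\Data\Sales_CheckCopy.mdf',
     MOVE N'Sales_Log'  TO N'L:\Log\Sales_CheckCopy.ldf',
     REPLACE;

-- ...then run the full consistency check there instead of on the busy instance.
DBCC CHECKDB (Sales_CheckCopy) WITH NO_INFOMSGS, ALL_ERRORMSGS;

The caveat worth remembering is that a clean CHECKDB on the restored copy proves the state of the backup, not necessarily the state of the live database right now.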

Entity Framework Performance Problems

Cristian Satnic points out potential performance issues with using Entity Framework:

Entity Framework will not exactly issue SELECT * FROM commands – what it will do though is have explicit SELECT statements that include ALL columns in a particular table. If you see that most SQL queries are selecting all columns this way (especially from large tables when it appears that the UI is not using all that data) then you know that developers got a little sloppy with their code. It’s very easy in Entity Framework to bring back all columns – it takes more work and thought to build a custom LINQ projection query to select only the needed columns.

It is taking all of my strength not to say “A tell-tale sign of an Entity Framework performance problem is seeing Entity Framework in your environment.”

Satnic’s advice isn’t EF-specific, but his link to Microsoft guidance is.
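
To picture what Satnic is describing, compare the shape of the two statements you would see in a trace; the schema here is invented, but the pattern is the tell.

-- Materializing the full entity: an explicit list of every column in the table,
-- even the ones the UI never displays.
SELECT OrderID, CustomerID, OrderDate, ShipAddress, Notes, InternalComments
FROM dbo.Orders
WHERE CustomerID = @CustomerID;

-- A LINQ projection onto just the fields the page needs generates something closer to this.
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE CustomerID = @CustomerID;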

Master Data Services Bug

Kenneth Nielsen has found a bug in Master Data Services:

Something is not right here; there is some inconsistency in the way MDS behaves in regard to attribute management.

  • Changes to an attribute’s length are not supported in the web interface
  • Changes to an attribute’s length are supported in the Excel Add-in
  • Changes to an attribute’s length are limited to a max of 1000 in the Excel Add-in
  • New attributes support a length of max 4000 in the web interface
  • New attributes support a length of max 1000 in the Excel Add-in
  • It is not at all possible to change an attribute’s length once it is set to 4000

With SQL Server 2016 CTP 3.1, Nielsen found a couple of these have been corrected, but not all of them.

Update, 2016-01-04:

Kenneth notes that the bug will be fixed in CTP 3.3.  Very nice.

Loading Azure SQL Database Data

Mickey Stuewe digs into Azure SQL Database:

While I waited…and waited for my Utility data to be inserted into my Azure database, I did some poking around to see if it was even possible to restore a local SQL Server 2014 backup to an Azure database. Guess what, I found something (And there was much rejoicing).

On CodePlex, I found a SQL Database Migration Wizard that can be used to restore the database. They even had Migration Wizards for 2008R2, 2012, and 2014: SQL Database Migration Wizard v3.15.6, v4.15.6, and v5.15.6.

If you have an MSDN license, go play with this.  Even if you don’t, the lowest tier Azure SQL Database instances are free, so there’s no reason not to learn about them.
