Steve Bolton is one of my favorite long-form analytics bloggers, and his ongoing goodness of fit series is a testament as to why.
Goodness-of-fit tests are also sometimes applicable to regression models, which I introduced in posts like A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression and A Rickety Stairway to SQL Server Data Mining, Algorithm 4: Logistic Regression. I won’t rehash the explanations here for the sake of brevity; suffice it to say that regressions can be differentiated from probability distributions by looking at them as line charts which point towards the predicted values of one or more variables, whereas distributions are more often represented as histograms representing the full range of a variable’s actual or potential values. I will deal with methods more applicable to regression later in this series, but in this article I’ll explain some simple methods for implementing the more difficult concept of a probability distribution.
As I found out the hard way, the difficult part with implementing these visual aids is not in representing the data in Reporting Services, but in calculating the deceptively short formulas in T-SQL. For P-P Plots, we need to compare two cumulative distribution functions (CDFs). That may be a mouthful, but one that is not particularly difficult to swallow once we understand how to calculate probability distribution functions. PDFs are easily depicted in histograms, where we can plot the probability of the occurrence of each particular value in a distribution from left to right to derive such familiar shapes as the bell curve. Since probabilities in stochastic theory always start at 0 and sum to 1, we can plot them a different way, by summing them in succession for each associated value until we reach that ceiling. Q-Q Plots are a tad more difficult because they involve comparing the inverse of the CDFs, using what is alternately known as quantile or percent point functions, but not terribly so. Apparently the raison d’etre for these operations is to distill distributions like the Gaussian down to the uniform distribution, i.e. a flat line in which all outcomes are equally likely, for easier comparison.
The most well-known extension of these somewhat forgotten stats is the Jarque-Bera Test, which only dates back to the 1970s despite being one of earliest examples of normality testing. All of these measures have fallen out of favor with statisticians to some extent, for reasons that will be apparent shortly, but one of the side effects of this is that it is a little more difficult to find variations on them that are more suited to the unique needs of the SQL Server community. One of the strengths of data mining on database servers like SQL Server is that you typically have such an enormous number of records to draw from that you can actually perform calculations on the full population, or a proportion close to it. In ordinary statistics, however, you’re often limited to making inferences based on small samples of just a few dozen or a few hundred rows, out of a much larger population that is often of unknown size; the results can still be logically valid, but often only if other preconditions are met on the data (including normality tests, which are often not performed). For that reason, I usually prefer to leverage SQL Server’s fast set-based retrieval methods to quickly calculate statistics on full populations whenever possible, especially when there are simpler versions of the mathematical formulas available for the full dataset.
Steve doesn’t post very frequently, but if you have a few hours on a lazy Friday, check him out.
Jan Mulkens has started an interesting series on data science using the Microsoft stack. His first post is an overview of the products currently available:
But on a more serious note, I’m going to be crude to Microsoft here.
A long time ago, Power BI started as an over-hyped and underwhelming experience. Everyone saw the potential this Excel stuff had but I’m guessing the experience most people had was similar to mine. That is, Power BI back then was a disappointment because of what we were expecting.
The one good thing it did have at one point was PowerPivot.
Skip forward to august 2015.
The Power BI dream had suddenly come true!
Most of the things we were expecting in the past suddenly were there, in a web service AND a desktop application.
From there, Mulkens shares a number of training materials:
Make Microsoft’s Virtual Academy your first or last stop when learning, but you should always pay it a visit!
It’s filled with incredible information broken down in some great free courses.
It seems that (at least some of) the closed edX.org courses are being placed on here, so you can follow up on them at your own pace.
Do be aware that you can’t receive certificates on Microsoft Virtual Academy.
This is an exciting time to jump into analytics. Most of the material is free, and it’s easy to get VMs to practice, so the barrier to entry is low.
Buck Woody’s back to blogging, and his focus is data science. Over the past month, he’s looked at R and Python.
In future notebook entries we’ll explore working with R, but for now, we need to install it. That really isn’t that difficult, but it does bring up something we need to deal with first. While the R environment is truly amazing, it has some limitations. It’s most glaring issue is that the data you want to work with is loaded into memory as a frame, which of course limits the amount of data you can process for a given task. It’s also not terribly suited for parallelism – many things are handled as in-line tasks. And if you use a package in your script, you have to ensure others load that script, and at the right version.
Enter Revolution Analytics – a company that changed R to include more features and capabilities to correct these issues, along with a few others. They have a great name in the industry, bright people, and great products – so Microsoft bought them. That means the “RRE” engine they created is going to start popping up in all sorts of places, like SQL Server 2016, Azure Machine Learning, and many others. But the “stand-alone” RRE products are still available, and at the current version. So that’s what we’ll install.
Python has some distinct differences that make it attractive for working in data analytics. It scales well, is fairly easy to learn and use, has an extensible framework, has support for almost every platform around, and you can use it to write extensive programs that work with almost any other system and platform.
R and Python are the two biggest languages in this slice of the field, and you’ll gain a lot from learning at least one of these languages.