Data Quality

Kevin Feasel



Milind Paradkar discusses clean data:

We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google finance captured every 1-minute bar during this period for each of the 143 stocks.

NSE’s trading session starts at 9:15 am and ends at 15:30 pm IST, thus comprising of 375 minutes. For 14 trading sessions, we should have 5250 data points for each of these stocks. We wrote a simple code in R to perform the check.

I like this post because it exposes a data quality issue people don’t tend to think about very often:  when all of the data is legitimate and correctly-structured, but there are gaps in the available data set.  This is one of many data quality problems you’ll run into, so it may be important to have a plan in place in case you hit this scenario.


William Vorhies discusses a new technical paper on Local Interpretable Model-Agnostic Explanations:

What the model actually used for classification were these: ‘posting’, ‘host’, ‘NNTP’, ‘EDU’, ‘have’, ‘there’.  These are meaningless artifacts that appear in both the training and test sets and have nothing to do with the topic except that, for example, the word “posting” (part of the email header) appears in 21.6% of the examples in the training set but only two times in the class “Christianity.”

Is this model going to generalize?  Absolutely not.

An Example from Image Processing

In this example using Google’s Inception NN on arbitrary images the objective was to correctly classify “tree frogs”.  The classifier was correct in about 54% of cases but also interpreted the image as a pool table (7%) and a balloon (5%).

Looks like an interesting paper.  Click through for a link to the paper.

The Spark Ecosystem

Kevin Feasel



Frank Evans gives an overview of what the Apache Spark ecosystem looks like:

The built-in machine learning library in Spark is broken into two parts: MLlib and KeystoneML.

  • MLlib: This is the principal library for machine learning tasks. It includes both algorithms and specialized data structures. Machine learning algorithms for clustering, regression, classification, and collaborative filtering are available. Data structures such as sparse and dense matrices and vectors, as well as supervised learning structures that act like vectors but denote the features of the data set from its labels, are also available. This makes feeding data into a machine learning algorithm incredibly straightforward and does not require writing a bunch of code to denote how the algorithm should organize the data inside itself.

  • KeystoneML: Like the oil pipeline it takes its name from, KeystoneML is built to help construct machine learning pipelines. The pipelines help prepare the data for the model, build and iteratively test the model, and tune the parameters of the model to squeeze out the best performance and capability.

Whereas Hadoop’s ecosystem is large and sprawling, the Spark ecosystem tends to be more tightly constrained.  The nice part about Spark is that it plays nicely with the Hadoop ecosystem—you can have a cluster or architecture with Spark and Hadoop-centric technologies (Storm, Kafka, Hive, Flume, etc. etc.) working together quite nicely.

Spark Notebook Workflows

Dave Wang, Eric Liang, and Maddie Schults introduce Notebook Workflows:

Notebooks are very helpful in building a pipeline even with compiled artifacts. Being able to visualize data and interactively experiment with transformations makes it much easier to write code in small, testable chunks. More importantly, the development of most data pipelines begins with exploration, which is the perfect use case for notebooks. As an example, Yesware regularly uses Databricks Notebooks to prototype new features for their ETL pipeline.

On the flip side, teams also run into problems as they use notebooks to take on more complex data processing tasks:

  • Logic within notebooks becomes harder to organize. Exploratory notebooks start off as simple sequences of Spark commands that run in order. However, it is common to make decisions based on the result of prior steps in a production pipeline – which is often at odds with how notebooks are written during the initial exploration.
  • Notebooks are not modular enough. Teams need the ability to retry only a subset of a data pipeline so that a failure does not require re-running the entire pipeline.

These are the common reasons that teams often re-implement notebook code for production. The re-implementation process is time-consuming, tedious, and negates the interactive properties of notebooks.

Those two reasons are why I’ve argued that you should sit down in front of a REPL and figure out what you’re doing with a particular data set.  Once you’ve got it figured out, perform the operations in a notebook for posterity and to replicate your actions later.  I’m curious to see how this gets adopted in practice.

Graphing Customer Churn

Fang Zhou and Wee Hyong Tok have released a case study on a telephone company’s customer churn:

In the case of telco customer churn, we collected a combination of the call detail record data and customer profile data from a mobile carrier, and then followed the data science process —  data exploration and visualization, data pre-processing and feature engineering, model training, scoring and evaluation — in order to achieve the churn prediction. With a churn indicator in the dataset taking value 1 when the customer is churned and taking value 0 when the customer is non-churned, we addressed the problem as a binary classification problem and tried varioustree-based models along with methods like bagging, random forests and boosting. Because the number of churned customers is much less than that of non-churned customers (making the data set quite unbalanced), SMOTE (Synthetic Minority Oversampling Technique) was applied to adjust the proportion of majority class over minority class in the training data set, thus further improving model performance, especially precision and recall.

All the above data science procedures could be implemented with base R. Rather than moving the data out from the database to an external machine running R, we instead run R scripts directly on SQL Server data by leveraging the in-database analytics capability provided by SQL Server R Services, taking advantage of the rich and powerful CRAN R packages plus the parallel external memory algorithms in the RevoScaleR library. In what follows, we will describe the specific R packages and algorithms that we used to implement the data science solution for predicting telco customer churn.

They have provided the relevant materials in GitHub as well.

SSIS Perfmon Counters

Lonny Niederstadt notes that you cannot see SSIS counters in Perfmon without administrative rights:

A colleague and I were hoping to review SSIS perfmon counters on a VM.  We use a logman command with a counters file to log perfmon to csv.

Opened up the csv that was captured on the VM… there were all of my typical SQL Server counters… but the following SSIS counters were missing.

\SQLServer:SSIS Service\SSIS Package Instances
\SQLServer:SSIS Pipeline\Buffer memory
\SQLServer:SSIS Pipeline\Buffers in use
\SQLServer:SSIS Pipeline\Buffers spooled
\SQLServer:SSIS Pipeline\Flat buffer memory
\SQLServer:SSIS Pipeline\Flat buffers in use
\SQLServer:SSIS Pipeline\Private buffer memory
\SQLServer:SSIS Pipeline\Private buffers in use
\SQLServer:SSIS Pipeline\Rows read
\SQLServer:SSIS Pipeline\Rows written


The more you know.

Data Migration Assistant

Kenneth Fisher reviews the new Data Migration Assistant:

First things first the DMA is a replacement of the Upgrade Adviser. In fact it’s an upgrade of the Upgrade Adviser. It has some amazing new features.

  • You can install this on a workstation. It doesn’t have to be installed on the server itself.

  • You can have multiple projects saved with data from multiple server/instances.

  • There is the option to get compatibility issues and/or new feature recommendations.

  • You can check issues/feature recommendations for upgrades to 2012, 2014 or 2016.

Looks like it’s a step up from the old Upgrade Advisor.

Reporting From Linked Servers

Dave Mason has a few scripts he uses to pull database metrics from his SQL Server instances:

There is a total of “N” linked servers. The query selects from each one and combines the results via UNION ALL. Here’s a sample of the output on my CMS:

What we’ve seen so far could be used as a template for other queries, such as a backup report, a database files report, etc. The code could be put in a stored procedure and run as needed for automated tasks. I’ve used similar code to retrieve data for an SSRS report.

Let’s examine a problem you’re likely to encounter by including @@SERVERNAME within the query

Give it a read.

Naming Procedures

Aaron Bretrand shares rules that he uses when naming stored procedures:

I’ve talked about this in my live sessions, but this is an extreme case that really happened – a team took over a week to fix a bug in a stored procedure, and the delay was caused solely by poor naming standards. What happened was that the application was calling dbo.Customer_Update, but the team was hunting for the bug in a different procedure, dbo.Update_Customer. While there was no formal convention in place, the real problem was inconsistency – a consultant charged with writing a different application didn’t check for an existing procedure, she just looked for dbo.Update_Customer in the list; when she didn’t find it, she wrote her own. The bug itself wasn’t crucial, but that lost time can never be recovered.

I’ll repeat again that the convention you choose is largely irrelevant, as long as it makes sense to you and your team, and you all agree on it – and abide by it. But I am asked frequently for advice on naming conventions, and for things like tables, I’m not going to get into religious arguments about plural vs. singular, the dreaded tblprefix, or going to great lengths to avoid vowels. But I think I have a pretty sensible standard for stored procedures, and I am always happy to share my biases even though I know not everyone will agree with them. Again, I touched on this in my earlier post, but sometimes these things bear repeating and a little elaboration.

I agree with this:  pick a standard and stick to it.


Kevin Feasel



Slava Murygin has a script to turn DBCC MEMORYSTATUS output into one result set:

When I wanted to research memory problem on a server and started to dig deeper into “DBCC MEMORYSTATUS” command.Very useful links to understand that command were from Microsoft:

During the research I’ve faced two problems:
1. I had to wait several seconds to get the full result set.
2. I had to scroll down and shuffle 115 different data sets to find the counter I want.

To eliminate both these problem all these different counters have to be in the same table/data set.
That will make research easier and data can be stored or compared with the base line.

Read on for the T-SQL script.


September 2016
« Aug Oct »