
Month: September 2016

Arrays And Lists In SQL Server

Erland Sommarskog has updated his essay on Arrays and Lists in SQL Server.  He’s broken it down into a few parts.  First, the short version:

Now you know why IN (@list) does not work as you hoped for, but if you have a comma-separated list you still need to know how to work with it.

The best approach in my opinion is to reconsider having a comma-separated list at all. After all, you are in a relational database, so why not use a table instead? That is, you should pass the data in a table-valued parameter (TVP) instead of that comma-separated list. If you have never used TVPs before, I have an article, Using Table-Valued Parameters in SQL Server and .NET, where I give a tutorial on passing TVPs from .NET to SQL Server, and there is a detailed description of exactly this case of passing a comma-separated list to a TVP. You will find that it is astonishingly simple.

Unfortunately, not all environments support TVPs – Entity Framework has no real support for TVPs, and reportedly neither does Reporting Services. The same applies if you are on SQL 2005 or earlier, since TVPs were added in SQL 2008. Or you may just be plain stubborn and want to use your comma-separated list. Or you are simply pressed for time and don’t have the time to learn something new right now.
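
To make the TVP pattern concrete, here is a minimal T-SQL sketch of the idea; the type, procedure, and table names are illustrative rather than taken from Erland’s article. From .NET, you would typically bind a DataTable or IEnumerable<SqlDataRecord> to a parameter of type SqlDbType.Structured instead of building a delimited string.

CREATE TYPE dbo.IntList AS TABLE (Value int NOT NULL PRIMARY KEY);
GO

CREATE PROCEDURE dbo.GetOrdersByCustomer
    @CustomerIDs dbo.IntList READONLY   -- TVP parameters must be declared READONLY
AS
SELECT o.*
FROM   dbo.Orders AS o
JOIN   @CustomerIDs AS c
  ON   c.Value = o.CustomerID;
GO

-- Calling it from T-SQL:
DECLARE @ids dbo.IntList;
INSERT INTO @ids (Value) VALUES (1), (7), (42);
EXEC dbo.GetOrdersByCustomer @CustomerIDs = @ids;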

If you want a longer article on using table-valued parameters, Erland has one of those as well:

This is an article that is intended to get you started with passing table-valued parameters (TVPs) from .NET to SQL Server. I begin by presenting how you use table-valued parameters in SQL Server itself, whereupon I give a quick overview of the mechanisms for passing TVPs from ADO .NET to SQL Server.

The main meat of this article is two real-world examples where I use TVPs. The first example is the classical problem of passing a comma-separated list of values to SQL Server, this time through a table-valued parameter. You will be amazed at how simple it is. In the second example I show two ways to load a file with master-detail data into tables in SQL Server. In addition to the examples, there is also some discussion on how you can improve performance when loading large amounts of data.

Despite the appearance of .NET in the title of this article, there is a final chapter that explores the possibilities in other APIs, some of which support TVPs and some of which do not. This includes Entity Framework, which has no support for TVPs. In this chapter I briefly discuss workarounds when TVPs are not available to you.

And for the advanced look at arrays and lists, you have the long-form article:

A problem that has been popular over the years with SQL Server is how to handle a list of values. In the majority of cases, people have a comma-separated list, because this format is produced by commonly used tools like multi-choice controls in .NET, Reporting Services, and other places.

When I say that the problem is popular, I don’t mean only that the questions are commonplace – so are the solutions. You can find no end of blog posts and the like that present string-splitting functions, including performance tests of such functions, and there are functions that are claimed to be the fastest, and so on.

The aim of this article is two-fold: 1) to give a general discussion of how to design string-splitting functions; 2) to present and discuss each method from the angles I bring up in the general discussion. This includes performance, but not only performance.
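
For the simple case on SQL Server 2016, the built-in STRING_SPLIT function handles a delimited list without a hand-rolled splitter. A minimal sketch follows; the Orders table is hypothetical, and keep in mind that STRING_SPLIT returns a single column named value and does not guarantee ordering:

DECLARE @list nvarchar(max) = N'1,7,42';

SELECT o.*
FROM   dbo.Orders AS o
WHERE  o.CustomerID IN (SELECT TRY_CONVERT(int, s.value)
                        FROM   STRING_SPLIT(@list, N',') AS s);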

Even if you’ve read this article before, it’s worth checking again to refresh your memory and to see his changes.


Labor Day

Today is Labor Day in the United States.  Because most Curated SQL readers have the day off, I’m going to link to some longer-form and more timeless material.


Data Quality

Milind Paradkar discusses clean data:

We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google Finance captured every 1-minute bar during this period for each of the 143 stocks.

NSE’s trading session starts at 9:15 am and ends at 3:30 pm IST, thus comprising 375 minutes. For 14 trading sessions, we should have 5,250 data points for each of these stocks. We wrote a simple script in R to perform the check.

I like this post because it exposes a data quality issue people don’t tend to think about very often: when all of the data is legitimate and correctly structured, but there are gaps in the available data set. This is one of many data quality problems you’ll run into, so it’s worth having a plan in place in case you hit this scenario.
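
The original check was written in R, but the same idea translates to SQL once the bars land in a table. Here is a hedged T-SQL sketch, assuming a hypothetical dbo.MinuteBars table with Symbol and BarTime columns:

SELECT   b.Symbol,
         CAST(b.BarTime AS date) AS TradingDay,
         COUNT(*)                AS BarsCaptured,
         375 - COUNT(*)          AS MissingBars   -- 9:15 am to 3:30 pm is 375 one-minute bars
FROM     dbo.MinuteBars AS b
GROUP BY b.Symbol, CAST(b.BarTime AS date)
HAVING   COUNT(*) < 375
ORDER BY MissingBars DESC;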


LIME

William Vorhies discusses a new technical paper on Local Interpretable Model-Agnostic Explanations:

What the model actually used for classification were these: ‘posting’, ‘host’, ‘NNTP’, ‘EDU’, ‘have’, ‘there’.  These are meaningless artifacts that appear in both the training and test sets and have nothing to do with the topic except that, for example, the word “posting” (part of the email header) appears in 21.6% of the examples in the training set but only two times in the class “Christianity.”

Is this model going to generalize?  Absolutely not.

An Example from Image Processing

In this example, using Google’s Inception NN on arbitrary images, the objective was to correctly classify “tree frogs.”  The classifier was correct in about 54% of cases, but it also interpreted the image as a pool table (7%) and a balloon (5%).

Looks like an interesting paper.  Click through for a link to the paper.


The Spark Ecosystem

Frank Evans gives an overview of what the Apache Spark ecosystem looks like:

The built-in machine learning library in Spark is broken into two parts: MLlib and KeystoneML.

  • MLlib: This is the principal library for machine learning tasks. It includes both algorithms and specialized data structures. Machine learning algorithms for clustering, regression, classification, and collaborative filtering are available. Data structures such as sparse and dense matrices and vectors, as well as supervised learning structures that act like vectors but distinguish the features of the data set from its labels, are also available. This makes feeding data into a machine learning algorithm incredibly straightforward and does not require writing a bunch of code to specify how the algorithm should organize the data internally.

  • KeystoneML: Like the oil pipeline it takes its name from, KeystoneML is built to help construct machine learning pipelines. The pipelines help prepare the data for the model, build and iteratively test the model, and tune the parameters of the model to squeeze out the best performance and capability.

Whereas Hadoop’s ecosystem is large and sprawling, the Spark ecosystem tends to be more tightly constrained.  The nice part about Spark is that it plays well with the Hadoop ecosystem—you can have a cluster or architecture with Spark and Hadoop-centric technologies (Storm, Kafka, Hive, Flume, etc.) working together quite nicely.


Spark Notebook Workflows

Dave Wang, Eric Liang, and Maddie Schults introduce Notebook Workflows:

Notebooks are very helpful in building a pipeline even with compiled artifacts. Being able to visualize data and interactively experiment with transformations makes it much easier to write code in small, testable chunks. More importantly, the development of most data pipelines begins with exploration, which is the perfect use case for notebooks. As an example, Yesware regularly uses Databricks Notebooks to prototype new features for their ETL pipeline.

On the flip side, teams also run into problems as they use notebooks to take on more complex data processing tasks:

  • Logic within notebooks becomes harder to organize. Exploratory notebooks start off as simple sequences of Spark commands that run in order. However, it is common to make decisions based on the result of prior steps in a production pipeline – which is often at odds with how notebooks are written during the initial exploration.
  • Notebooks are not modular enough. Teams need the ability to retry only a subset of a data pipeline so that a failure does not require re-running the entire pipeline.

These are the common reasons that teams often re-implement notebook code for production. The re-implementation process is time-consuming, tedious, and negates the interactive properties of notebooks.

Those two reasons are why I’ve argued that you should sit down in front of a REPL and figure out what you’re doing with a particular data set.  Once you’ve got it figured out, perform the operations in a notebook for posterity and to replicate your actions later.  I’m curious to see how this gets adopted in practice.


Graphing Customer Churn

Fang Zhou and Wee Hyong Tok have released a case study on a telephone company’s customer churn:

In the case of telco customer churn, we collected a combination of the call detail record data and customer profile data from a mobile carrier, and then followed the data science process — data exploration and visualization, data pre-processing and feature engineering, model training, scoring and evaluation — in order to achieve the churn prediction. With a churn indicator in the dataset taking the value 1 when the customer has churned and 0 when the customer has not, we addressed the problem as a binary classification problem and tried various tree-based models along with methods like bagging, random forests and boosting. Because the number of churned customers is much smaller than the number of non-churned customers (making the data set quite unbalanced), SMOTE (Synthetic Minority Oversampling Technique) was applied to adjust the proportion of the majority class over the minority class in the training data set, thus further improving model performance, especially precision and recall.

All the above data science procedures could be implemented with base R. Rather than moving the data out from the database to an external machine running R, we instead run R scripts directly on SQL Server data by leveraging the in-database analytics capability provided by SQL Server R Services, taking advantage of the rich and powerful CRAN R packages plus the parallel external memory algorithms in the RevoScaleR library. In what follows, we will describe the specific R packages and algorithms that we used to implement the data science solution for predicting telco customer churn.
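
For a sense of what running R scripts directly on SQL Server data looks like, here is a minimal, hedged sketch of SQL Server R Services’ sp_execute_external_script; the table and column names are hypothetical and this is not the authors’ code:

EXEC sp_execute_external_script
     @language = N'R',
     @script = N'
         # InputDataSet arrives as an R data frame; return a data frame as OutputDataSet
         OutputDataSet <- data.frame(churn_rate = mean(InputDataSet$churned));
     ',
     @input_data_1 = N'SELECT churned FROM dbo.CustomerCallDetail'
WITH RESULT SETS ((churn_rate float));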

They have provided the relevant materials in GitHub as well.


SSIS Perfmon Counters

Lonny Niederstadt notes that you cannot see SSIS counters in Perfmon without administrative rights:

A colleague and I were hoping to review SSIS perfmon counters on a VM.  We use a logman command with a counters file to log perfmon to csv.

Opened up the csv that was captured on the VM… there were all of my typical SQL Server counters… but the following SSIS counters were missing.

\SQLServer:SSIS Service\SSIS Package Instances
\SQLServer:SSIS Pipeline\Buffer memory
\SQLServer:SSIS Pipeline\Buffers in use
\SQLServer:SSIS Pipeline\Buffers spooled
\SQLServer:SSIS Pipeline\Flat buffer memory
\SQLServer:SSIS Pipeline\Flat buffers in use
\SQLServer:SSIS Pipeline\Private buffer memory
\SQLServer:SSIS Pipeline\Private buffers in use
\SQLServer:SSIS Pipeline\Rows read
\SQLServer:SSIS Pipeline\Rows written

Huh.

The more you know.


Data Migration Assistant

Kenneth Fisher reviews the new Data Migration Assistant:

First things first: the DMA is a replacement for the Upgrade Advisor. In fact, it’s an upgrade of the Upgrade Advisor. It has some amazing new features.

  • You can install this on a workstation. It doesn’t have to be installed on the server itself.

  • You can have multiple projects saved with data from multiple server/instances.

  • There is the option to get compatibility issues and/or new feature recommendations.

  • You can check issues/feature recommendations for upgrades to 2012, 2014 or 2016.

Looks like it’s a step up from the old Upgrade Advisor.


Reporting From Linked Servers

Dave Mason has a few scripts he uses to pull database metrics from his SQL Server instances:

There is a total of “N” linked servers. The query selects from each one and combines the results via UNION ALL. Here’s a sample of the output on my CMS:

What we’ve seen so far could be used as a template for other queries, such as a backup report, a database files report, etc. The code could be put in a stored procedure and run as needed for automated tasks. I’ve used similar code to retrieve data for an SSRS report.

Let’s examine a problem you’re likely to encounter by including @@SERVERNAME within the query.
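
The pattern being described looks something like this minimal sketch, with hypothetical linked server names. Note that @@SERVERNAME is evaluated on the server running the query rather than on the linked server, which is the sort of surprise the post digs into:

SELECT 'SQLPROD01' AS ServerName, d.name, d.recovery_model_desc
FROM   [SQLPROD01].master.sys.databases AS d
UNION ALL
SELECT 'SQLPROD02' AS ServerName, d.name, d.recovery_model_desc
FROM   [SQLPROD02].master.sys.databases AS d;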

Give it a read.
