The Data Science Delusion

Anand Ramanathan has a strong critique of “data science” as it stands today:

Illustration: Consider the sentiment-tagging task again. A Q1 resource uses an off-the-shelf model for movie reviews, and applies it to a new task (say, tweets about a customer service organization). Business is so blinded by spectacular charts [14] and anecdotal correlations (“Look at that spiteful tweet from a celebrity … so that’s why the sentiment is negative!”), that even questions about predictive accuracy are rarely asked until a few months down the road when the model is obviously floundering. Then too, there is rarely anyone to challenge the assumptions, biases and confidence intervals (Does the language in the tweets match the movie reviews? Do we have enough training data? Does the importance of tweets change over time?).

Overheard“Survival analysis? Never heard of it … Wait … There is an R package for that!”

This is a really interesting article and I recommend reading it.

Building A Hadoop Cluster

I have a post on building a five-node Hadoop cluster using Docker containers:

Notice how 3bd shows up for pretty much all of these services.  This is not what you’d want to do in a real production environment, but because we want to use Docker and easily pass ports through, it’s the simplest way for me to set this up.  If you knew beforehand which node would host which service, you could modify the batch script that we discussed earlier and open those specific ports.

After assigning masters, we next have to define which nodes are clients in which clusters.

Click through for a screenshot-laden walkthrough.

Learning About HCatalog

Kevin Feasel



Adam Diaz explains the basics of HCatalog:

HCatalog, also called HCat, is an interesting Apache project. It has the unique distinction of being one of the few Apache projects that were once a part of another project, became its own project, and then again returned to the original project Apache Hive.

HCat itself is described in the documentation as “a table and storage management layer” for Hadoop. In short, HCat provides an abstraction layer for accessing data in Hive from a variety of programming languages.  It exposes data stored in the Hive metastore to additional languages other than HQL. Classically, this has included Pig and MapReduce. When Spark burst onto the big data scene, it allowed access to HCat.

Given HDInsight’s predilection toward WebHCat over WebHDFS, this does seem like a good thing to learn.

Understanding Status In DMVs

Ewald Cress looks at a number of DMVs and how they expose query status:

sys.dm_exec_sessions: status

This metric is completely disjunct from the above ones, and mostly reflects attributes of a CSession class instance. The respective values are derived through the following decision tree:

  • If the internal Boolean member m_fIsConnReset is set, return dormant

  • Else if a flag living outside of CSession itself is set, return preconnect (I’ll touch on the source of this mystery flag below)

  • Else if a flag within the CSession itself is set (indicating that it has been provided some work to do) return running

  • Else return sleeping

It’s interesting to see how something so very similar can have so many different understandings.

Azure VM Auto-Shutdown

Dave Bermingham shows how to configure automatic shutdown of Azure VMs:

If you are like me, I try to make my Azure MSDN subscription credits stretch the entire month. I’m typically just building labs to try out new features or to demonstrate SQL Server Failover Clusters in Azure. A lot of the time I am testing some pretty large instance sizes with plenty of premium storage. As you can imagine, you can burn through $150 pretty quick with a few GS5 instances running.

I try to be mindful and shutdown or destroy instances once I am done with them, but occasionally I’ll get pulled away for other business, only to log in the next day and see my credit has expired because I forgot to turn off the VMs.

Click through for details, including a warning about storage.


Kevin Feasel



Andy Levy discusses the SET NOCOUNT operation:

So what happened here?  When you execute a query against SQL Server, both your data and some additional information is sent back to the client. This additional information is sent via a separate channel which is accessible via the SqlConnection.InfoMessages (or if you’re still using classic ADO, the InfoMessage) event. When you run queries in SSMS, you see this information in the Messages tab of the results pane most often as X row(s) affected.

That’s where my new stored procedures were causing problems. Where the original procedures were returning only one event which corresponded to the number of records returned by the single query in each procedure. But now that I’m loading temp tables, I’m getting multiple messages back – at a minimum, a count of the records affected when loading the temp table plus a count of the records returned to the calling application.

Data layers which can’t handle information streams are rare, but they do show up in the wild sometimes.

Scheduling VM Backups

Jens Vestergaard shows how to schedule Azure VM backups:

In this wizard we are presented with three (3) areas of configuration; First we need to decide if it’s in Azure or On-Premises. By selecting Azure, we are left with only Virtual Machine as the only option for the backup. On-Premises has more options, SQL Server, Sharepoint and Hyper-V VM’s among others. This example will be about Azure VM’s, hence we selected accordingly.

Step 2 is about the backup policy, or in other words frequency and retention. I am going with the default settings here, but options are great as you can configure retention range for weekly, monthly and yearly backups in parallel.

It’s easy and like any other backups, might save your bacon later.

Numeric Ranges In Power Query

Chris Webb solves one form of the gaps and islands problem in Power Query:

In a comment on my post on Creating Sequences of Integers And Characters In Power BI/Power Query Lists a reader, Paul G, asked me the following question:

can you reverse this? e.g i have a list (1,2,3,5,7,8,9,12,13,14,15) can i convert this to (1-3, 5 ,7-9,12-15)

This got me thinking… I was sure it could be done in M, but would it be possible using just the UI? As far as I can see, it isn’t – there’s one crucial thing I can’t do – but I would be interested to see if anyone else can come up with a no-code solution.

Check out the comments for other attempts at solving the problem with as little code as possible.

SQL Server EXE Size

Arun Sirpal points out that sqlservr.exe is a lot smaller in 2012 and up as compared  to 2008 R2:

I never really noticed the difference before, but I understand why.

From 2012 onwards the architecture changed, it has been broken up into multiple DLLs. I can see the extra DLL files within the BINN folder these being sqllang.dll and sqlmin.dll where each are roughly 30MB each.

Makes me a bit curious as to the reason behind the breakout.


November 2016
« Oct Dec »