Using Spark MLlib For Categorization

Taras Matyashovskyy uses Apache Spark MLlib to categorize songs in different genres:

The roadmap for implementation was pretty straightforward:

  • Collect the raw data set of the lyrics (~65k sentences in total):

    • Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
    • Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
  • Create training set, i.e. label (0 for metal | 1 for pop) + features (represented as double vectors)

  • Train logistic regression that is the obvious selection for the classification

This is a supervised learning problem, and is pretty fun to walk through.

SQLite With Powershell

Phil Factor combines SQLLite, Powershell, and SQL Server:

 Although I dearly love using SQL Server, I wouldn’t use it in every circumstance; there are times, for example, when just isn’t necessary to use a Server-based RDBMS for a data-driven application. The open-source SQLite is arguably the most popular and well-tried-and-tested database ever. It is probably in your phone, and used by your browser. Your iTunes will use it. Most single-user applications that need to handle data will use SQLite because it is so reliable and easy to install.

It is specifically designed as a zero-configuration, embedded, relational database with full ACID compliance, and a good simple dialect of SQL92. The SQLite library accesses its storage files directly, using a single library, written in C, which contains the entire database system. Creating a SQLite database instance is as easy as opening a simple cross-platform file that contains the entire database instance. It requires no administration.

There’s a lot going on in this interesting article; I recommend giving it a read.

Spark Clusters On Spot Pricing

Sameer Farooqui explains spot pricing with respect to AWS servers:

The idea behind Spot instances is to allow you to bid on spare Amazon EC2 compute capacity. You choose the max price you’re willing to pay per EC2 instance hour. If your bid meets or exceeds the Spot market price, you win the Spot instances. However, unlike traditional bidding, when your Spot instances start running, you pay the live Spot market price (not your bid amount). Spot prices fluctuate based on the supply and demand of available EC2 compute capacity and are specific to different regions and availability zones.

So, although you may have bid 0.55 cents per hour for a r3.2xlarge instance, you’ll end up paying only 0.10 cents an hour if that’s what the going rate is for the region and availability zone.

Databricks uses spot pricing for Community Edition clusters to control costs.  Click through for a very interesting discussion of spot pricing and how they take advantage of it.

Maximum Temperatures With Spark Languages

Kevin Feasel



Praveen Sripati has a two-part series on getting aggregates by year in various Spark languages.  In part one, he looks at Python:

Hadoop – The Definitive Guide revolves around the example of finding the maximum temperature for a particular year from the weather data set. The code for the same is here and the data here. Below is the Spark code implemented in Python for the same.

In part 2, he looks at Spark SQL:

In the previous blog, we looked at how find out the maximum temperature of each year from the weather dataset. Below is the code for the same using Spark SQL which is a layer on top of Spark. SQL on Spark was supported using Shark which is being replaced by Spark SQL.Here is a nice blog from DataBricks on the future of SQL on Spark.

There’s no Scala example here, but it’s pretty straightforward as well.

Spark And .NET

Kevin Feasel



Bharath Venkatesh shows how to make Spark calls using the .NET ODBC driver:


Before you begin, you must have the following:

Check it out.  Using Spark on .NET is pretty easy.

Price Optimization Using Decision Trees

Bernard Antwi Adabankah uses a decision tree to model price changes:

The sample included N = 262 individual orders for Interlocking Hearts Design Cake Knife/Server set with OrderItemSKU as 2401 from the period ranging from 1st March 2014 to 20th April 2016 with an ecommerce company which sells on

The Profit response variable is measured as the product sale price on which includes commission and any applicable postage costs less the purchase price of the Hearts Design Cake Knife/Server set from the supplier.

Read the comments for a couple good critiques of the article.

Capturing SSAS Query Activity

Bill Anton explains why and how he captures query activity by user in SSAS:

In most environments, it is trivial to obtain the name of the user who ran each query… all you have to do was capture the [QueryEnd] event in a profiler/xevent trace and pull the information from the [NTUserName] field. However, in environments involving Power BI and the Enterprise On-Premise Data Gateway, there’s a bit more to it.

The main issue is how authentication is handled in this type of architecture. When working with Power BI reports connected to an on-premise data source via the On-Premise Data Gateway, the account of the user running the report is passed as the “EffectiveUsername”. The implication here is that the value shown in the [NTUserName] field of the xevent/profiler trace is going to be the Data Gateway account – NOT the account of the user who actually generated the activity.

Read on for the full answer.

Check Those Estimates

Grant Fritchey runs into a statistics issue:

While the number of rows for 1048 was the lowest, at 3, unfortunately it seems that the 1048 values were added to the table after the statistics for the index had been updated. Instead of using something from the histogram, my value fell outside the values in the histogram. When the value is outside histogram the Cardinality Estimator uses the average value across the entire histogram, 258.181 (at least for any database that’s in SQL Server 2014 or greater and not running in a compatibility mode), as the row estimate.

Figuring out those boundaries can make the difference between a good plan and a bad plan.

Developing In The Cloud

Richie Rump has some nice pointers about developing for Azure or AWS:

Since we’re a bunch of data freaks, we wanted to make sure that our data and files are properly backed up. I set out to create a script that will backup DynamoDB to a file and copy the data in S3 to Azure. The reasoning for saving our backups into a different cloud provider is pretty straightforward. First, we wanted to keep the data in a separate cloud account from the application. We didn’t make the same mistakes that Code Spaces did. Secondly, I wanted to kick the tires of Azure a bit. Heck, why not?

I figure this script would take me a day to write and a morning to deploy. In the end it took four days to write and deploy. So here are some lessons that I learned the hard way from trying to bang out this backup code.

This is a must-read if you’re starting to look at using cloud providers for services.

Optimizing Large Documents For Space

Raul Gonzalez drops a 2 TB table’s size in half:

So at work, I’d say space matters, and in order to optimize our storage requirements it’s very important to know about SQL Server internals, specially the Storage Engine, which happens to be one of my favorite topics of study.

In my quest to release some space I got to this database, just one table which is 165M of XML documents stored as NVARCHAR(MAX).

It was interesting walking through the process.  Some part of me wonders if it’s a bit complex for the next maintainer to handle, but saving a terabyte of disk space is worth a few extra pages of documentation…


November 2016
« Oct Dec »