Press "Enter" to skip to content

Day: April 20, 2016

Spark + R Webinar

David Smith points out a recent webinar on combining Microsoft R Server with HDInsight:

As Mario Inchiosa and Roni Burd demonstrate in this recorded webinar, Microsoft R Server can now run within HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR (pdf) take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. And if your data grows or you just need more power, you can dynamically add nodes to the HDInsight cluster using the Azure portal.

I don’t normally link to webinars (because they tend to violate my “should be viewable in a coffee break” rule of thumb) but I have a soft spot in my heart for these technologies.  If you want to dig into more “mainstream” (off the Microsoft platform) Spark + R fun, check out SparkR.

Comments closed

Stopping SQL Agent Jobs

Chris Shaw shows how to stop SQL Agent jobs programmatically:

SQL Server has a number of system stored procedures that you can use to perform tasks that you might be doing in the user interface, for example… If you want to stop a job you can open SQL Server Management Studio, navigate to the job, right click and stop the job.  Here is where the system supplied stored procedure comes into play.  What if your busy time of the day is at 6 AM, and you want to make sure that the indexing has finished by 5:00 AM so that the system is ready to take on the day.  Do you really want to wake up at 5:00 AM just to right click and stop job, in the chance that it is running?

The answer to Chris’s question is no, I’d much rather not wake up at 5 AM to stop a job if it’s running.  This is why we have computers, to do that sort of thing for us.

Comments closed

Exploring Taxi Data

David Smith ties together two of my favorite technologies in R and Hadoop to analyze New York City taxi data:

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 Gb). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed R Studio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.

If you’re looking for a data set for exploration, this is certainly a good one.

Comments closed

Stats Terminology

Erik Darling fills in gaps on statistics terminology in his unique style:

SELECTIVITY

This tells you how special your snowflakes are. When a column is called “highly selective” that usually means values aren’t repeating all that often, if at all. Think about order numbers, identity or sequence values, GUIDs, etc.

DENSITY

This is sort of the anti-matter to selectivity. Highly dense columns aren’t very unique. They’ll return a lot of rows for a given value. Think about Zip Codes, Gender, Marital Status, etc. If you were to select all the people in 10002, a densely (there’s that word again) populated zip code in Chinatown, you’d probably wait a while, kill the query, and add another filter.

Combine that with Kendra Little’s statistics FAQ for additional learning.

Comments closed

Error Handling In Service Broker

Colleen Morrow shows how to handle poison messages and other errors in Service Broker:

This type of situation, a message that can never be processed successfully, is known as a poison message.  The name kind of makes it sound like there’s a problem with the message itself.  And there might be.  Perhaps the message format is wrong for what the receiving code was expecting.  But maybe the problem is with the receiving code itself.  Regardless of what causes the poison message, it has to be dealt with.

SQL Server has a built-in mechanism for handling poison messages.  If a transaction that receives a message rolls back 5 times, SQL Server will disable the queue.  So that means that all processing that depends on that queue will cease.  Nice, huh?  Because of this, it behooves you to make sure you include proper error handling in your message processing code.  And how exactly you handle errors will depend on several factors:

Handling errors safely is a huge part of asynchronous programming.

Comments closed