Spark + R Webinar

Kevin Feasel

2016-04-20

Hadoop, R, Spark

David Smith points out a recent webinar on combining Microsoft R Server with HDInsight:

As Mario Inchiosa and Roni Burd demonstrate in this recorded webinar, Microsoft R Server can now run within HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR (pdf) take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. And if your data grows or you just need more power, you can dynamically add nodes to the HDInsight cluster using the Azure portal.

I don’t normally link to webinars (because they tend to violate my “should be viewable in a coffee break” rule of thumb) but I have a soft spot in my heart for these technologies.  If you want to dig into more “mainstream” (off the Microsoft platform) Spark + R fun, check out SparkR.

Stopping SQL Agent Jobs

Chris Shaw shows how to stop SQL Agent jobs programmatically:

SQL Server has a number of system stored procedures that you can use to perform tasks that you might be doing in the user interface, for example… If you want to stop a job you can open SQL Server Management Studio, navigate to the job, right-click and stop the job.  Here is where the system-supplied stored procedure comes into play.  What if your busy time of the day is at 6 AM, and you want to make sure that the indexing has finished by 5:00 AM so that the system is ready to take on the day?  Do you really want to wake up at 5:00 AM just to right-click and stop the job, on the chance that it is running?

The answer to Chris’s question is no: I’d much rather not wake up at 5 AM to stop a job if it’s running.  This is why we have computers: to do that sort of thing for us.
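
If you want to script that, here is a minimal sketch (the job name is made up, and this is not Chris’s exact code): check Agent’s activity table for a running execution and stop it with the system stored procedure msdb.dbo.sp_stop_job.

DECLARE @JobName sysname = N'Nightly Index Maintenance';  -- hypothetical job name

-- sp_stop_job raises an error if the job is not currently running, so only
-- call it when the job has an active execution. A production version would
-- also filter to the current Agent session in msdb.dbo.syssessions.
IF EXISTS
(
    SELECT 1
    FROM msdb.dbo.sysjobactivity AS ja
        INNER JOIN msdb.dbo.sysjobs AS j
            ON ja.job_id = j.job_id
    WHERE j.name = @JobName
        AND ja.start_execution_date IS NOT NULL
        AND ja.stop_execution_date IS NULL
)
BEGIN
    EXEC msdb.dbo.sp_stop_job @job_name = @JobName;
END;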

Exploring Taxi Data

Kevin Feasel

2016-04-20

Hadoop, R

David Smith ties together two of my favorite technologies in R and Hadoop to analyze New York City taxi data:

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 Gb). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed R Studio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.

If you’re looking for a data set for exploration, this is certainly a good one.

Stats Terminology

Erik Darling fills in gaps on statistics terminology in his unique style:

SELECTIVITY

This tells you how special your snowflakes are. When a column is called “highly selective” that usually means values aren’t repeating all that often, if at all. Think about order numbers, identity or sequence values, GUIDs, etc.

DENSITY

This is sort of the anti-matter to selectivity. Highly dense columns aren’t very unique. They’ll return a lot of rows for a given value. Think about Zip Codes, Gender, Marital Status, etc. If you were to select all the people in 10002, a densely (there’s that word again) populated zip code in Chinatown, you’d probably wait a while, kill the query, and add another filter.
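
If you want to put numbers on those definitions, here is a rough illustration; the table, column, and statistic names are all hypothetical.

-- Selectivity and density for a column, computed by hand:
SELECT
    COUNT(DISTINCT ZipCode) * 1.0 / COUNT(*) AS Selectivity,  -- closer to 1 = more selective
    1.0 / COUNT(DISTINCT ZipCode) AS Density                  -- closer to 1 = more dense
FROM dbo.People;

-- The density vector SQL Server stores for a statistic on that column:
DBCC SHOW_STATISTICS ('dbo.People', 'IX_People_ZipCode') WITH DENSITY_VECTOR;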

Combine that with Kendra Little’s statistics FAQ for additional learning.

Error Handling In Service Broker

Colleen Morrow shows how to handle poison messages and other errors in Service Broker:

This type of situation, a message that can never be processed successfully, is known as a poison message.  The name kind of makes it sound like there’s a problem with the message itself.  And there might be.  Perhaps the message format is wrong for what the receiving code was expecting.  But maybe the problem is with the receiving code itself.  Regardless of what causes the poison message, it has to be dealt with.

SQL Server has a built-in mechanism for handling poison messages.  If a transaction that receives a message rolls back 5 times, SQL Server will disable the queue.  So that means that all processing that depends on that queue will cease.  Nice, huh?  Because of this, it behooves you to make sure you include proper error handling in your message processing code.  And how exactly you handle errors will depend on several factors:

Handling errors safely is a huge part of asynchronous programming.
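
To give a feel for what that looks like, here is a bare-bones activation procedure sketch. The queue, message handling, and logging table are all invented, and a real procedure needs more care (checking XACT_STATE(), branching on message type, and so on); the idea is simply to log the failure and end the conversation rather than let the transaction roll back five times and disable the queue.

CREATE PROCEDURE dbo.ProcessTargetQueue
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @handle UNIQUEIDENTIFIER,
            @message_type NVARCHAR(256),
            @body VARBINARY(MAX);

    WHILE 1 = 1
    BEGIN
        BEGIN TRANSACTION;

        WAITFOR
        (
            RECEIVE TOP (1)
                @handle = conversation_handle,
                @message_type = message_type_name,
                @body = message_body
            FROM dbo.TargetQueue
        ), TIMEOUT 5000;

        IF @@ROWCOUNT = 0
        BEGIN
            ROLLBACK TRANSACTION;
            BREAK;
        END;

        BEGIN TRY
            -- Real message processing of @body would go here.
            END CONVERSATION @handle;
            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            -- Hypothetical logging table. Committing here removes the message
            -- from the queue instead of rolling it back yet again; if
            -- XACT_STATE() = -1 the transaction must be rolled back instead.
            INSERT INTO dbo.BrokerErrors (ConversationHandle, ErrorMessage)
            VALUES (@handle, ERROR_MESSAGE());
            END CONVERSATION @handle;
            COMMIT TRANSACTION;
        END CATCH;
    END;
END;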

Adding Perspective Descriptions

Bill Anton shows how to add descriptions to perspectives in SSAS Tabular:

Unfortunately, in Tabular projects, there’s no direct way to add a description for perspectives – not even with BIDS Helper.

In Tabular projects, there is a (modal) popup window for managing perspectives…

The answer is not a great one, so hopefully the SSAS team picks up on this and improves the Tabular experience.

Statistics FAQ

Kendra Little has a great FAQ on statistics, from the standpoint of developers as well as administrators:

I’ve been asked a lot of questions about updating statistics in SQL Server over the years. And I’ve asked a lot of questions myself! Here’s a rundown of all the practical questions that I tend to get about how to maintain these in SQL Server.

I don’t dig into the internals of statistics and optimization in this post. If you’re interested in that, head on over and read the fahhhbulous white paper, Statistics Used by the Query Optimizer in SQL Server 2008. Then wait a couple days and chase it with its charming cousin, Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator.

I’m also not talking about statistics for memory optimized tables in this article. If you’re interested in those, head over here.

This is a great distillation of important and oft-misunderstood content.
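
For reference, these are the sorts of maintenance commands the FAQ is talking about (the table and statistic names here are made up):

UPDATE STATISTICS dbo.Orders WITH FULLSCAN;        -- every statistic on one table, full scan
UPDATE STATISTICS dbo.Orders IX_Orders_OrderDate;  -- one statistic, default sampling
EXEC sys.sp_updatestats;                           -- database-wide, only stats with modifications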

Configuring Polybase

David Benoit walks through how to configure a Polybase cluster:

When you get to the PolyBase configuration screen, you have the option to run this as a standalone instance, or to use the SQL Server as part of a scale-out group. Personally, I don’t think that anyone should ever choose the standalone instance option. You can always run a SQL Server configured for a scale-out group as a standalone instance, BUT you can’t change (at least not today) a SQL Server configured as a standalone PolyBase instance to run as part of a PolyBase cluster once you have completed the install.

This note sounds like the old argument for installing every SQL Server instance as a clustered instance: it costs little to keep the option open up front, but you can’t add it later without reinstalling.
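
The scale-out decision itself happens during setup, but as a quick post-install sanity check you can confirm PolyBase is present and set its Hadoop connectivity level from T-SQL. The connectivity value below is only an example; match it to your Hadoop distribution or Azure Blob Storage per the documentation.

SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;  -- 1 = installed

-- Example value only; a restart of SQL Server and the PolyBase services is
-- required for the change to take effect.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;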

New Columnstore Extended Events

Niko Neugebauer talks about extended events relating to columnstore indexes in SQL Server 2016:

In SQL Server 2014 we had 18 Extended Events, and with Service Pack 1 we received 1 more, for a total of 19 Extended Events for studying the Columnstore Indexes and Batch Mode processing. In SQL Server 2016 that number has been greatly increased – there are a whopping 61 Extended Events that will give us important insight into the Columnstore Indexes.

Even more important, Sunil & his team have given Columnstore its own category inside Extended Events – a category named Columnstore, which will ease the search for the basic columnstore events. Be aware, though, that not all Extended Events related to Columnstore Indexes are included in that category – even including all channels will give you 41 Extended Events, while hiding the other 20, which are sometimes not categorised at all and at other times are stored under different categories, such as Execution or Error. I believe the reason for not changing the category of the old Extended Events is quite simple – Microsoft always tries to avoid breaking existing applications.

There’s a lot here to digest, so read the whole thing.
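
If you want to see the new events for yourself, the standard Extended Events metadata DMVs will list them. This just filters on the event name, so, as Niko notes, it will miss the columnstore-related events that are named or categorized differently.

SELECT o.name, o.description
FROM sys.dm_xe_packages AS p
    INNER JOIN sys.dm_xe_objects AS o
        ON p.guid = o.package_guid
WHERE o.object_type = 'event'
    AND o.name LIKE N'%columnstore%'
ORDER BY o.name;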

Correlated Subqueries

Kevin Feasel

2016-04-19

T-SQL

Jen McCown explains correlated subqueries by way of an error message:

What the what?? I literally JUST ran a query exactly like this, but without the join. I haven’t mixed aggregate and non-aggregate columns in the query without a GROUP BY…the only aggregate is in the subquery, and it’s all by its little lonesome!

It’s funny what one little letter can do to you.
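
For the curious, here is a contrived guess at the kind of alias slip that produces this error; the tables, columns, and aliases are all invented.

-- The aggregate accidentally references the outer alias o instead of the
-- inner alias oh, so the MAX() belongs to the outer query, which then
-- demands that o.CustomerID appear in a GROUP BY.
SELECT o.CustomerID, o.OrderDate
FROM dbo.Orders AS o
    INNER JOIN dbo.Customers AS c
        ON o.CustomerID = c.CustomerID
WHERE o.OrderDate = (SELECT MAX(o.OrderDate)   -- should be MAX(oh.OrderDate)
                     FROM dbo.OrderHistory AS oh
                     WHERE oh.CustomerID = o.CustomerID);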
