Press "Enter" to skip to content


Installing ODBC Drivers

Steph Locke shows how to install the SQL Server ODBC drivers on Ubuntu:

Did you know you can now get SQL Server ODBC drivers for Ubuntu? Yes, no, maybe? It’s OK even if you haven’t, since it’s pretty new! Anyway, this presents me with an ideal opportunity to standardise my SQL Server ODBC connections across the operating systems I use R on, i.e. Windows and Ubuntu. My first trial was to get it working on Travis-CI, since that’s where all my training magic happens, and if it can’t work on a clean build like Travis, then where can it work?

Now I can create R functionality that can reliably depend on SQL Server without having to fall back to JDBC. A definite woohoo moment!

Thanks to Steph for putting together this script.
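For the curious, here is a rough sketch of what connecting from R looks like once the driver is installed. This is not Steph’s script; the driver name, server, and credentials below are all placeholders you would swap for your own.

```r
# A minimal connection sketch using RODBC (not Steph's script).
# The driver name must match an entry in your odbcinst.ini; check
# what the Microsoft installer registered on your machine.
library(RODBC)

ch <- odbcDriverConnect(paste0(
  "Driver={ODBC Driver 13 for SQL Server};",   # placeholder driver name
  "Server=myserver.example.com;",              # placeholder server
  "Database=TestDB;Uid=myuser;Pwd=mypassword"  # placeholder credentials
))

sqlQuery(ch, "SELECT @@VERSION AS sql_version;")
odbcClose(ch)
```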


Restoration Options

Richie Lee covers the WITH RECOVERY, WITH NORECOVERY, and WITH STANDBY database restoration options:

Similar to NORECOVERY, except that the database will accept read-only connections. To do this, any uncommitted transactions in the backup will be rolled back and stored in a transaction undo file (TUF). Whilst users are running queries against the database, no further restores can continue until all queries are complete (though this is not the case with log shipping). When the next restore occurs, those uncommitted transactions in the TUF file will be rolled forward and the next log is restored.

Diving into STANDBY mode was quite helpful.  I’ve never needed to restore a database into standby mode, but I could see it being useful for bringing back deleted records.
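As a quick refresher, the three options look like this in practice (the database and file names are made up):

```sql
-- Leave the database restoring: no connections allowed, more restores can follow.
RESTORE DATABASE Sales
    FROM DISK = N'D:\Backups\Sales_Full.bak'
    WITH NORECOVERY;

-- Allow read-only access between log restores; uncommitted work is rolled
-- back into the undo (standby) file and reapplied before the next restore.
RESTORE LOG Sales
    FROM DISK = N'D:\Backups\Sales_Log_1.trn'
    WITH STANDBY = N'D:\Backups\Sales.tuf';

-- Roll-forward is complete; bring the database fully online.
RESTORE LOG Sales
    FROM DISK = N'D:\Backups\Sales_Log_2.trn'
    WITH RECOVERY;
```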


Migrating TFS

Dave Mason has notes on migrating TFS from one server to another:

If you are migrating, but want to keep the databases on SQL 2012 Express, then you can skip this part. I wanted them moved to my SQL 2014 instance. So I did a traditional backup/restore from SQL 2012 Express to SQL 2014. I took new backups of the SQL 2014 databases, and then uninstalled SQL 2012 Express. Then I had to configure TFS to connect to a different SQL instance. Within the web.config file (%ProgramFiles%\Microsoft Team Foundation Server 12.0\Application Tier\Web Services\web.config), I found an application setting named “applicationDatabase”. I made a backup copy of web.config first, then I changed the “applicationDatabase” value. It should be in a recognizable format if you’re familiar with SQL Server connection strings. You can also make this change within IIS. It was there that I noticed a few other settings that contained SQL connection strings. Check out the following in IIS and change settings as needed:

Dave has lots of screenshots to make the process easier to understand, but my main takeaway is that, for the most part, migrating TFS is a huge pain…
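To give a feel for the change, the setting Dave describes would look something like this inside web.config. Only the “applicationDatabase” key name comes from his post; the connection string values here are placeholders, not his.

```xml
<!-- Hypothetical fragment: swap in your own instance and catalog names. -->
<appSettings>
  <add key="applicationDatabase"
       value="Data Source=NEWSERVER\SQL2014;Initial Catalog=Tfs_Configuration;Integrated Security=True" />
</appSettings>
```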


Spark + R Webinar

David Smith points out a recent webinar on combining Microsoft R Server with HDInsight:

As Mario Inchiosa and Roni Burd demonstrate in this recorded webinar, Microsoft R Server can now run within HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR (pdf) take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. And if your data grows or you just need more power, you can dynamically add nodes to the HDInsight cluster using the Azure portal.

I don’t normally link to webinars (because they tend to violate my “should be viewable in a coffee break” rule of thumb), but I have a soft spot in my heart for these technologies.  If you want to dig into more “mainstream” (off the Microsoft platform) Spark + R fun, check out SparkR.


Stopping SQL Agent Jobs

Chris Shaw shows how to stop SQL Agent jobs programmatically:

SQL Server has a number of system stored procedures that you can use to perform tasks that you might be doing in the user interface, for example… If you want to stop a job, you can open SQL Server Management Studio, navigate to the job, right-click, and stop the job.  Here is where the system-supplied stored procedure comes into play.  What if your busy time of the day is at 6 AM, and you want to make sure that the indexing has finished by 5:00 AM so that the system is ready to take on the day?  Do you really want to wake up at 5:00 AM just to right-click and stop the job, on the chance that it is running?

The answer to Chris’s question is no, I’d much rather not wake up at 5 AM to stop a job if it’s running.  This is why we have computers: to do that sort of thing for us.
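As a sketch of the idea, you could schedule something like this shortly before the busy window, using msdb’s sp_stop_job (the job name is a placeholder):

```sql
-- Stop the job at 5:00 AM, but only if it is still executing.
DECLARE @job_name sysname = N'Nightly Index Maintenance';  -- placeholder name

IF EXISTS (
    SELECT 1
    FROM msdb.dbo.sysjobactivity AS ja
        INNER JOIN msdb.dbo.sysjobs AS j
            ON j.job_id = ja.job_id
    WHERE j.name = @job_name
        AND ja.start_execution_date IS NOT NULL
        AND ja.stop_execution_date IS NULL  -- started but not yet finished
)
BEGIN
    EXEC msdb.dbo.sp_stop_job @job_name = @job_name;
END;
```

Wrap that in its own Agent job scheduled for 5:00 AM and you can stay asleep.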


Exploring Taxi Data

David Smith ties together two of my favorite technologies, R and Hadoop, to analyze New York City taxi data:

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 GB). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed RStudio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see the figure in the linked post). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.

If you’re looking for a data set for exploration, this is certainly a good one.
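If you want to play along at home, a generic sketch of the two model types they describe might look like the following in Microsoft R Server. To be clear, this is not their code: the column names are guesses based on the public NYC taxi schema, and taxi_train/taxi_test stand in for their 75/25 split.

```r
# Not the authors' code -- a generic ScaleR sketch of the two models described.
library(RevoScaleR)

# Regression problem: predict the tip amount.
lm_model <- rxLinMod(tip_amount ~ fare_amount + trip_distance + passenger_count,
                     data = taxi_train)

# Binary classification problem: was any tip paid?
bt_model <- rxBTrees(tipped ~ fare_amount + trip_distance + passenger_count,
                     data = taxi_train, lossFunction = "bernoulli")

# Score the held-out 25%.
predictions <- rxPredict(bt_model, data = taxi_test)
```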


Stats Terminology

Erik Darling fills in gaps on statistics terminology in his unique style:

SELECTIVITY

This tells you how special your snowflakes are. When a column is called “highly selective” that usually means values aren’t repeating all that often, if at all. Think about order numbers, identity or sequence values, GUIDs, etc.

DENSITY

This is sort of the anti-matter to selectivity. Highly dense columns aren’t very unique. They’ll return a lot of rows for a given value. Think about Zip Codes, Gender, Marital Status, etc. If you were to select all the people in 10002, a densely (there’s that word again) populated zip code in Chinatown, you’d probably wait a while, kill the query, and add another filter.

Combine that with Kendra Little’s statistics FAQ for additional learning.
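If you want to see density for yourself, here is a quick sketch (the table, column, and statistics names are hypothetical). Density is just 1 divided by the number of distinct values, so low density means high selectivity:

```sql
-- Compute density by hand: 1 / (number of distinct values).
SELECT COUNT(*)                      AS total_rows,
       COUNT(DISTINCT ZipCode)       AS distinct_zips,
       1.0 / COUNT(DISTINCT ZipCode) AS density
FROM dbo.People;

-- Or see what the optimizer actually has stored for a statistics object.
DBCC SHOW_STATISTICS (N'dbo.People', N'IX_People_ZipCode') WITH DENSITY_VECTOR;
```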


Error Handling In Service Broker

Colleen Morrow shows how to handle poison messages and other errors in Service Broker:

This type of situation, a message that can never be processed successfully, is known as a poison message.  The name kind of makes it sound like there’s a problem with the message itself.  And there might be.  Perhaps the message format is wrong for what the receiving code was expecting.  But maybe the problem is with the receiving code itself.  Regardless of what causes the poison message, it has to be dealt with.

SQL Server has a built-in mechanism for handling poison messages.  If a transaction that receives a message rolls back 5 times, SQL Server will disable the queue.  So that means that all processing that depends on that queue will cease.  Nice, huh?  Because of this, it behooves you to make sure you include proper error handling in your message processing code.  And how exactly you handle errors will depend on several factors:

Handling errors safely is a huge part of asynchronous programming.
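As one illustration of the kind of error handling Colleen is talking about, here is a rough sketch of a defensive processing loop body. This is not her code, and the queue and logging table names are placeholders; the key idea is that a handled failure gets logged and committed rather than rolled back five times.

```sql
-- A sketch of poison-message-aware processing (object names are placeholders).
DECLARE @handle       uniqueidentifier,
        @message_body varbinary(max),
        @message_type sysname;

BEGIN TRY
    BEGIN TRANSACTION;

    WAITFOR (
        RECEIVE TOP (1)
            @handle       = conversation_handle,
            @message_body = message_body,
            @message_type = message_type_name
        FROM dbo.TargetQueue
    ), TIMEOUT 5000;

    IF @handle IS NOT NULL
    BEGIN
        -- ... process @message_body here ...
        END CONVERSATION @handle;
    END;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF XACT_STATE() = -1
        ROLLBACK TRANSACTION;  -- doomed transaction: no choice but to roll back
    ELSE IF XACT_STATE() = 1
    BEGIN
        -- Dead-letter the message and commit, so this failure does not
        -- count toward the five rollbacks that disable the queue.
        INSERT INTO dbo.BrokerDeadLetters
            (message_type, message_body, error_message, logged_at)
        VALUES
            (@message_type, @message_body, ERROR_MESSAGE(), SYSDATETIME());

        COMMIT TRANSACTION;
    END;
END CATCH;
```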
