Category: R

Spark + R Webinar

David Smith points out a recent webinar on combining Microsoft R Server with HDInsight:

As Mario Inchiosa and Roni Burd demonstrate in this recorded webinar, Microsoft R Server can now run within HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR (pdf) take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. And if your data grows or you just need more power, you can dynamically add nodes to the HDInsight cluster using the Azure portal.

I don’t normally link to webinars (because they tend to violate my “should be viewable in a coffee break” rule of thumb) but I have a soft spot in my heart for these technologies.  If you want to dig into more “mainstream” (off the Microsoft platform) Spark + R fun, check out SparkR.

Exploring Taxi Data

David Smith ties together two of my favorite technologies in R and Hadoop to analyze New York City taxi data:

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 GB). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed RStudio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.
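For readers who have not met AUC before, here is a minimal sketch of what the number measures: the share of (tipped, not tipped) pairs in which the model gives the tipped ride the higher score. It is written in Python rather than R, and the labels and scores are toy values made up for illustration, not figures from the taxi analysis.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every positive/negative pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # A pair counts as a win when the positive case outscores the negative one;
    # ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 = a tip was given, 0 = no tip.
tipped = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.7]

toy_auc = auc(tipped, scores)
print(toy_auc)
```

An AUC of 0.5 means the scores are no better than a coin flip at ranking tipped rides above untipped ones, and 1.0 means every tipped ride outranks every untipped one.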

If you’re looking for a data set for exploration, this is certainly a good one.

Collapsing Lists In R

Steph Locke shows how to collapse a list of data frames into a single data table:

With my HIBPwned package, I consume the HaveIBeenPwned API and return back a list object with an element for each email address. Each element holds a data.frame of breach data or a stub response with a single column data.frame containing NA. Elements are named with the email addresses they relate to. I had a list of data.frames and I wanted a consolidated data.frame (well, I always want a data.table).

Enter data.table …

data.table has a very cool, and very fast function named rbindlist(). This takes a list of data.frames and consolidates them into one data.table, which can, of course, be handled as a data.frame if you didn’t want to use data.table for anything else.
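To make the idea concrete, here is a rough Python analogue of what a call along the lines of rbindlist(x, fill = TRUE, idcol = "email") does: stack the rows of every list element, fill missing columns with an empty value, and record which element each row came from. This is only a sketch of the concept, not Steph's code; the email addresses, column names, and the bind_rows helper are all hypothetical.

```python
def bind_rows(named_tables, id_col="email"):
    """Stack every table's rows into one list, tagging each row with its list name."""
    # Union of column names in first-seen order (the fill = TRUE behaviour).
    columns = [id_col]
    for rows in named_tables.values():
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)

    combined = []
    for key, rows in named_tables.items():
        for row in rows:
            filled = {name: row.get(name) for name in columns}  # missing -> None
            filled[id_col] = key  # the idcol: which list element the row came from
            combined.append(filled)
    return combined


# Hypothetical input: each "data frame" is a list of row dicts.
breaches = {
    "a@example.com": [
        {"Name": "SiteOne", "PwnCount": 100},
        {"Name": "SiteTwo", "PwnCount": 250},
    ],
    "b@example.com": [{"Name": None}],  # stub response: one column holding NA
}

result = bind_rows(breaches)
for row in result:
    print(row)
```

In R the whole helper collapses into the single rbindlist() call, which is the point of the excerpt.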

Something that continuously amazes me with R is just how terse the language can be without collapsing into Perl.

Jupyter Notebooks With R

Andrie de Vries notes that Azure Machine Learning now supports Jupyter Notebooks with R:

I wrote about Jupyter Notebooks in September 2015 (Using R with Jupyter Notebooks), where I noted some of the great benefits of using notebooks:

  • Jupyter is an easy to use and convenient way of mixing code and text in the same document.

  • Unlike other reporting systems like RMarkdown and LaTeX, Jupyter notebooks are interactive – you can run the code snippets directly in the document.

  • This makes it easy to share and publish code samples as well as reports.

Jupyter Notebooks is a fine application, but up until now, you could only integrate it with Azure Machine Learning if you were writing Python code.  This move is a big step forward for Azure ML.

satRdays

Steph Locke notes that the R Consortium has agreed to support satRdays:

I’m very pleased to say that the R Consortium agreed to support the satRday project!

The idea kicked off in November and I was over the moon with the response from the community, then we garnered support before submitting to the Consortium and I must have looped the moon a few times as we had more than 500 responses. Now the R Consortium are supporting us and we can turn all that enthusiasm into action.

This is great.  I’m looking forward to this taking off and being a nice complement to SQL Saturdays in cities.

HIBPwned

Steph Locke has created an R package to query Troy Hunt’s Have I Been Pwned? site:

The answer in life to the inevitable question of “How can I do that in R?” should be “There’s a package for that”. So when I wanted to query HaveIBeenPwned.com (HIBP) to check whether a bunch of emails had been involved in data breaches and there wasn’t an R package for HIBP, it meant that the responsibility for making one landed on my shoulders. Now, you can see if your accounts are at risk with the R package for HaveIBeenPwned.com, HIBPwned.

This is a nice confluence of two fun topics, so of course I like it.

R And SSH Tunnels

Steph Locke shows how to set up an SSH tunnel to connect to an external server within R:

Whilst down the rabbit hole, I discovered just in passing via a beanstalk article that there’s actually been a command line interface for PuTTY called plink. D’oh! This changed the whole direction of the solution to what I present throughout.

Using plink.exe as the command line interface for PuTTY we can then connect to our remote network using the key pre-authenticated via pageant. As a consequence, we can now use the shell() command in R to use plink. We can then connect to our database using the standard Postgres driver.
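The excerpt does not include the exact command, so here is a hedged sketch of what a plink port-forwarding call generally looks like, written in Python rather than R. The -agent, -N, and -L options are standard plink flags; the user name, host names, ports, and the plink_tunnel_command helper are assumptions made up for illustration.

```python
def plink_tunnel_command(user, gateway, remote_host, remote_port, local_port,
                         plink_path="plink.exe"):
    """Build the argument list for a plink local port forward."""
    return [
        plink_path,
        "-agent",  # authenticate with keys already loaded into Pageant
        "-N",      # do not start a remote shell, just hold the tunnel open
        "-L", f"{local_port}:{remote_host}:{remote_port}",
        f"{user}@{gateway}",
    ]


# Hypothetical values: forward local port 5433 to a Postgres server behind the gateway.
command = plink_tunnel_command("analyst", "gateway.example.com", "db.internal", 5432, 5433)
print(" ".join(command))
```

With the tunnel running (for example, started through subprocess.Popen(command) in Python or shell() in R), the database driver connects to localhost on the local port instead of the remote server directly.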

PuTTY is a must-have for any Windows box.

Mockaroo

Steph Locke tells us about a way to mock data for R:

Mockaroo is a really impressive service with a wide spread of different data types. They also have simple ways of adding things like within group differences to data so that you can mock realistic class differences. They use the freemium model so you can get a thousand rows per download, which is pretty sweet. The big BUT you can feel coming on is this – it’s a GUI! I don’t want to have to spend time hand cranking a data extract.

Thankfully, they have an API for getting data too and it’s pretty simple to use so I’ve started making a package for it.

Steph is working on an R package, so this is pretty exciting.

R Tools For Visual Studio Launched

R now integrates into Visual Studio:

RTVS is an IDE and as such you can use it with any recent version of R such as 3.2.x. If you install the free Microsoft R Open, you automatically get some turbo options such as threading support on multi-processor machines, providing significant speedup for a variety of analytical functions, as well as package collections check-pointed to a particular date/version. Microsoft R Server provides Big Data support and additional advanced features that can be used with SQL Server.

This is an early release, so expect a few bugs and some missing functionality.  It also isn’t RStudio—it’s RStudio several years ago.  But what it does nicely is integrate with the rest of your stack:  you can tie together the R code, the C#/F# code which helps clean data, the SQL Server project which holds your data, etc. etc.

Credit Card Fraud Detection Using R

David Smith gives us a tutorial on credit card fraud detection:

If you have a database of credit-card transactions with a small percentage tagged as fraudulent, how can you create a process that automatically flags likely fraudulent transactions in the future? That’s the premise behind the latest Data Science Deep Dive on MSDN. This tutorial provides a step-by-step guide to using the R language and the big-data statistical models of the RevoScaleR package of SQL Server 2016 R Services to build and use a predictive model to detect fraud.

This looks to be a follow-up from the fraud detection series.
