The R data frame is a high level data structure which is equivalent to a table in database systems. It is highly useful to work with machine learning algorithms, and it’s very flexible and easy to use.
The standard definition of data frames are a “tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R‘s modeling software.”
Data frames are a powerful abstraction and make R a lot easier for database professionals than application developers who are used to thinking iteratively and with one object at a time.
When you start playing with R in SQL Server, sooner or later you would need to install some packages, for example ggplot2. You may run into a problem that sounds like this “Error in library(“ggplot2”) : there is no package called ‘ggplot2’“.The following script is used in the iris_demo.sql (SQLServer2016CTP3Samples\Advanced Analytics\iris_demo.sql), and would cause a missing library error if you don’t have the packages installed on SQL Server R Services yet.
Julie shows two methods, one a Good Idea and the other a Bad(?) Idea.
Just at this time Ari published his webinar about getting shape files into R. Which also includes a introduction to shape files to get you going, if you are new to it, as I am. I remembered Ari from his mail course introducing his great R-package (choroplethr). By the way this is a terrible name, being a biologist by heart, I always type “chloroplethr”, as in “chlorophyll”, and this is not found by the R package manager. [Editor’s note: I agree!]
Next question, where do I get the shapefiles, describing Germany? A major search engine was of great help here. http://www.suche-postleitzahl.org/downloads?download=zuordnung_plz_ort.csv . Germany has some 8700 zip code areas, so expect some time for rendering the file, if you do on your computer. Right on this side one can also find a dataset which might act as a useful warm up practice to display statistical data in a geographical context. Other sources are https://datahub.io/de/dataset/postal-codes-de
This is really cool.
Data tends to come from databases that must support many different tasks, so it is exactly the case that there may be columns or variables that are correlated to unknown and unwanted additional processes. The reason PCA can’t filter out these noise variables is that without use of y, standard PCA has no way of knowing what portion of the variation in each variable is important to the problem at hand and should be preserved. This can be fixed through domain knowledge (knowing which variables to use), variable pruning and y-aware scaling. Our next article will discuss these procedures; in this article we will orient ourselves with a demonstration of both what a good analysis and what a bad analysis looks like.
All the variables are also deliberately mis-scaled to model some of the difficulties of working with under-curated real world data.
This does read like an academic paper, so it’s pretty heavy reading. It’s also very good reading from a great writer, so take some time and give it a read if you do data analysis.
The examples were done using Microsoft R Open, but since it’s 100% compatible with R the code works with any relatively recent R version.
Naomi and Joyce presented several examples from their e-book in a recent webinar (presented by Microsoft), and fielded lots of interesting questions from the audience. If you’d like to see the recorded webinar and also receive a copy of the slides and the e-book, follow the link below to register to receive the materials via email.
The book is free, the code is available on GitHub. What more could you ask for?
In addition to the original features in the raw data, we add number of bikes rented in each of the previous 12 hours as features to provide better predictive power. We create a
computeLagFeatures()helper function to compute the 12 lag features and use it as the transformation function in
rxDataStep()processes data chunk by chunk and lag feature computation requires data from previous rows. In
computLagFeatures(), we use the internal function
.rxSet()to save the last n rows of a chunk to a variable lagData. When processing the next chunk, we use another internal function
.rxGet()to retrieve the values stored in lagData and compute the lag features.
This is a great article for anybody wanting to dig into analytics, because they show their work.
As this is a major release, you’ll need to re-install any packages you were using (and perhaps wait a little while until package authors make any compatibility fixes needed for version 3.3.0). If you’re on the Windows platform, Tal Galili’s installr package automates the process for you. If you are using the checkpoint package (on any platform) you can simply increment the checkpoint date to anytime after May 2, 2016.
(For Microsoft R Open users, the next version to be released will be MRO 3.2.5, and MRO 3.3.0 will follow soon thereafter.)
For more information about R 3.3.0, including the detailed list of changes and bug fixes, follow the link to the announcement from the R Core Group below.
R Tools for Visual Studio, the open-source extenstion to Visual Studio that provides an IDE for the R language, has been upgraded to include several new features.
The latest update, RTVS 0.3, now includes:
An R package manager, allowing you to review, install, and uninstall packages using a convenient user interface.
The Variable Explorer now allows you to open data-frames for viewing in an Excel workbook.
New toolbar buttons to run selected code, source the current script, import data from a URL or file, and start/stop a Shiny app.
This is a great time to get interested in R. If you’re familiar with Visual Studio, Microsoft is making great strides toward integrating things nicely.
As Mario Inchiosa and Roni Burd demonstrate in this recorded webinar, Microsoft R Server can now run within HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR (pdf) take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. And if your data grows or you just need more power, you can dynamically add nodes to the HDInsight cluster using the Azure portal.
I don’t normally link to webinars (because they tend to violate my “should be viewable in a coffee break” rule of thumb) but I have a soft spot in my heart for these technologies. If you want to dig into more “mainstream” (off the Microsoft platform) Spark + R fun, check out SparkR.
Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 Gb). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed R Studio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.
To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.
If you’re looking for a data set for exploration, this is certainly a good one.
With my HIBPwned package, I consume the HaveIBeenPwned API and return back a list object with an element for each email address. Each element holds a data.frame of breach data or a stub response with a single column data.frame containing NA. Elements are named with the email addresses they relate to. I had a list of data.frames and I wanted a consolidated data.frame (well, I always want a data.table).
Enter data.table …
data.table has a very cool, and very fast function named
rbindlist(). This takes a list of data.frames and consolidates them into one data.table, which can, of course, be handled as a data.frame if you didn’t want to use data.table for anything else.
Something that continuously amazes me with R is just how terse the language can be without collapsing into Perl.