Press "Enter" to skip to content

Category: R

Crime Analysis

Raghavan Madabusi combines Zeppelin, R, and Spark to perform crime analysis:

Apache Zeppelin, a web-based notebook, enables interactive data analytics, including Data Ingestion, Data Discovery, and Data Visualization, all in one place. The Zeppelin interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. Currently, Zeppelin supports many interpreters, such as Spark (Scala, Python, R, SparkSQL), Hive, JDBC, and others. Zeppelin can be configured against an existing Spark ecosystem and share a SparkContext across Scala, Python, and R.

This links to a rather long post on how to set up and use all of these pieces.  I’m more familiar with Jupyter than Zeppelin, but regardless of the notebook you choose, this is a good exercise to become familiar with the process.
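If you haven’t seen Zeppelin, a paragraph bound to the SparkR interpreter looks roughly like this.  This is a hypothetical sketch, not code from the post; the %spark.r prefix and the Spark 1.x-style sqlContext are assumptions about the setup.

```r
# %spark.r  -- hypothetical Zeppelin paragraph using the SparkR interpreter;
# it shares the notebook's SparkContext with the Scala, Python, and SQL paragraphs.
df <- createDataFrame(sqlContext, faithful)  # push a local R data frame into Spark
head(df)                                     # pull the first few rows back for display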


Predictive Maintenance

David Smith shows off a predictive maintenance gallery for dealing with aircraft engines:

In each case, a number of different models are trained in R (decision forests, boosted decision trees, multinomial models, neural networks, and Poisson regression) and compared for performance; the best model is automatically selected for predictions.

On a related note, Microsoft recently teamed up with aircraft engine manufacturer Rolls-Royce to help airlines get the most out of their engines. Rolls-Royce is turning to Microsoft’s Azure cloud-based services — Stream Analytics, Machine Learning and Power BI — to make recommendations to airline executives on the most efficient way to use their engines in flight and on the ground. This short video gives an overview.

Check out the data set and play around a bit.
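This isn’t the gallery’s code, but here’s a toy base-R sketch of the “train several models, compare them, keep the best” pattern on made-up data, just to show the shape of the idea:

```r
# Simulate a small maintenance-style data set: sensor readings and failure counts.
set.seed(1)
n <- 500
d <- data.frame(temp = rnorm(n), vibration = rnorm(n))
d$failures <- rpois(n, lambda = exp(0.3 * d$temp + 0.5 * d$vibration))

train <- d[1:400, ]
test  <- d[401:500, ]

# Fit two candidate models and compare out-of-sample RMSE.
models <- list(
  poisson  = glm(failures ~ temp + vibration, data = train, family = poisson),
  gaussian = lm(failures ~ temp + vibration, data = train)
)
rmse <- sapply(models, function(m)
  sqrt(mean((predict(m, test, type = "response") - test$failures)^2)))
best <- models[[which.min(rmse)]]   # automatically keep the best performer
rmse
```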


Feather

David Smith discusses Feather:

Unlike most other statistical software packages, R doesn’t have a native data file format. You can certainly import and export data in any number of formats, but there’s no native “R data file format”. The closest equivalent is the saveRDS/readRDS function pair, which allows you to serialize an R object to a file and then load it back into a later R session. But these files don’t hew to a standardized format (it’s essentially a dump of R’s in-memory representation of the object), and so you can’t read the data with any software other than R.

The goal of the feather project, a collaboration of Wes McKinney and Hadley Wickham, is to create a standard data file format that can be used for data exchange by and between R, Python, and any other software that implements its open-source format. Data are stored in a computer-native binary format, which makes the files small (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file) and fast to read and write (no need to convert numbers to text and back again). Another reason why feather is fast is that it’s a column-oriented file format, which matches R’s internal representation of data. (In fact, feather is based on the Apache Arrow framework for working with columnar data stores.) When reading or writing traditional data files, R must spend significant time translating the data from column format to row format and back again; with feather, that entire translation step is eliminated.

Given the big speedup in read time, I can see this file format being rather useful.  I just can’t see it catching on as a common external data format, though, unless most tools get retrofitted to support the file.  So instead, it’d end up closer to something like Avro or Parquet:  formats we use in our internal tools because they’re so much faster, but not formats we send across to other companies because they’re probably using a different set of tools.
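For reference, the read/write API is tiny.  A minimal sketch, assuming you have the feather package installed:

```r
library(feather)

df <- data.frame(id = 1:1e6, value = rnorm(1e6))

write_feather(df, "df.feather")    # columnar, binary, no text conversion
df2 <- read_feather("df.feather")  # comes back as a data frame (tibble)

# The R-only route for comparison:
saveRDS(df, "df.rds")
df3 <- readRDS("df.rds")
```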


Looking At R Services

Gail Shaw reviews R support in SQL Server 2016:

It’s not fast. The above piece of T-SQL took ~4 seconds to execute. This is on an Azure A3 VM. Not a great machine admittedly, but the R code, which just returns the first 6 rows of a built-in data set, ran in under a second on my desktop. This is likely not something you’ll be doing as part of an OLTP process.

I hope this external_script method is temporary. It’s ugly, hard to troubleshoot, and it means I have to write my R somewhere else, probably R Studio, maybe Visual Studio, and move it over once tested and working. I’d much rather see something like

I agree about the sp_execute_external_script mess.  It’s the worst of dynamic SQL combined with multiple languages (T-SQL for the stored procedure and R for the contents, while taking care to deal with T-SQL single-quoting).  Still, even with these issues, I think this will be a very useful tool for data analysts, particularly when dealing with rather large data sets on warehouse servers with plenty of RAM.
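For context, the R payload in a test like Gail’s is tiny; it’s the T-SQL wrapping that gets ugly.  A hypothetical script body of the sort you’d embed (OutputDataSet is R Services’ default name for the returned data frame):

```r
# Hypothetical body of the @script parameter passed to sp_execute_external_script.
# Inside T-SQL this whole block lives in a single-quoted string literal, so any
# single quotes in the R code have to be doubled.
OutputDataSet <- head(iris)   # return the first 6 rows of a built-in data set
```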


R In SQL Server 2016

Ginger Grant walks through installing R for SQL Server 2016:

The code is executed as an external script, specifying that the language used should be R. @script contains the R code, which is a simple command to take the mean of the data coming from the InputDataSet. @Input_Data_1 contains the location of the data to be processed. In this case the data set is a table containing Amazon review data, where the overall field is the rating field. The R code could of course be more complicated, but I was hoping that this example was generic enough that many people would be able to duplicate it and run their first R code.

This is quite a bit easier to install in RTM(ish) than it was back in CTP 3, so good job Microsoft.
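The R side of a call like that is only a line or two.  A hedged sketch of what @script might contain, going off the quoted description (the overall column name comes from that description; InputDataSet and OutputDataSet are R Services’ default data frame names):

```r
# Hypothetical contents of @script: average the "overall" rating column that
# arrives from @input_data_1 as the InputDataSet data frame.
OutputDataSet <- data.frame(avg_overall = mean(InputDataSet$overall))
```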


Interactive Heatmaps

Sahir Bhatnagar uses heatmaply to generate heatmaps:

In every statistical analysis, the first thing one should do is try and visualise the data before any modeling. In microarray studies, a common visualisation is a heatmap of gene expression data.

In this post I simulate some gene expression data and visualise it using the heatmaply package in R by Tal Galili. This package extends the plotly engine to heatmaps, allowing you to inspect certain values of the data matrix by hovering the mouse over a cell. You can also zoom into a region of the heatmap by drawing a rectangle over an area of your choice.

This went way past my rudimentary heatmap skills, so it’s nice to see what an advanced user can do.
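If you want to try the basic idea yourself, here’s a minimal sketch with simulated “gene expression” data, assuming the heatmaply package is installed:

```r
library(heatmaply)

# Simulate a small gene-expression-style matrix: 20 genes by 10 samples.
set.seed(42)
expr <- matrix(rnorm(200), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("sample", 1:10)))

heatmaply(expr)   # interactive: hover for cell values, drag a rectangle to zoom
```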


Data Frames

Saravanan Subramanian has an introduction to data frames in R:

The R data frame is a high-level data structure which is equivalent to a table in database systems.  It is highly useful for working with machine learning algorithms, and it’s very flexible and easy to use.

The standard definition describes data frames as “tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R’s modeling software.”

Data frames are a powerful abstraction, and they make R a lot easier for database professionals than for application developers who are used to thinking iteratively, one object at a time.
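If you’re coming from the database side, a quick sketch of why data frames feel familiar:

```r
# A data frame is a table-like structure: each column is a variable, each row an observation.
employees <- data.frame(
  name   = c("Ann", "Bob", "Cho"),
  dept   = c("Sales", "IT", "IT"),
  salary = c(52000, 61000, 58500)
)

str(employees)                       # column names and types, like a table definition
employees[employees$dept == "IT", ]  # set-based filtering, akin to a WHERE clause
mean(employees$salary)               # aggregate over a column
```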


Installing SQL Server R Services Packages

Julie Koesmarno shows how to install an R package on a SQL Server 2016 instance which has SQL Server R Services installed:

When you start playing with R in SQL Server, sooner or later you would need to install some packages, for example ggplot2. You may run into a problem that sounds like this “Error in library(“ggplot2”) : there is no package called ‘ggplot2’“.

The following script is used in the iris_demo.sql (SQLServer2016CTP3Samples\Advanced Analytics\iris_demo.sql), and would cause a missing library error if you don’t have the packages installed on SQL Server R Services yet.

Julie shows two methods, one a Good Idea and the other a Bad(?) Idea.
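I won’t spoil which is which, but the gist of installing into the instance library (not necessarily Julie’s exact steps; the path below is an assumption for a default SQL Server 2016 instance) looks like this:

```r
# Run this from the R console that ships with the instance, launched as administrator
# (e.g. ...\MSSQL13.MSSQLSERVER\R_SERVICES\bin\x64\R.exe -- that path is an assumption),
# so the package installs into the instance library rather than a per-user library.
install.packages("ggplot2")
library(ggplot2)   # confirm it now loads under SQL Server R Services
```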


Mapping German Postal Codes With R

Achim Rumberger shows how to map German postal codes using R:

Just at this time, Ari published his webinar about getting shape files into R, which also includes an introduction to shape files to get you going if you are new to them, as I am. I remembered Ari from his email course introducing his great R package (choroplethr). By the way, this is a terrible name; being a biologist at heart, I always type “chloroplethr”, as in “chlorophyll”, and that is not found by the R package manager. [Editor’s note: I agree!]

Next question: where do I get the shapefiles describing Germany? A major search engine was of great help here: http://www.suche-postleitzahl.org/downloads?download=zuordnung_plz_ort.csv. Germany has some 8,700 zip code areas, so expect the file to take some time to render on your computer. On this site one can also find a dataset which might act as useful warm-up practice for displaying statistical data in a geographical context. Other sources include https://datahub.io/de/dataset/postal-codes-de.

This is really cool.
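If you want to follow along, here’s a minimal sketch of reading and plotting a postal-code shapefile.  This is not Achim’s code; I’m using the sf package, and the file name is hypothetical, standing in for whichever PLZ shapefile you download.

```r
library(sf)

# Read a German postal-code ("PLZ") shapefile and draw its polygons.
plz <- st_read("plz-gebiete.shp")   # hypothetical file name for the downloaded shapefile
plot(st_geometry(plz))              # ~8,700 polygons, so rendering takes a while
```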
