Press "Enter" to skip to content

Category: R

Microsoft R Open 3.4.0

David Smith announces Microsoft R Open 3.4.0:

R 3.4.0 (upon which MRO 3.4.0 is based) is a major update to the R language, with many fixes and improvements. Most notably, R 3.4.0 introduces a just-in-time (JIT) compiler to improve performance of the scripts and functions that you write. There have been a few minor tweaks to the language itself, but in general functions and packages written for R 3.3.x should work the same in R 3.4.0. As usual, MRO points to a fixed CRAN snapshot from May 1 2017, but you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).

The version of Microsoft R Server shipping with SQL Server 2017 will still be based on 3.3.3, but I’m going to guess that a new version of Microsoft R Server supporting 3.4.0 will ship in the next several months.

Comments closed

Fisher’s Exact Test

Mala Mahadevan explains Fisher’s Exact Test and provides examples in T-SQL and R:

The decision rule in two sample tests of hypothesis depends on three factors :
1 Whether the test is upper, lower or two tailed (meaning the comparison is greater, lesser or both sides of gender and speaker count)
2 The level of significance or degree of accuracy needed,
3 The form of test statistic.
Our test here is to just find out if gender and speaker count are related so it is a two tailed test. The level of significance we can use is the most commonly used 95% which is also the default in R for Fischer’s Test. The form of the test statistic is P value. So our decision rule would be that gender and speaker category are related if P value is less than 0.05.

Click through for the R code followed by a code sample which should explain why you don’t want to do it in T-SQL.

1 Comment

Deploying Packages To SQL Server R Services

Tracy Boggiano has a Powershell script to deploy packages to an instance running SQL Server R Services:

Somehow I have become the R DBA at my job which I don’t mind, I plan on taking Microsoft’s Professional Program on Data Science to be familiar with it.  But recently I’ve had to upload files to our R servers which the first time wasn’t too bad.  Copy these files to six different servers but come the second time around it became apparent that the Predictive Analytics Manger was going to be asking me to do this more frequently than I wanted to to it manually.  So I wrote a quick PowerShell function to take care of this added to our module we use in house.  It unzips the file provided to the correct location.  It does assume you have administrative rights to your server i.e. you can use the admin shares (c$) for example on the server.  You will need to get the function Get-CMSHost from my Running SQL Scripts Against Multiple Servers Using PowerShell post to run the code below.

Click through for the script.  This is particularly useful for deploying in-house packages and you don’t want to set up a miniCRAN.

Comments closed

Three-Way Variance Analysis

Bogdan Anastasiei shows how to perform a three-way variance analysis when the third-order and second-order effects are both statistically significant:

In the formula above the interaction effect is, of course, dosegendertype. The ANOVA results can be seen below (we have only kept the line presenting the third-order interaction effect).

Df Sum Sq Mean Sq F value   Pr(>F)
dose:gender:type   2    187    93.4  22.367 3.81e-10

The interaction effect is statistically significant: F(2)=22.367, p<0.01. In other words, we do have a third-order interaction effect. In this situation, it is not advisable to report and interpret the second-order interaction effects (they could be misleading). Therefore, we are going to compute the simple second-order interaction effects.

This is definitely not a trivial article, but there are useful techniques in it.

Comments closed

Building A Spinning Globe With R

James Cheshire shows how to use R to create an image of a spinning globe:

It has been a long held dream of mine to create a spinning globe using nothing but R (I wish I was joking, but I’m not). Thanks to the brilliant mapmate package created by Matt Leonawicz and shed loads of computing power, today that dream became a reality. The globe below took 19 hours and 30 processors to produce from a relatively low resolution NASA black marble data, and so I accept R is not the best software to be using for this – but it’s amazing that you can do this in R at all!

Now all that is missing is a giant TV and an evil lair.

Comments closed

Multiple R Studio Users On HDInsight

Xiaoyong Zhu shows how to set up additional R Studio users in an HDInsight cluster:

Basically speaking, the “http user” will be used to authenticate through the HDInsight gateway, which is used to protect the HDInsight clusters you created. This user is used to access the Ambari UI, YARN UI, as well as many other UI components.

The “ssh user” will be used to access the cluster through secure shell. This user is actually a user in the Linux system in all the head nodes, worker nodes, edge nodes, etc., so you can use secure shell to access the remote clusters.

For Microsoft R Server on HDInsight type cluster, it’s a bit more complex, because we put R Studio Server Community version in HDInsight, which only accepts Linux user name and password as login mechanisms (it does not support passing tokens), so if you have created a new cluster and want to use R Studio, you need to first login using the http user’s credential and login through the HDInsight Gateway, and then use the ssh user’s credential to login to RStudio.

It’s a good read and also includes a sample Spark-R job.

Comments closed

R Is Bad For You?

Bill Vorhies lays out a controversial argument:

I have been a practicing data scientist with an emphasis on predictive modeling for about 16 years.  I know enough R to be dangerous but when I want to build a model I reach for my SAS Enterprise Miner (could just as easily be SPSS, Rapid Miner or one of the other complete platforms).

The key issue is that I can clean, prep, transform, engineer features, select features, and run 10 or more model types simultaneously in less than 60 minutes (sometimes a lot less) and get back a nice display of the most accurate and robust model along with exportable code in my selection of languages.

The reason I can do that is because these advanced platforms now all have drag-and-drop visual workspaces into which I deploy and rapidly adjust each major element of the modeling process without ever touching a line of code.

I have almost exactly the opposite thought on the matter:  that drag-and-drop development is intolerably slow; I can drag and drop and connect and click and click and click for a while, or I can write a few lines of code.  Nevertheless, I think Bill’s post is well worth reading.

Comments closed

Pretty R Plots

Simon Jackson has a couple posts on how to use ggplot2 to make graphs prettier.  First, histograms:

Time to jazz it up with colour! The method I’ll present was motivated by my answer to this StackOverflow question.

We can add colour by exploiting the way that ggplot2 stacks colour for different groups. Specifically, we fill the bars with the same variable (x) but cut into multiple categories:

Then he follows up with scatter plots:

Shape and size

There are many ways to tweak the shape and size of the points. Here’s the combination I settled on for this post:

There are some nice tricks here around transparency, color scheme, and gradients, making it a great series.  As a quick note, this color scheme in the histogram headliner photo does not work at all for people with red-green color-blindness.  Using a URL color filter like Toptal’s is quite helpful in discovering these sorts of issues.

Comments closed

R And Python Support In VS 2017

David Smith announces that Visual Studio 2017 now supports R Tools for Visual Studio and Python Tools for Visual Studio:

The new Visual Studio 2017 has built-in support for programming in R and Python. For older versions of Visual Studio, support for these languages has been available via the RTVS and PTVS add-ins, but the new Data Science Workloads in Visual Studio 2017 make them available without a separate add-in. Just choose the “Data Science and analytical applications” option during installation to install everything you need, including Microsoft R Client and the Anaconda Python distribution.

I’m personally going to wait a little bit before jumping onto Visual Studio 2017, but I’m glad that RTVS is now available.

Comments closed

Basic Data Tidying

Sarah Dutkiewicz tidies up a data set in R:

Looking at this data, the first thing I thought was untidy. There has to be a better way. When I think of tidy data, I think of the tidyr package, which is used to help make data tidy, easier to work with. Specifically, I thought of the spread() function, where I could break things up. Once data was spread into appropriate columns, I figure I can operate on the data a bit better.

Sarah has also made the data set available in case you’re interested in following along.

Comments closed