R In Linux For Windows

David Smith shows how to install and use R in the Windows Subsystem for Linux:

R has been available for Windows since the very beginning, but if you have a Windows machine and want to use R within a Linux ecosystem, that’s easy to do with the new Fall Creator’s Update (version 1709). If you need access to the gcc toolchain for building R packages, or simply prefer the bash environment, it’s easy to get things up and running.

Once you have things set up, you can launch a bash shell and run R at the terminal like you would in any Linux system. And that’s because this is a Linux system: the Windows Subsystem for Linux is a complete Linux distribution running within Windows. This page provides the details on installing Linux on Windows, but here are the basic steps you need and how to get the latest version of R up and running within it.

Click through for a quick tutorial.

A Hack For Dynamic ML Services Result Sets

Dave Mason has put together a solution to his dynamic data frame naming problem:

We can take those names and R types, string them together, and “convert” them to SQL data types. (Mapping data types from one language to another is waaaay outside the scope of this post. Lines 11-13 are quick and dirty, just for demonstration purposes. Okie dokie?)

It’s certainly a less than ideal solution, but it does the job.  And I hope as well that someday this functionality ends up being something built into the product.

Taking A Random Walk

Dan Goldstein describes the basics of Brownian motion:

I was sitting in a bagel shop on Saturday with my 9 year old daughter. We had brought along hexagonal graph paper and a six sided die. We decided that we would choose a hexagon in the middle of the page and then roll the die to determine a direction:

1 up (North)
2 diagonal to the upper right (Northeast)
3 diagonal to the lower right (Southeast)
4 down (South)
5 diagonal to the lower left (Southwest)
6 diagonal to the upper left (Northwest)

Our first roll was a six so we drew a line to the hexagon northwest of where we started. That was the first “step.”

After a few rolls we found ourselves coming back along a path we had gone down before. We decided to draw a second line close to the first in those cases.

We did this about 50 times. The results are pictured above, along with kid hands for scale.

Javi Fernandez-Lopez then shows how to generate an animated GIF displaying Brownian motion:

Last Monday we celebrated a “Scientific Marathon” at Royal Botanic Garden in Madrid, a kind of mini-conference to talk about our research. I was talking about the relation between fungal spore size and environmental variables such as temperature and precipitation. To make my presentation more friendly, I created a GIF to explain the Brownian Motion model. In evolutionary biology, we can use this model to simulate the random variation of a continuous trait through time. Under this model, we can notice how closer species tend to maintain closer trait values due to shared evolutionary history. You have a lot of information about Brownian Motion models in evolutionary biology everywhere!

Another place that this is useful is in describing stock market movements in the short run.

Introduction To Neural Nets

Ben Gorman has a two-part series introducing neural networks.  First, the basics behind neural networks:

We can solve both of the above issues by adding an extra layer to our perceptron model. We’ll construct a number of base models like the one above, but then we’ll feed the output of each base model as input into another perceptron. This model is in fact a vanilla neural network. Let’s see how it might work on some examples.

Then, he digs into the mathematics of backpropagation:

Our problem is one of binary classification. That means our network could have a single output node that predicts the probability that an incoming image represents stairs. However, we’ll choose to interpret the problem as a multi-class classification problem – one where our output layer has two nodes that represent “probability of stairs” and “probability of something else”. This is unnecessary, but it will give us insight into how we could extend task for more classes. In the future, we may want to classify {“stairs pattern”, “floor pattern”, “ceiling pattern”, or “something else”}.

Our measure of success might be something like accuracy rate, but to implement backpropagation (the fitting procedure) we need to choose a convenient, differentiable loss function like cross entropy. We’ll touch on this more, below.

This is definitely a series to read after you’ve gotten your coffee.

Microsoft AI School: ML Services

David Smith points out an interesting resource for learning more about Microsoft ML Services:

If you’d like to learn how you use R to develop AI applications, the Microsoft AI School now features a learning path focused on Microsoft R and SQL Server ML Services. This learning path includes eight modules, each comprising detailed tutorials and examples:

Read on for more details.

Outlier Detection In R

Giorgio Garziano has an introduction to outlier detection and intervention analysis using R:

Now, we implement a similar representation of the transient change outlier by taking advantage of the arimax() function within the TSA package. The arimax() function requires to specify some ARMA parameters, and that is done by capturing the seasonality as discussed in ref. [1]. Further, the transient change is specified by means of xtransf and transfer input parameters. The xtransf parameter is a matrix with each column containing a covariate that affects the time series response in terms of an ARMA filter of order (p,q). For our scenario, it provides a value equal to 1 at the outliers time index and zero at others. The transfer parameter is a list consisting of the ARMA orders for each transfer covariate. For our scenario, we specify an AR order equal to 1.

Check it out.

Warning When Using dplyr Mutate

John Mount has a warning if you are using dplyr’s mutate function and connecting to Spark or a database:

If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you inspect your code to ensure you are not using any values created inside a dplyr::mutate() statement inside the same dplyr::mutate() statement. This has been my coding advice for some time, and it is a simple and safe re-factoring to break up such statements into safer sequences (simply by introducing more dplyr::mutate()s).

I have since encountered a non-signaling (or silent) result corruption version of the issue. We are now advising code inspection as we now have confirmation that not seeing a thrown error is not a reliable indication of correct execution and correct results.

Thanks to John for reporting, and hopefully the dplyr team can fix it.

Network And Sankey Diagrams In Python And R

Tony Hirst has a roundup of various R and Python packages which build network charts or Sankey diagrams:

Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use looks like it could be useful here, if I can work out how to do the set-up correctly!

Sankey diagrams are on my list of dangerous visuals:  done right, they are informative, but it’s easy to try to put too much into the diagram and thereby confuse everybody.

Bridging The R-Python Gap

Siddarth Ramesh argues that revoscalepy helps R developers acquaint themselves with Python:

I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.

Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. Unfortunately, many times, this is more challenging because the production system might demand that the entire end to end workflow is built exclusively in Python. That’s sometimes tough because there are aspects of statistical model building in R which are more intuitive than Python.

Python has many strengths, such as its robust data structures such as Dictionaries, compatibility with Deep Learning and Spark, and its ability to be a multipurpose language. However, many scenarios in enterprise analytics require people to go back to basic statistics and Machine Learning, which the classic Data Science packages in Python are not as intuitive as R for. The key difference is that many statistical methods are built into R natively. As a result, there is a gap for when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft, revoscalepy.

Having worked with both, my loyalties tend to lie with R for a couple of reasons.  But this might help some people bridge the gap.

Using Keras To Predict Customer Churn

Matt Dancho has an example of building a neural net using Keras to predict customer churn:

Pro Tip: A quick test is to see if the log transformation increases the magnitude of the correlation between “TotalCharges” and “Churn”. We’ll use a few dplyr operations along with the corrr package to perform a quick correlation.

  • correlate(): Performs tidy correlations on numeric data

  • focus(): Similar to select(). Takes columns and focuses on only the rows/columns of importance.

  • fashion(): Makes the formatting aesthetically easier to read.

This is a very useful tutorial.


December 2017
« Nov