Press "Enter" to skip to content

Category: R

Pipelines Everywhere

John Mount explains the benefit of pipes and pipelines, and shows us an advanced pipe in R:

The idea is: many important calculations can be considered as a sequence of transforms applied to a data set. Each step may be a function taking many arguments. It is often the case that only one of each function’s arguments is primary, and the rest are parameters. For data science applications this is particularly common, so having convenient pipeline notation can be a plus. An example of a non-trivial data processing pipeline can be found here.

In this note we will discuss the advanced R pipeline operator “dot arrow pipe” and an S4 class (wrapr::UnaryFn) that makes working with pipeline notation much more powerful and much easier.

As you’d expect from John, there’s a lot of detail and it’s an interesting approach.

Comments closed

On R Packages And Trust

Colin Gillespie shares some thoughts about the potentially over-trusting nature of R developers:

One of the great things about R, is the myriad of packages. Packages are typically installed via

– CRAN
– Bioconductor
– GitHub

But how often do we think about what we are installing? Do we pay attention or just install when something looks neat? Do we think about security or just take it that everything is secure? In this post, we conducted a little nefarious experiment to see if people pay attention to what they install.

Packages are code and like any other code, R packages can contain malicious content.

Comments closed

Market Basket Analysis With arulesSequences

Allison Koenecke takes us through the arulesSequences package in R:

In the following tutorial, we answer both questions using the R package arulesSequences [4], which implements the SPADE algorithm [5]. Concretely, given data in an Excel spreadsheet containing historical customer service purchase data, we produce two separate Excel sheet deliverables: a list of service bundles, and a set of temporal rules showing how service bundles evolve over time.  We will focus on interpreting the latter result by showing how to use temporal rules in making predictive sales recommendations.

Our running example below is inspired by the need for Microsoft’s Azure Services salespeople to suggest which additional products to recommend to customers, given the customers’ current cloud product consumption services mix.  We’d like to know, for instance, if customers who have implemented web services also purchase web analytics within the next month.  Actual Azure Service names have been removed for confidentiality reasons.

Market basket analysis is an interesting topic, though in my limited experience, it really falls apart when you have a large number of products to compare, so it tends to work better with toy examples or limited product selections because when you have a 50,000+ SKU inventory, the lift of any individual combination of products rarely gets above the level of noise.

Comments closed

Preparing Text Data For Natural Language Processing

Shirin Glander takes us through the process of preparing natural language data for machine learning using Keras:

As with any neural network, we need to convert our data into a numeric format; in Keras and TensorFlow we work with tensors. The IMDB example data from the keras package has been preprocessed to a list of integers, where every integer corresponds to a word arranged by descending word frequency.

So, how do we make it from raw text to such a list of integers? Luckily, Keras offers a few convenience functions that make our lives much easier.

This is a very nice tutorial if you’re new to the process.

Comments closed

UTF-8 And R

Sebastian Sauer gives us a brief overview of UTF-8 support in R and other relevant tools (like Excel):

That seems to work easily. Maybe that’s the easiest way at the end of the day (?).

One problem that may arise – besides building on proprietary code that may change without notice – is that Excel may have problems reading a UTF8 csv, as explained here.

Read on for more info on what has become the de facto web standard for text.

Comments closed

Analytical Pipelines In R With H2O And AWS

Hanjo Oden wraps up a series on training models on AWS using H2O in R:

To generate these, you can log into your AWS dashboard, go to the IAM (Identity and Access Management) dashboard and select the Users tab. On the Userstab, add a user and also the administration rights that you want the user to have.Remember to restart R once you have filled in the access key information in the .Renviron file for it to take effect.

At this point, those familiar with cloudyr suite is probably asking – “This is exactly the same as library(aws.ec2), so why use boto3?“. Well, to be honest, I was using aws.ec2 for a while, but I find spot-instances, which the current version of aws.ec2 does not support. In addition I found that boto3 has some other functionalitue – which I prefer. For a full list of boto3 functions to interact with an EC2 instance, have a look at the reference manual.

It’s pretty good stuff; check it out.

Comments closed

Genetic Algorithms In R

Pablo Casas touches on one of my favorite lost causes:

In machine learning, one of the uses of genetic algorithms is to pick up the right number of variables in order to create a predictive model.

To pick up the right subset of variables is a problem of combinatory logic and optimization.

The advantage of this technique over others is that it allows the best solution to emerge from the best of the prior solutions. An evolutionary algorithm which improves the selection over time.

The idea of GA is to combine the different solutions generation after generation to extract the best genes (variables) from each one. That way it creates new and more fit individuals.

We can find other uses of GA such as hyper-tunning parameters, finding the maximum (or minimum) of a function, or searching for the correct neural network architecture (neuroevolution), among others.

I’ve seen a few people use genetic algorithms in the past decade, but usually for hyperparameter tuning rather than as a primary algorithm. It was always the “algorithm of last resort” even before neural networks took over the industry, but if you want to spend way too much time on the topic, I have a series. If you have too much time on your hands and meet me in person, ask about my thesis.

Comments closed

Running RStudio Server In Azure

David Smith notes that RStudio Server Pro is now available on Azure:

RStudio Server Pro is now available on the Azure Marketplace, the company announced on the RStudio Blog earlier this month. This means you can launch RStudio Server Pro on an virtual machine with the memory, disk, and CPU configuration of your choice, and pay by the minute for the VM instance plus a the RStudio software charge. Then, you can use a browser to access the remote RStudio Server (the interface is nigh-indistinguishable from the desktop version), with access to the commercial features of RStudio including support for multiple R version and concurrent R sessions, load-balancing and high availability instances, and enhanced security.

RStudio Server Pro and Microsoft R Server are both very nice for production-quality R servers. You can get away with the open source versions, but there are some good reasons to use the enterprise-grade products in an enterprise.

Comments closed

Using Plotly In Power BI

Kara Annanie shows how you can R integration in Power BI to push Plotly visuals to users:

In the example, above, we’ve created a line chart visualization using Plotly and we’ve decided to put labels on the graph, but only on the first and last points of the line graph. This graph would be particularly useful to show 13 months of data overtime, where the left-most label shows January of last year, for example, and the right-most label shows January of this year, for example. The user could still view the trend across the year between both January data points.

Click through for a pair of videos and some notes on how to get started.

Comments closed

Inline Operators In R With wrapr

John Mount shows how to use inline operators in R with the wrapr package:

The above code is assuming you have the wrapr package attached via already having run library('wrapr').

Notice we picked R-related operator names. We stayed away from overloading the + operator, as the arithmetic operators are somewhat special in how they dispatch in R. The goal wasn’t to make R more like Python, but to adapt a good idea from Python to improve R.

Also, it’s a little late to pick up the discount (though Manning has discounts pretty much every day so be patient and you’ll find 40+ percent off) but check out the second edition of Practical Data Science with R by Mount and Nina Zumel. I’ve held off on reading it so far because I want to wait until it’s closer to completion, but it is on my to-read list.

Comments closed