Press "Enter" to skip to content

Running DoAzureParallel On The Cheap

David Smith reports an update on the doAzureParallel R package:

At the EARL conference in San Francisco this week, JS Tan from Microsoft gave an update (PDF slides here) on the doAzureParallel package. As we’ve noted here before, this package allows you to easily distribute parallel R computations to an Azure cluster. The package was recently updated to support automatically-scaling Azure Batch clusters with low-priority nodes, which can be used at a discount of up to 80% compared to the price of regular high-availability VMs.

That lowers the barrier to usage significantly, so it’s a very welcome update.
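
For a sense of the workflow, here is a minimal sketch of how a low-priority cluster gets used (node counts and file names here are illustrative, not from the talk):

library(doAzureParallel)
library(foreach)
generateClusterConfig("cluster.json")  # writes a template cluster spec to edit
# In cluster.json, low-priority capacity is set under poolSize, e.g.:
#   "dedicatedNodes":   { "min": 0, "max": 0 },
#   "lowPriorityNodes": { "min": 2, "max": 10 }
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)
results <- foreach(i = 1:100) %dopar% sqrt(i)  # runs across the Batch pool
stopCluster(cluster)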

Dot-Density Maps In R

Paul Campbell shows how to build a dot density map in R:

To get me started I invested in the expert guidance of data-visualiser-extraordinaire Nathan Yau, aka Flowing Data. Nathan has a whole host of tutorials on how to make really great visualisations in R (including a brand new course focused on mapping) and thankfully one of them deals with how to plot dot density using base R.

Now with a better understanding of the task at hand, I needed to find the required ethnicity data and shapefiles. I recently saw a video of Amelia McNamara’s great talk at the OpenVis Conference titled ‘How spatial polygons shape our world’. The .shp file really is a glorious thing and it seems that the spatial polygon makers are the unsung heroes of the datavis world, so a big round of applause for all those guys is in order.

Anyway, I digress. Luckily for me, the good folks over at the London DataStore have a vast array of Shapefiles that go from Borough level all the way down to Super Output Area level. I’m going to use the Output Areas as the boundaries for the dots and the much broader Borough boundaries for plotting area labels and borders.

The ethnic group data itself was sourced from the Nomis website which has a handy 2011 Census table finder tool where you can easily download an Ethnic Group csv file for London output areas. Let’s go.

I’m going to give this a second reading; it’s a great example of how to go from functional to beautiful.  H/T David Smith
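
As a rough sketch of the core technique (using the sf package rather than Paul’s exact base-R approach; file and column names below are hypothetical):

library(sf)
oa  <- st_read("london_output_areas.shp")   # output area polygons
eth <- read.csv("ethnic_group_oa.csv")      # population counts per output area
oa  <- merge(oa, eth, by = "OA11CD")
# One dot per 100 people, placed uniformly at random within each polygon
dots <- st_sample(oa, size = round(oa$group_count / 100))
plot(st_geometry(oa), border = "grey80")
plot(dots, add = TRUE, pch = 20, cex = 0.1)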

Riddler Nation: Game Theory In Action

Curtis Miller goes over a multi-phase distribution game with no known information:

The winning strategy of the last round, submitted by Vince Vatter, was (0, 1, 2, 16, 21, 3, 2, 1, 32, 22), with an official record of 751 wins, 175 losses, and 5 ties. Naturally, the top-performing strategies look similar. This should not be surprising; winning strategies exploit common vulnerabilities among submissions.

I’ve downloaded the submitted strategies for the second round (I already have the first round’s strategies). Let’s load them in and start analyzing them.

This is a great blog post, which looks at using evolutionary algorithms to evolve a winning strategy.
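
A bare-bones sketch of that idea (my own illustration, not Curtis’s code): keep a population of allocations of 100 soldiers across 10 castles, score each against the field, and let mutated copies of the winners replace the losers.

set.seed(42)
n_castles  <- 10
n_soldiers <- 100

random_strategy <- function() rmultinom(1, n_soldiers, rep(1, n_castles))[, 1]

# Points won by strategy a against b; castle i is worth i points, ties split
duel <- function(a, b) {
  pts <- seq_len(n_castles)
  sum(pts[a > b]) + sum(pts[a == b]) / 2
}

# Fraction of the field a strategy beats outright (more than half the points)
fitness <- function(s, field)
  mean(sapply(field, function(o) duel(s, o) > sum(seq_len(n_castles)) / 2))

# Mutation: move a single soldier from one castle to another
mutate_strategy <- function(s) {
  idx  <- which(s > 0)
  from <- idx[sample(length(idx), 1)]
  to   <- sample(n_castles, 1)
  s[from] <- s[from] - 1
  s[to]   <- s[to] + 1
  s
}

pop <- replicate(50, random_strategy(), simplify = FALSE)
for (gen in 1:200) {
  fit  <- sapply(pop, fitness, field = pop)
  keep <- pop[order(fit, decreasing = TRUE)[1:10]]
  pop  <- c(keep, lapply(sample(keep, 40, replace = TRUE), mutate_strategy))
}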

The Multifaceted Nature Of R

John Mount points out that there are many ways to skin a cat in R:

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one– and preferably only one –obvious way to do it.

Frankly, in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions str(), head(), and the tibble package’s glimpse().
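
To see the overlap, run all three on the same object:

library(tibble)
d <- tibble(x = 1:5, y = letters[1:5])
str(d)      # base R: compact one-screen structure dump
head(d)     # base R: the first rows, printed as a tibble
glimpse(d)  # tibble: transposed preview, one line per column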

This is a small example of a large phenomenon.

Sentiment Analysis In R

Stefan Feuerriegel and Nicolas Pröllochs have a new package on CRAN:

Our package “SentimentAnalysis” performs a sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as QDAP or Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable.

I’m not sure how it stacks up to external services, but it’s another option available to us.
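
A quick sketch of the basic API on two toy documents (the function names are from the package itself):

library(SentimentAnalysis)
docs <- c("The results are excellent and robust.",
          "The method performs terribly on real data.")
s <- analyzeSentiment(docs)
s$SentimentQDAP                      # numeric polarity per document
convertToDirection(s$SentimentQDAP)  # positive / neutral / negative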

Summary Improvements In R

John Mount points out a nice quasi-bugfix in R 3.4.0:

In older versions of R (say R 3.3.1), the code below gave the following undesirable result:

summary(15555)

#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   15560   15560   15560   15560   15560   15560 

This was always very confusing and hard to explain to beginners. To justify this you had to explain that “R, by default, calculates the summary rounded to 4 significant digits, and is simultaneously configured to give absolutely no indication as to how many significant digits are in fact being displayed.” To add insult to injury, summary() picked a different number of sigfigs than the default numeric presentation. One could type “median(15555)” and get the expected presentation “15555“.

I like this change.
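
For reference, R 3.4.0 now prints the unrounded value, and on older versions you could force the display yourself via the digits argument:

summary(15555)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   15555   15555   15555   15555   15555   15555
summary(15555, digits = 7)  # workaround on R <= 3.3.x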

Fresh R Installation On Linux

Marcelo Perlin has a script to install R on Linux:

Since I formatted all my three computers (home/laptop/work), I wrote a small bash file to automate the process of installing R and its dependencies. I use lots of R packages on a daily basis. For some of them, it is required to install dependencies using the terminal. Each time an install.packages() call failed, I saved the name of the required software and added it to the bash file. While my bash file will not cover all dependencies for all packages, it will suffice for a great proportion.

Another option might be to grab a Docker image.
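
In the same spirit (this is my own illustration, not Marcelo’s script), you can batch the R side and let the failures tell you which system libraries are missing:

pkgs <- c("xml2", "curl", "openssl", "rgdal")
install.packages(setdiff(pkgs, rownames(installed.packages())))
failed <- setdiff(pkgs, rownames(installed.packages()))
if (length(failed))
  message("Check system dependencies for: ", paste(failed, collapse = ", "))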

Multi-Channel Attribution With R

Sergey Bryl walks through some of the difficulties of the multi-channel attribution solution he came up with before:

The main steps that we will review are the following:

  • splitting paths depending on purchase counts

  • replacing some channels/touch points

  • a unique channel/touchpoint case

  • consecutive duplicated channels in the path and higher order Markov chains

  • paths that haven’t led to a conversion

  • customer journey duration

  • attributing revenue and comparing costs

There’s a lot there, and I like the practical explanations of issues when dealing with a real business problem.
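
One common tool for this kind of work in R is the ChannelAttribution package; a toy run of its Markov model might look like this (data made up, and not necessarily Sergey’s exact approach):

library(ChannelAttribution)
paths <- data.frame(
  path      = c("facebook > email", "search > search > email", "search"),
  conv      = c(1, 1, 0),   # paths ending in a conversion
  conv_null = c(0, 0, 1)    # paths that did not convert
)
markov_model(paths, var_path = "path", var_conv = "conv",
             var_null = "conv_null", order = 1)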

Using sparklyr

Hossein Falaki and Xiangrui Meng show how to use sparklyr on a Databricks Spark cluster:

We collaborated with our friends at RStudio to enable sparklyr to seamlessly work in Databricks clusters. Starting with sparklyr version 0.6, there is a new connection method in sparklyr: databricks. When calling spark_connect(method = "databricks") in a Databricks R Notebook, sparklyr will connect to the Spark cluster of that notebook. As this cluster is fully managed, you do not need to specify any other information such as version, SPARK_HOME, etc.

I’d lean toward sparklyr over SparkR because of the former’s tidyverse-centric view.
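
A minimal usage sketch (this assumes you are inside a Databricks R notebook, per the quote above):

library(sparklyr)
library(dplyr)
sc <- spark_connect(method = "databricks")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()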

Versioning R Code In SQL Server

Steph Locke shows how to combine R models and SQL Server temporal tables for versioning:

If we’re storing our R model objects in SQL Server then we can utilise another SQL Server capability, temporal tables, to take the pain out of versioning and make it super simple.

Temporal tables will track changes automatically so you would overwrite the previous model with the new one and it would keep a copy of the old one automagically in a history table. You get to always use the latest version via the main table but you can then write temporal queries to extract any version of the model that’s ever been implemented. Super neat!

I do exactly this.  In my case, it’s to give me the ability to review those models after the fact once I know whether they generated good outcomes or not.
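
The R side of that pattern can be as simple as serializing the model and updating one row; the DSN, table, and column names below are assumptions, and the temporal-table setup itself (SYSTEM_VERSIONING = ON) is T-SQL on the server side:

library(DBI)
con <- dbConnect(odbc::odbc(), dsn = "ModelStore")
model   <- lm(mpg ~ wt, data = mtcars)
payload <- serialize(model, connection = NULL)  # raw vector -> varbinary(max)
dbExecute(con,
  "UPDATE dbo.Models SET ModelObject = ? WHERE ModelName = ?",
  params = list(list(payload), "mpg_by_weight"))
# SQL Server's history table retains the prior model version automatically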
