Category: R

Fun With Tibbles

Published 2018-01-15 by Kevin Feasel

Theo Roe provides an introduction to tibbles in R:

Tibbles are a modern take on data frames, but crucially they are still data frames. Well, what’s the difference then? There’s a quote I found somewhere on the internet that describes the difference quite well:

“keeping what time has proven to be effective, and throwing out what is not”.

Basically, some clever people took the classic data.frame(), shook it til the ineffective parts fell out, then added some new, more appropriate features.

I probably don’t do enough with tibbles, but the upside is that in most cases, there’s a smooth transition.
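
A minimal sketch of what that transition looks like in practice, using made-up data rather than anything from the linked post and assuming the tibble package is installed:

```r
library(tibble)

# Made-up data for illustration.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
tb <- as_tibble(df)

# A tibble is still a data frame, so most data.frame code keeps working.
is.data.frame(tb)

# Single-column subsetting with [ drops to a vector for a data.frame...
class(df[, "id"])

# ...but stays a tibble, which makes results more predictable.
class(tb[, "id"])
```

The last two calls show one of the behaviors that tibbles changed: subsetting with `[` always returns another tibble instead of sometimes returning a bare vector.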

Comments closed

Analyzing Data Professional Salary Data

Published 2018-01-15 by Kevin Feasel

Ginger Grant has built a dashboard to analyze data professional salaries:

In the survey for 2018, the people who made the most money were from Hong Kong, with an average salary of $263,289. Before you start planning on moving, you might want to look at the data a little closer. There were 2 people who responded from Hong Kong. One of them said he was making over 1.4 million dollars, the highest amount reported in the survey. With only two responses from Hong Kong, we cannot draw a definitive conclusion. To be able to answer that question, more analysis will need to be done on the location and salary information, and you will probably want to add market basket criteria, because a dollar in Hong Kong, where the average apartment rental is $3,237 a month, doesn’t go as far as it does in, say, Uganda, where the rent is around $187 a month.

Click through to see the final product and grab a copy of her dashboard.
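
The small-sample caveat is easy to check for yourself. Here is a rough sketch in base R, assuming a hypothetical table of responses (the countries and salaries below are invented, not taken from the survey) and an arbitrary minimum group size of 5. Only the two rent figures come from the quote above.

```r
# Invented responses for illustration; not survey data.
responses <- data.frame(
  country = c("A", "A", "B", "B", "B", "B", "B"),
  salary  = c(60000, 1400000, 70000, 80000, 90000, 85000, 75000)
)

# Count and average per country, so tiny groups are visible next to their means.
n_by_country    <- tapply(responses$salary, responses$country, length)
mean_by_country <- tapply(responses$salary, responses$country, mean)

summary_tbl <- data.frame(
  country     = names(n_by_country),
  n           = as.vector(n_by_country),
  mean_salary = as.vector(mean_by_country)
)

# Keep only countries with enough responses (threshold is an assumption).
min_responses <- 5
reliable <- summary_tbl[summary_tbl$n >= min_responses, ]

# Ratio of the two monthly rents mentioned in the quote.
rent_ratio <- 3237 / 187
```

Country A tops the raw averages on the strength of a single outlier, and it drops out as soon as a minimum response count is applied.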


Using rquery To Speed Up Data Manipulations

Published 2018-01-12 by Kevin Feasel

John Mount shows off some rquery benchmarks versus dplyr and data.table:

Let’s take a look at rquery’s new “ad hoc” mode (made convenient through wrapr’s new “wrapr_applicable” feature). This is where rquery works on in-memory data.frame data by sending it to a database, processing on the database, and then pulling the data back. We concede this is a strange way to process data, and not rquery’s primary purpose (the primary purpose being generation of safe high performance SQL for big data engines such as Spark and PostgreSQL). However, our experiments show that it is in fact a competitive technique.

We’ve summarized the results of several experiments (experiment details here) in the following graph (graphing code here). The benchmark task was hand implementing logistic regression scoring. This is an example query we have been using for some time.

There are some nice early results, so it’ll be interesting to watch as this develops.
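
For a sense of what "hand implementing logistic regression scoring" means, here is a small base R sketch. It is not the benchmark query from the linked post and uses no rquery calls. The coefficients and data are hypothetical, and the actual benchmark may do more than this.

```r
# Hypothetical model coefficients and input rows, for illustration only.
coefs <- c(intercept = -1.5, x1 = 0.8, x2 = -0.4)
d <- data.frame(x1 = c(0, 1, 2), x2 = c(1, 0, 3))

# Linear predictor: intercept plus the weighted sum of the features.
link <- coefs[["intercept"]] + coefs[["x1"]] * d$x1 + coefs[["x2"]] * d$x2

# Logistic (sigmoid) transform turns the linear predictor into a probability.
d$probability <- 1 / (1 + exp(-link))
```

The same arithmetic can be expressed as column expressions in SQL, which is what makes it a reasonable task for comparing in-memory and database-backed tools.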


Tidytext 0.1.6

Published 2018-01-12 by Kevin Feasel

Julia Silge announces a new version of tidytext:

I am pleased to announce that tidytext 0.1.6 is now on CRAN!

Most of this release, as well as the 0.1.5 release which I did not blog about, was for maintenance, updates to align with API changes from tidytext’s dependencies, and bugs. I just spent a good chunk of effort getting tidytext to pass R CMD check on older versions of R despite the fact that some of the packages in tidytext’s Suggests require recent versions of R. FUN TIMES. I was glad to get it working, though, because I know that we have users, some teaching on university campuses, etc, who are constrained to older versions of R in various environments.

There are some more interesting updates. For example, did you know about the new-ish stopwords package? This package provides access to stopword lists from multiple sources in multiple languages. If you would like to access these in a list data structure, go to the original package. But if you like your text tidy, I GOT YOU.

Read on for examples and grab the latest version.
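
As a quick sketch of the tidy stopword workflow, assuming the tidytext, dplyr, and stopwords packages are installed (the sample sentence is made up):

```r
library(dplyr)
library(tidytext)

# Stopwords as a tidy data frame rather than a plain character vector.
stop_words_en <- get_stopwords(language = "en", source = "snowball")

# Made-up text for illustration.
docs <- tibble(text = "The quick brown fox jumps over the lazy dog")

# One word per row, lowercased by default, then drop the stopwords.
tidy_words <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words_en, by = "word")
```

Because the stopwords arrive as a data frame with a word column, removing them is an ordinary anti-join.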


Geocoding With OpenStreetMap

Published 2018-01-10 by Kevin Feasel

Dmitry Kisler shows how to geocode addresses in R using the OpenStreetMap API:

When scraping data from the web, you will quite likely get address info but not geo-coordinates, which may be required for further analysis such as clustering. Thus geocoding is often needed to get a location’s coordinates from its address.

There are several options, including one of the most popular, the Google Geocoding API. This option can be easily used from R with the function geocode from the library ggmap. It has a limit of 2,500 requests a day (when it’s used free of charge); see details here.

To increase the number of free-of-charge geocoding requests, the OpenStreetMap (OSM) Nominatim API can be used. OSM allows up to 1 request per second (see the usage policy), which gives about 35 times more API calls per day than the Google Geocoding API.

Click through for the script.
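
The linked script is the one to use. What follows is only a rough sketch of the idea, assuming the httr and jsonlite packages are installed. The helper names and the user-agent string are hypothetical, and the one-second pause is there to stay within the rate limit quoted above.

```r
# Hypothetical helper: look up one address via the public Nominatim search endpoint.
geocode_osm <- function(address, user_agent = "example-geocoder (you@example.com)") {
  resp <- httr::GET(
    "https://nominatim.openstreetmap.org/search",
    query = list(q = address, format = "json", limit = 1),
    httr::user_agent(user_agent)  # placeholder; identify your own application
  )
  httr::stop_for_status(resp)
  hits <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
  if (length(hits) == 0) {
    # No match: return NA coordinates so every address still yields one row.
    return(data.frame(address = address, lat = NA_real_, lon = NA_real_))
  }
  data.frame(address = address,
             lat = as.numeric(hits$lat[1]),
             lon = as.numeric(hits$lon[1]))
}

# Hypothetical wrapper: geocode several addresses, pausing between requests.
geocode_many <- function(addresses, pause_seconds = 1) {
  results <- lapply(addresses, function(a) {
    Sys.sleep(pause_seconds)
    geocode_osm(a)
  })
  do.call(rbind, results)
}

# Daily request budgets implied by the quoted limits.
osm_per_day    <- 24 * 60 * 60  # 1 request per second
google_per_day <- 2500
osm_per_day / google_per_day    # about 34.6, i.e. the "about 35 times" figure
```

A call such as geocode_many(c("address one", "address two")) would return one row per address. Nothing here touches the network until one of the functions is actually called.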


Connecting R To Google Sheets

Published 2018-01-05 by Kevin Feasel

Rob Grant shows how to connect to Google Sheets with R:

That was a quick overview of the most basic functions of the googlesheets package.

This is a really useful package. A lot of my work involves reading data in Google Sheets either before or after using R.

Googlesheets means I won’t have to bother with read.csv() or write.csv() as much in the future, saving me time.

Click through for a good tutorial.


Parallelization With Rcpp

Published 2018-01-05 by Kevin Feasel

Blazej Moska demonstrates how to use Rcpp to parallelize R code:

One of the frustrating moments while working with data is when you need results urgently, but your dataset is large enough to make that impossible. This often happens when we need to use an algorithm with high computational complexity. I will demonstrate it with an example I’ve been working on.

Suppose we have a large dataset consisting of association rules. For some reason we want to slim it down. Whenever two rules’ consequents are the same and one rule’s antecedent is a subset of the second rule’s antecedent, we want to choose the smaller one (the probability of obtaining the smaller set is bigger than the probability of obtaining the bigger set).

Read the whole thing.
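
To make the pruning rule concrete, here is a plain R sketch of the pairwise comparison. It is not the Rcpp code from the linked post. The rule list and function name are hypothetical, and antecedents are assumed to contain no duplicate items.

```r
# Hypothetical rules: each has an antecedent (item set) and a consequent.
rules <- list(
  list(antecedent = c("a"),      consequent = "x"),
  list(antecedent = c("a", "b"), consequent = "x"),
  list(antecedent = c("b", "c"), consequent = "y"),
  list(antecedent = c("a", "b"), consequent = "y")
)

# Rule i is redundant if another rule has the same consequent and an
# antecedent that is a proper subset of rule i's antecedent.
is_redundant <- function(i, rules) {
  a_i <- rules[[i]]$antecedent
  for (j in seq_along(rules)) {
    if (j == i) next
    a_j <- rules[[j]]$antecedent
    same_consequent <- identical(rules[[i]]$consequent, rules[[j]]$consequent)
    proper_subset   <- all(a_j %in% a_i) && length(a_j) < length(a_i)
    if (same_consequent && proper_subset) return(TRUE)
  }
  FALSE
}

keep   <- !vapply(seq_along(rules), is_redundant, logical(1), rules = rules)
pruned <- rules[keep]
```

Every rule is compared against every other rule, so the work grows quadratically with the number of rules. That is the kind of cost that motivates moving to compiled, parallel code.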


Plotting Graph Data In R

Published 2018-01-04 by Kevin Feasel

Sifiso Ndlovu shows how to take graph data from SQL Server and plot it in R using Machine Learning Services:

However, with the recent focus on big data for many of my clients, we have experienced an increase in business requests that require many-to-many data modelling. Consequently, as a Microsoft shop, we’ve had to turn to non-Microsoft products to ensure that we respond optimally to such business requests. Not surprisingly, ever since word got around that a graph database would be part of SQL Server 2017, we’ve been looking forward to this latest release of SQL Server. Having played around with the graph database feature in SQL Server 2017, we have noticed that, unlike with other graph database vendors, plotting and visualising the data out of the graph database is not readily available in SQL Server 2017. Luckily, thanks to SQL Server R, you can easily plot and visualise SQL Server 2017 graph database data without turning to 3rd-party plugins. In this article, I demonstrate how SQL Server Machine Learning Services (previously known as SQL Server 2016 R Services) can be used to plot a diagram according to the data defined in a SQL Server 2017 graph database.

The igraph library is a good one; there’s a lot of power in it that this post just introduces.
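
As a small taste of the R side only, here is a sketch assuming the igraph package is installed. The edge list is hypothetical and stands in for rows returned from a graph query. The SQL Server plumbing from the linked article is not shown.

```r
library(igraph)

# Hypothetical edge list: one row per relationship.
edges <- data.frame(
  from = c("Alice", "Alice", "Bob"),
  to   = c("Bob", "Carol", "Carol")
)

# Build a directed graph; vertices are inferred from the edge endpoints.
g <- graph_from_data_frame(edges, directed = TRUE)

# Basic plot with slightly larger nodes and smaller arrowheads.
plot(g, vertex.size = 30, edge.arrow.size = 0.5)
```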


ML Services Can Fill The Plan Cache

Published 2018-01-04 by Kevin Feasel

I have a post talking about a bug in SQL Server:

For now, the workaround I have is to restart the SQL Server service occasionally. You can see that I have done it twice in the above screenshot. Our application is resilient to short database downtimes, so this isn’t a bad workaround for us; it’s just a little bit of an annoyance.

One thing to keep in mind if you are in this scenario is that if you are running ML Services hundreds of thousands of times a day, your ExtensibilityData folders might have a lot of cruft which may prevent the Launchpad service from starting as expected. I’ve had to delete all folders in \MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData\MSSQLSERVER01 after stopping the SQL Server service and before restarting it. The Launchpad service automatically does this, but if you have a huge number of folders in there, the service can time out trying to delete all of them. In my experience at least, the other folders didn’t have enough sub-folders inside to make it worth deleting, but that may just be an artifact of how we use ML Services.

It’s very unlikely to affect most shops, as we only notice it after running sp_execute_external_script millions of times, and that’s pretty abnormal behavior.


Project-Oriented R Development

Published 2018-01-03 by Kevin Feasel

Jenny Bryan explains how building projects in R can reduce the likelihood that someone will come in and set your computer on fire:

I suggest organizing each data analysis into a project: a folder on your computer that holds all the files relevant to that particular piece of work. I’m not assuming this is an RStudio Project, though this is a nice implementation discussed below.

Any resident R script is written assuming that it will be run from a fresh R process with working directory set to the project directory. It creates everything it needs, in its own workspace or folder, and it touches nothing it did not create. For example, it does not install additional packages (another pet peeve of mine).

This convention guarantees that the project can be moved around on your computer or onto other computers and will still “just work”. I argue that this is the only practical convention that creates reliable, polite behavior across different computers or users and over time. This convention is neither new, nor unique to R.

I admit that I’m just now getting into using projects regularly for my one-off stuff. This is very good advice. H/T David Smith
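
A bare-bones sketch of a script that follows this convention, in base R. The folder and file names are hypothetical, and the script assumes it is started from a fresh R process in the project directory.

```r
# Assumes the working directory is already the project directory, so every
# path below is relative and the project can be moved without edits.

out_dir <- "output"  # hypothetical folder name inside the project
if (!dir.exists(out_dir)) dir.create(out_dir)

# Use a built-in dataset so the script needs no extra packages installed.
result <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Write only into the folder this script created.
write.csv(result, file.path(out_dir, "mpg_by_cyl.csv"), row.names = FALSE)
```

The script never changes the working directory and never installs anything, so it behaves the same way wherever the project folder ends up.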
