Icon Maps in R

Laura Ellis shows how you can build maps full of little icons:

That was ok, but we should try to make the images more aesthetically pleasing using the magick package. We make each image transparent with the image_transparent() function. We can also make the resulting image a specific color with image_colorize().

I then saved the images using the image_write() function. I manually re-uploaded them to GH.

This was a great example of where laying icons on a map works.

R User Salaries By Country

Capri Granville shares a chart showing a box plot of salaries for professional R users by country:

Interesting analysis done in R, about salaries of R developers broken down by country, featuring salary range and median salary. 

The dataset consists of survey answers from nearly 90,000 respondents. About 5,000 of them reported using R for “extensive development work over the past year”. The first filter used reduces the dataset from 88,883 respondents to 5,048. The second filter excludes students, hobby programmers and former developers. This reduces the dataset to 4,047 respondents. The third filter excludes unemployed and retired respondents and the dataset is further reduced to 3,871 respondents. Finally, we exclude respondents from an unknown country and respondents with unknown or zero salary.

Check out Tomaz Weiss’s detailed post which dives into these numbers for United States respondents.

Python and R Data Reshaping

John Mount takes us through a couple of data shaping packages:

The advantages of data_algebra and cdata are:

– The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
– The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
– The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Let’s re-work a small R cdata example, using the Python package data_algebra.

Click through for the example.

Exploratory Data Analysis with ExPanDaR

Joachim Gassen walks us through the ExPanDaR package in R:

The ‘ExPanDaR’ package offers a toolbox for interactive exploratory data analysis (EDA). You can read more about it here. The ‘ExPanD’ shiny app allows you to customize your analysis to some extent but often you might want to continue and extend your analysis with additional models and visualizations that are not part of the ‘ExPanDaR’ package.

Thus, I am currently developing an option to export the ‘ExPanD’ data and analysis to an R Notebook. While it is not ready for CRAN yet, it seems to work reasonably well and I would love to see some people trying it and letting me know about any bugs or other issues that they encounter. Hence, this blog post.

Looks like an interesting package. H/T R-bloggers

Calculating Consistency of Ratings

Sebastian Sauer looks at computing reliability between raters:

Computing inter-rater reliability is a well-known, albeit maybe not very frequent task in data analysis. If there’s only one criteria and two raters, the proceeding is straigt forward; Cohen’s Kappa is the most widely used coefficient for that purpose. It is more challenging to compare multiple raters on one criterion; Fleiss’ Kappa is one way to get a coefficient. If there are multiple criteria, one way is to compute the mean of multiple Fleiss’ coefficients.

However, a different way, and the way presented in this post, consists of checking of all raters agree on one given item (and repeating that for all items). If rater A assigns two tags/criteria (tag1, tag2) to item A, then the other raters may not assign different tags (eg tag3, tag4) to that item, if a match should be scored. Note that this proceeding allows for different numbers of tags/criteria for the items (eg., item 1 with only 1 tag, but item 2 with 3 tags etc.). However, our grading should give some points, if, say, rater1 assigns tag1 and tag2, but raters 2 and 3 only assign tag1.

Read the whole thing.

Update to ggraph

Thomas Lin Pedersen has an update to ggraph:

If you are new to ggraph, a short description follows: It is an extension of ggplot2 that implement an extended grammar for relational data (e.g. trees and networks). It provides a huge variety of geoms for drawing nodes and edges, along with an assortment of layouts making it possible to produce a very wide range of network visualization types. It is to my knowledge the most feature packed network visualization framework available in R (and potentially in other languages as well), all building on top of the familiar ggplot2 API. If you want to learn more I invite you to browse the new pkgdown website that has been made available.

It looks really nice.

Aggregating in R

Misha at Data Sharkie takes us through the aggregate function in R:

The aggregate() function is already built into R so we don’t need to install any additional packages.

The very brief theoretical explanation of the function is the following:

aggregate(data, by= , FUN= )

Read on to see how this works as well as examples.

From Excel to R: Three Examples

Kevin Feasel



Abdul Majed Raja has a few examples of things which are easy to do in Excel and how you can do them in R:

Create a difference variable between the current value and the next value
This is also known as lead and lag – especially in a time series dataset this varaible becomes very important in feature engineering. In Excel, This is simply done by creating a new formula field and subtracting the next cell with the current cell or the current cell with the previous cell and dragging the cell formula to the last cell.

These things aren’t as easy to do as in Excel—it’s hard to get simpler than “push a button” or “click and drag your mouse”—but they are useful to know in R. H/T R-bloggers

Calculating AUC in R

Andrew Treadway shows how you can calculate Area Under the Curve in R:

AUC is an important metric in machine learning for classification. It is often used as a measure of a model’s performance. In effect, AUC is a measure between 0 and 1 of a model’s performance that rank-orders predictions from a model. For a detailed explanation of AUC, see this link.

Since AUC is widely used, being able to get a confidence interval around this metric is valuable to both better demonstrate a model’s performance, as well as to better compare two or more models. For example, if model A has an AUC higher than model B, but the 95% confidence interval around each AUC value overlaps, then the models may not be statistically different in performance. We can get a confidence interval around AUC using R’s pROC package, which uses bootstrapping to calculate the interval.

There are plenty of ways to calculate this useful metric, but this is definitely one of the easier methods. H/T R-bloggers

Python versus R (Again)

Alex Woodie looks at whether Python is dominating R in the data science space:

There is some evidence that Python’s popularity is hurting R usage. According to the TIOBE Index, Python is currently the third most popular language in the world, behind perennial heavyweights Java and C. From August 2018 to August 2019, Python usage surged by more than 3% to achieve a 10% rating (TIOBE’s proprietary metric that primarily measures search activity), easily the biggest gain among the 20 most popular languages.

R, by contrast, has not fared well lately on the TIOBE Index, where it dropped from 8th place in January 2018 to become the 20th most popular language today, behind Perl, Swift, and Go. At its peak in January 2018, R had a popularity rating of about 2.6%. But today it’s down to 0.8%, according to the TIOBE index.

I’ll say that rumors of R’s demise are premature.


September 2019
« Aug