
Category: R

Getting Started with data.table

Gary Hutson has a primer on data.table:

This example takes the copy of the data frame we made and groups it by organisation code and attendance type. I then want to summarise the mean admissions by type and organisation code.

Pivots can be implemented in data.table in the following way:

I’ve never been the biggest fan of the syntax for data.table, but the performance is unquestionably there, and that makes it worth learning. H/T R-bloggers.
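For a sense of what the grouped summary and the pivot look like, here is a minimal sketch. It is not the code from the linked post: the toy table and its column names (org_code, type, admissions) are assumptions made up for illustration.

```r
library(data.table)

# Toy data standing in for the post's data set; names and values are hypothetical.
ae <- data.table(
  org_code   = c("A", "A", "A", "B", "B", "B"),
  type       = c("1", "1", "2", "1", "2", "2"),
  admissions = c(10, 20, 5, 30, 7, 9)
)

# Grouped summary: mean admissions by organisation code and attendance type.
mean_adm <- ae[, .(mean_admissions = mean(admissions)), by = .(org_code, type)]

# Pivot to wide form: one row per organisation, one column per type.
wide <- dcast(mean_adm, org_code ~ type, value.var = "mean_admissions")
print(wide)
```

The `DT[i, j, by]` form does the aggregation, and `dcast()` reshapes the long result into a wide table.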


Spring Cleaning Shiny Projects

Mirai Solutions has some tips on cleaning up Shiny apps:

How to apply the spring cleaning principles and advanced programming to your Shiny App.

1. Deep breath and allocate some time

Do not avoid spring cleaning simply because you don’t know where to start. Set aside some time for the task and get inspired by the points that follow.

Click through for advice on tools and processes to make this code easier to understand. H/T R-bloggers.


Including and Resizing External Images in knitr

The folks at Jumping Rivers continue a series on knitr and rmarkdown:

In this third post, we’ll look at including external images, such as figures and logos, in HTML documents. This is relevant for all R markdown files, including fancy things like {bookdown}, {distill} and {pkgdown}. The main difference with the images discussed in this post is that the image isn’t generated by R. Instead, we’re thinking of something like a photograph. When including an image in your web-page, the two key points are

– What size is your image?
– What’s the size of your HTML/CSS container on your web-page?

Read the whole thing.


Tidying the Confusion Matrix in R

Gary Hutson has a new package for us:

The package’s aim is to make it easier to convert the list outputs from caret and collapse them into row-by-row entries, specifically designed for storing the outputs in a database or row-by-row data frame.

This is something that the CARET library does not have as a default and I have designed this to allow the confusion matrix outputs to be stored in a data frame or database, as many a time we want to track the ML outputs and fits over time to monitor feature slippage and changes in the underlying patterns of the data.

I like the way caret shows the confusion matrix when I’m reviewing results on my own, but I definitely appreciate efforts to make it easier to handle within code, similar to how broom reads linear regression outputs. H/T R-bloggers.
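To illustrate the row-by-row idea, here is a rough sketch in base R only. It uses neither caret nor the package's own functions, and the labels, column names, and run_date field are assumptions chosen for the example.

```r
# Hypothetical actual and predicted class labels.
actual    <- factor(c("yes", "yes", "no", "no", "yes", "no"), levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no", "no", "yes", "yes"), levels = c("no", "yes"))

# Cross-tabulate predictions against actuals.
cm <- table(predicted = predicted, actual = actual)

# Flatten the table to long form, then to a single named row.
long <- as.data.frame(cm, stringsAsFactors = FALSE)
cell_names <- paste0("pred_", long$predicted, "_actual_", long$actual)
cm_row <- as.data.frame(t(setNames(long$Freq, cell_names)))

# Add a summary metric and a date stamp so rows can be appended over time.
cm_row$accuracy <- sum(diag(cm)) / sum(cm)
cm_row$run_date <- Sys.Date()
print(cm_row)
```

Each model run then becomes one row that can be appended to a data frame or written to a database table for tracking over time.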


Research with R and Production with Python

Matt Dancho and Jarrell Chalmers lay out an argument:

The decision can be challenging because both Python and R have clear strengths.

R is exceptional for Research – Making visualizations, telling the story, producing reports, and making MVP apps with Shiny. From concept (idea) to execution (code), R users tend to be able to accomplish these tasks 3X to 5X faster than Python users, making them very productive for research.

Python is exceptional for Production ML – Integrating machine learning models into production systems where your IT infrastructure relies on automation tools like Airflow or Luigi.

They make a pretty solid argument. I’ve launched successful R-based projects using SQL Server Machine Learning Services, but outside of ML Services, my team’s much more likely to deploy APIs in Python, and we’re split between Dash and Shiny for visualization. H/T R-bloggers.


Non-Equi Joins in R

David Selby walks us through non-trivial join scenarios in R:

Most joins are equi-joins, matching rows according to two columns having exactly equal values. These are easy to perform in R using the base merge() function, the various join() functions in dplyr and the X[i] syntax of data.table.

But sometimes we need non-equi joins or θ-joins, where the matching condition is an interval or a set of inequalities. Other situations call for a rolling join, used to link records according to their proximity in a time sequence.

How do you perform non-equi joins and rolling joins in R?

Click through for the answer using dplyr, sqldf, and data.table. H/T R-bloggers.
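As a small taste of the data.table route, the sketch below shows one non-equi join and one rolling join. The tables and column names are invented for illustration and are not taken from the linked post.

```r
library(data.table)

# Hypothetical events and the value bands we want to match them to.
events <- data.table(id = 1:4, value = c(5, 15, 25, 40))
bands  <- data.table(band  = c("low", "mid", "high"),
                     lower = c(0, 10, 20),
                     upper = c(10, 20, 30))

# Non-equi join: keep the band whose [lower, upper) interval contains value.
# Events with no matching band get NA (the default nomatch behaviour).
matched <- bands[events, on = .(lower <= value, upper > value),
                 .(id, value = i.value, band)]

# Rolling join: for each query time, carry the most recent reading forward.
readings <- data.table(time = c(1, 5, 10), reading = c("a", "b", "c"))
queries  <- data.table(time = c(2, 7, 12))
rolled   <- readings[queries, on = "time", roll = TRUE]

print(matched)
print(rolled)
```

In both cases the table inside the brackets (`events`, `queries`) drives the result, so the output has one row per lookup row.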


Polychoric Correlation in Practice

Jack Davis explains the concept of polychoric correlation:

In polychoric correlation, we don’t need to know or specify where the boundary between “good” and “very good” is, just that it exists. The distribution of the ordinal responses, along with the assumption that the latent values follow a normal distribution, is enough that the polychor() function in the polycor R package can do that for us. In most practical cases, you don’t even need to know where the cutoffs are, but they are useful for demonstration that the method works.

Polychoric correlation estimates the correlation between such latent variables as if you actually knew what those values were. In the examples given, we start with the latent variables and use cutoffs to set them into bins, and then use polychoric on the artificially binned data. In any practical use case, the latent data would be invisible to you, and the cutoffs would be determined by whoever designed the survey.

Read on for a demonstration of the process in R.
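The procedure described above (simulate latent values, bin them with cutoffs, then estimate the correlation from the bins) can be sketched as follows. This is an assumption-laden toy version, not the linked demonstration: the sample size, the true correlation of 0.6, the cutoffs, and the labels are all arbitrary choices.

```r
library(polycor)

set.seed(42)
n   <- 2000
rho <- 0.6  # assumed true latent correlation for this simulation

# Latent bivariate normal values with correlation rho.
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Bin the latent values into ordinal responses using arbitrary cutoffs.
cuts       <- c(-Inf, -1, 0, 1, Inf)
bin_labels <- c("poor", "fair", "good", "very good")
x <- cut(z1, breaks = cuts, labels = bin_labels, ordered_result = TRUE)
y <- cut(z2, breaks = cuts, labels = bin_labels, ordered_result = TRUE)

# Polychoric estimate from the binned data alone.
rho_hat <- polychor(x, y)

# For comparison: Pearson correlation on the integer bin codes.
pearson_binned <- cor(as.integer(x), as.integer(y))

print(c(polychoric = rho_hat, pearson_on_bins = pearson_binned))
```

The polychoric estimate should land close to the latent correlation, while Pearson on the bin codes is typically pulled toward zero by the coarse binning.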


Using the Pipe Operator in ggplot2

Tomaz Kastrun reduces the number of pipe-like operators:

Using the pipe %>% to chain commands is a very intuitive way to build a sequence of data preparation steps. The visualization library ggplot2, by contrast, uses the “+” (plus) sign to do all the chaining. What if we had to replace it with the pipe sign?

ggplot2 uses + because it was developed before magrittr took over the piping world in R, so there wasn’t a “normal” pipe. I had been hopeful that ggvis would take over, as it does use the %>% pipe, but that project has gone dormant.
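To make the contrast concrete, here is a minimal sketch using the built-in mtcars data. The second form calls `+` as an ordinary function through the pipe; it is one possible approach and not necessarily the one taken in the linked post.

```r
library(ggplot2)
library(magrittr)

# Usual style: pipe for data preparation, then + to add layers.
p_plus <- mtcars %>%
  subset(cyl > 4) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()

# Pipe-only style: the layer is added by calling `+` as a function.
p_pipe <- mtcars %>%
  subset(cyl > 4) %>%
  ggplot(aes(x = wt, y = mpg)) %>%
  `+`(geom_point())
```

Both objects describe the same scatter plot; the pipe-only version simply trades the infix `+` for an explicit function call.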


Image Sizing in RMarkdown Documents

The Jumping Rivers team shares some insight on image creation:

In this series of posts we’ll consider the (simple?) task of generating and including figures for the web using R & {knitr}. Originally this was going to be a single post, but as the length increased, we decided to split it into separate articles. The four posts we intend to cover are

– setting the image size (this post)
– selecting the image type, PNG vs JPEG vs SVG
– including non-generated files in a document
– setting global {knitr} options.

Read on for the first post in the series.


Plotting Multiple Plots in R using map and ggplot

Sebastian Sauer gives us a quick solution to plotting one graph per variable:

Say we have a data frame where we would like to plot each numeric variable’s distribution.

There are a number of good solutions out there such as this one, or here, or here.

When I read this, my first thought was along the lines of, “Why not use facets or something like cowplot?” But then it clicked that this is per-variable plotting, whereas faceting requires that you choose a variable and see the plots based on that variable’s distinct values.
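A minimal sketch of the per-variable idea, assuming purrr and ggplot2 and using the built-in mtcars data rather than the data from the linked post:

```r
library(ggplot2)
library(purrr)

# Names of the numeric columns we want one plot each for.
num_vars <- names(mtcars)[map_lgl(mtcars, is.numeric)]

# Build one histogram per numeric variable.
plots <- map(num_vars, function(v) {
  ggplot(mtcars, aes(x = .data[[v]])) +
    geom_histogram(bins = 10) +
    labs(x = v)
})
names(plots) <- num_vars

# Display a single plot by name, for example: plots[["mpg"]]
```

The result is a named list of separate plot objects, one per column, which is what distinguishes this from faceting on the values of a single variable.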
