Press "Enter" to skip to content

Category: R

Dependencies as Risks

John Mount makes the point that package dependencies are inherently a risk:

If your software or research depends on many complex and changing packages, you have no way to establish your work is correct. This is because to establish the correctness of your work, you would need to also establish the correctness of all of the dependencies. This is worse than having non-reproducible research, as your work may have in fact been wrong even the first time.

Low dependencies and low complexity dependencies can also be wrong, but in this case there at least exists the possibility of checking things or running down and fixing issues.

There are some insightful comments on this post as well, so check those out. This is definitely an area where there are trade-offs, so trying to reason through when to move in which direction is important.
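One rough way to quantify this risk, as an illustrative sketch: count how many packages you implicitly trust when you take on a single dependency, using CRAN metadata (ggplot2 here is just an example).

```r
# Count the recursive "strong" dependencies (Depends, Imports, LinkingTo)
# of a package according to the current CRAN metadata.
db   <- available.packages()
deps <- tools::package_dependencies("ggplot2", db = db, recursive = TRUE)
length(deps[["ggplot2"]])  # every one of these is code you implicitly trust
```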


Custom ggplot2 Fonts

Daniel Oehm shares two techniques for using custom fonts in your ggplot2 visuals:

ggplot – You can spot one from a mile away, which is great! And when you do it’s a silent fist bump. But sometimes you want more than the standard theme.

Fonts can breathe new life into your plots, helping to match the theme of your presentation, poster or report. This is always a second thought for me and I need to work out how to do it again, hence the post.

There are two main packages for managing fonts – extrafont and showtext.

Read on to see how to use each of these packages. H/T R-bloggers
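As a minimal sketch of the showtext approach (the font name here is just an example; any Google Fonts family works):

```r
library(ggplot2)
library(showtext)

# Register a Google font under a local family name, then let showtext
# handle all subsequent text rendering.
font_add_google("Lobster", family = "lobster")
showtext_auto()

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(title = "A custom font via showtext") +
  theme(text = element_text(family = "lobster"))
```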


Unit Testing R Code

John Mount points out that you don’t need special infrastructure to perform unit testing in R:

There seems to be a general (false) impression among non R-core developers that to run tests, R package developers need a test management system such as RUnit or testthat. And a further false impression that testthat is the only R test management system. This is in fact not true, as R itself has a capable testing facility in “R CMD check” (a command triggering R checks from outside of any given integrated development environment).

By a combination of skimming the R-manuals ( https://cran.r-project.org/manuals.html ) and running a few experiments I came up with a description of how R-testing actually works. And I have adapted the available tools to fit my current preferred workflow. This may not be your preferred workflow, but I have and give my reasons below.
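As a minimal sketch of the base mechanism Mount describes: any .R script placed in a package's tests/ directory is executed by "R CMD check", and an error in the script fails the check. The file name and assertions here are illustrative.

```r
# tests/test_mean.R -- run automatically by "R CMD check"
x <- c(1, 2, 3)
stopifnot(mean(x) == 2)   # passes silently
stopifnot(all(x > 0))     # any error here fails the check
```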

Food for thought for any R developer.


R 3.5.3 Available

David Smith shares some info on R 3.5.3, released on Monday:

The R Core Team announced yesterday the release of R 3.5.3, and updated binaries for Windows and Linux are now available (with Mac sure to follow soon). This update fixes three minor bugs (to the functions writeLines, setClassUnion, and stopifnot), but you might want to upgrade just to avoid the “package built under R 3.5.3” warnings you might get for new CRAN packages in the future.

Click through for more info on this release, including where the name of each R release comes from.


Accidentally Building a Population Graph

Neil Saunders shares an example of a newspaper headline which ultimately just shows us population sizes:

Some poking around in the NSW Transport Open Data portal reveals how many people enter every Sydney train station on a “typical” day in 2016, 2017 and 2018. We could manipulate those numbers in various ways to estimate total, unique passengers for FY 2017-18 but I’m going to argue that the value as-is serves as a proxy variable for “station busyness”.

When working with spatial data, it’s important to differentiate between an effect you see because something is actually unique or interesting and an effect you see because that’s where all of the people are.
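For instance, one quick guard against the population effect is to normalize counts before mapping them; here is a hedged sketch with hypothetical data:

```r
# Hypothetical figures: convert raw station entries into a per-capita
# rate so "busyness" isn't just a map of where people live.
stations <- data.frame(
  station    = c("Central", "Parramatta"),
  entries    = c(100000, 30000),
  population = c(250000, 80000)   # residents in the surrounding area
)
stations$entries_per_capita <- stations$entries / stations$population
stations
```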


Identifying Distributions with knn in R

Abhijit Telang has an interesting post on identifying arbitrary distributions with the k-nearest-neighbor algorithm in R:

You can easily see how arbitrary the shapes can be almost magically discovered, through the principle of the nearest neighbor search.

The magic happens because the methodical approach of meeting and greeting the neighbors discovers more and more neighbors (and hence the visualization becomes denser and denser) as per the formation of the shape, and on the other hand, sparser and sparser as the traversal approaches the contours of those very shapes. The sparseness around the dense shapes provides the much-needed contrast to discover hidden shapes.

Read on for a very interesting explanation.
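To make the density/sparsity idea concrete, here is a hedged sketch (not Telang's code) using the FNN package: points inside a dense shape have a small average distance to their k nearest neighbors, while points in the sparse surroundings have a large one.

```r
library(FNN)  # assumed available; provides get.knn()

set.seed(42)
# A noisy ring plus uniform background noise
theta <- runif(500, 0, 2 * pi)
ring  <- cbind(cos(theta), sin(theta)) + matrix(rnorm(1000, sd = 0.05), ncol = 2)
noise <- matrix(runif(200, -1.5, 1.5), ncol = 2)
pts   <- rbind(ring, noise)

# Average distance to the 10 nearest neighbors as a density proxy
knn_dist <- rowMeans(get.knn(pts, k = 10)$nn.dist)

# Dark points = dense (the ring); light points = sparse background
plot(pts, pch = 19, col = grey(knn_dist / max(knn_dist)),
     xlab = "x", ylab = "y", main = "Shape discovery via k-NN distances")
```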


New Version of ggforce Available

Thomas Lin Pedersen announces a new version of ggforce for R:

If there is one thing of general utility lacking in ggplot2 it is probably the ability to annotate data cleanly. Sure, there’s geom_text()/geom_label() but using them requires a fair bit of fiddling to get the best placement and further, they are mainly relevant for labeling and not longer text. ggrepel has improved immensely on the fiddling part, but the lack of support for longer text annotation as well as annotating whole areas is still an issue.

In order to at least partly address this, ggforce includes a family of geoms under the geom_mark_*() moniker. They all behave equivalently except for how they encircle the given area(s).

There are some really interesting features in the ggforce package, so check them out.
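As a small sketch of the geom_mark_*() family in action, on a built-in dataset:

```r
library(ggplot2)
library(ggforce)

# geom_mark_ellipse() encircles each group and attaches its label
ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_mark_ellipse(aes(fill = Species, label = Species)) +
  geom_point()
```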


Parameters in Rmarkdown Files

Neil Saunders shows how you can parameterize Rmarkdown files, making later changes easy:

The reports follow a common template where the major difference is simply the hashtag. So one way to create these reports is to use the previous one, edit to find/replace the old hashtag with the new one, and save a new file.

That works…but what if we could define the hashtag once, then reuse it programmatically anywhere in the document? Enter Rmarkdown parameters.

The example is small but important.
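As a minimal sketch of the pattern (file names and the parameter value are illustrative): declare the hashtag once in the document's YAML header, reference it everywhere as params$hashtag, and re-render the same template with new values.

```r
# In hashtag_report.Rmd, the YAML header declares a default:
#   ---
#   title: "Hashtag report"
#   params:
#     hashtag: "#rstats"
#   ---
# and chunks refer to it as params$hashtag.

# Re-render the same template for a different hashtag:
rmarkdown::render("hashtag_report.Rmd",
                  params      = list(hashtag = "#datascience"),
                  output_file = "datascience_report.html")
```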


Multi-Server Real-Time Scoring with R

David Smith points out a reference architecture for real-time scoring with R distributed through Kubernetes:

Let’s say you’ve developed a predictive model in R, and you want to embed predictions (scores) from that model into another application (like a mobile or Web app, or some automated service). If you expect a heavy load of requests, R running on a single server isn’t going to cut it: you’ll need some kind of distributed architecture with enough servers to handle the volume of requests in real time.

This reference architecture for real-time scoring with R, published in Microsoft Docs, describes a Kubernetes-based system to distribute the load to R sessions running in containers.

Looks interesting.
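As a hedged sketch of the basic building block (the reference architecture's exact stack may differ): wrap the model in an HTTP scoring endpoint with plumber, bake that into a container image, and let Kubernetes replicate the containers behind a load balancer. The model file and input fields here are hypothetical.

```r
# plumber.R -- one containerized scoring endpoint
library(plumber)

model <- readRDS("model.rds")  # hypothetical pre-trained model

#* Score a single observation
#* @post /score
function(wt, hp) {
  newdata <- data.frame(wt = as.numeric(wt), hp = as.numeric(hp))
  list(score = predict(model, newdata))
}

# Serve locally for testing:
# plumber::plumb("plumber.R")$run(port = 8000)
```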


Pivot Tables and Grouping Sets With data.table in R

Jozef Hajnala shows how you can use data.table to perform fast pivoting and arbitrary grouping of data in R:

Data manipulation and aggregation is one of the classic tasks anyone working with data will come across. We of course can perform data transformation and aggregation with base R, but when speed and memory efficiency come into play, data.table is my package of choice.

In this post we will look at one of the fresh and very useful pieces of functionality that came to data.table only last year – grouping sets, enabling us, for example, to create pivot table-like reports with sub-totals and a grand total quickly and easily.

Grouping sets are also available in SQL dialects, yet they are something people tend not to be aware of. This is a shame because they’re quite powerful, and Jozef shows how powerful they can be in R.
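As a small sketch of the feature itself: data.table's cube() computes the aggregate for every combination of the grouping columns, with NA marking the sub-total and grand-total rows (groupingsets() offers finer control over which combinations to compute).

```r
library(data.table)

dt <- data.table(
  region  = c("N", "N", "S", "S"),
  product = c("A", "B", "A", "B"),
  sales   = c(10, 20, 30, 40)
)

# All grouping combinations: (region, product), (region), (product), ()
cube(dt, sum(sales), by = c("region", "product"))
# Rows with NA in region/product are the sub-totals and the grand total
```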
