Press "Enter" to skip to content

Category: R

Timing R Function Calls

Colin Gillespie shows off an R package for benchmarking:

Of course, it’s more likely that you’ll want to compare more than two things. You can compare as many function calls as you want with mark(), as we’ll demonstrate in the following example. It’s probably more likely that you’ll want to compare these function calls against more than one value. For example, in the digest package there are eight different algorithms. Ranging from the standard md5 to the newer xxhash64 methods. To compare times, we’ll generate n = 20 random character strings of length N = 10,000. This can all be wrapped up in the single function press() function call from the bench package:

Click through for an example involving hashing algorithms.

Comments closed

Exploratory Data Analysis with inspectdf

Laura Ellis continues a dive into Exploratory Data Analysis, this time using the inspectdf package:

I like this package because it’s got a lot of functionality and it’s incredibly straightforward to use. In short, it allows you to understand and visualize column types, sizes, values, value imbalance & distributions as well as correlations. Better yet, you can run each of these features for an individual data frame, or compare the differences between two data frames.

I liked the inspectdf package so much that in this blog, I’m going to extend my previous EDA tutorial with an overview of the package.

There are some interesting functions which make EDA easier, so check it out.

Comments closed

MRAN Changes and a Survey

David Smith discusses potential changes to MRAN:

As CRAN has grown and changes to packages have become more frequent, maintaining MRAN is an increasingly resource-intensive process. We’re contemplating changes, like changing the frequency of snapshots, or thinning the archive of snapshots that haven’t been used. But before we do that we’d  like to hear from the community first. Have you used MRAN snapshots? If so, how are you using them? How many different snapshots have you used, and how often do you change that up? Please leave your feedback at the survey link below by June 14, and we’ll use the feedback we gather in our decision-making process. Responses are anonymous, and we’ll summarize the responses in a future blog post. Thanks in advance!

Please take the survey as well. If you’ve used SQL Server Machine Learning Services (or SQL Server R Services), you’ve used MRAN.

Comments closed

Vectors for Programmers

John Mount has a couple of videos available:

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas do programmers find interesting about vectors, what concepts do they consider safe starting points, and how to condense and present the material.

Click through for the links, one with Python examples and the other with R examples.

Comments closed

Defining Tidy Data

John Mount shares thoughts about the concept of tidy data:

A question is: is such a data set “tidy”? The paper itself claims the above definitions are “Codd’s 3rd normal form.” So, no the above table is not “tidy” under that paper’s definition. The the winner’s date of birth is a fact about the winner alone, and not a fact about the joint row keys (the tournament plus year) as required by the rules of Codd’s 3rd normal form. The critique being: this data presentation does not express the intended data invariant that Al Fredrickson must have the same “Winner Date of Birth” in all rows.

My spin on it is that tidy data is Boyce-Codd Normal Form but may subsequently be denormalized. This may reintroduce violations of 3NF (as in Mount’s example) and sometimes 2NF, but does not change the shape of the variables themselves—that is, a variable still represents a single thing and exists per observation.

Comments closed

Visualizing Earthquake Data

Giorgio Garziano continues a series on analyzing earthquake data:

This is the third part of our post series about the exploratory analysis of a publicly available dataset reporting earthquakes and similar events within a specific 30 days time span. In this post, we are going to show static, interactive and animated earthquakes maps of different flavors by using the functionalities provided by a pool of R packages as specifically explained herein below.

Giorgio looks at 9 separate R mapping packages, so you get your money’s worth here.

Comments closed

Changes in R 3.6.0

David Smith lays out the major changes in R 3.6.0:

A major update to the open-source R language, R 3.6.0, was released on April 26 and is now available for download for Windows, Mac and Linux. As a major update, it has many new features, user-visible changes and bug fixes. You can read the details in the release announcement, and in this blog post I’ll highlight the most significant ones.

There are some good changes in here.

Comments closed

Horizontal Labels with ggplot

Michael Toth shows us how to ensure we use horizontal text labels in ggplot:

There are several things we could do to improve this graph, but in this guide let’s focus on rotating the y-axis label. This simple change will make your graph so much better. That way, people won’t have to tilt their heads like me to understand what’s going on in your graph:

It may not seem like much when you’re creating the visual, but it can make a difference for a viewer.

Comments closed

Repeated Cross-Validation in R

Ludvig Olsen walks us through a couple of nice R packages:

The benefits of using groupdata2 to create the folds are 1) that it allows us to balance the ratios of our output classes (or simply a categorical column, if we are working with linear regression instead of classification), and 2) that it allows us to keep all observations with a specific ID (e.g. participant/user ID) in the same fold to avoid leakage between the folds.

The benefit of cvms is that it trains all the models and outputs a tibble (data frame) with results, predictions, model coefficients, and other sweet stuff, which is easy to add to a report or do further analyses on. It even allows us to cross-validate multiple model formulas at once to quickly compare them and select the best model.

Ludvig also gives us some examples of how both packages can help you out. H/T R-Bloggers

Comments closed