Category: R

magrittr’s Four Pipes

Gregory Janesch shows off the various pipes in the magrittr R package:

The magrittr package is a part of the extended tidyverse – i.e., not one of the ones normally loaded. It is the one that supplies the pipe operator (%>%), but it turns out that the package actually contains four pipe operators in total. All are intended to streamline and improve the readability of code, though the three non-basic ones are a bit more situational, and I’ve rarely seen them used, so I thought I would go into them a bit.

The CRAN page for magrittr is here; much of this post is based off of the package’s vignettes and documentation.

Click through for demonstrations of each. I’ve only seen the basic pipe in use as well, but the others look quite interesting and I can see use cases where knowing about them would be helpful. Also, see the comments on the post for a note about the secret fifth pipe. H/T R-Bloggers.
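
For a quick feel for the four operators, here is a minimal sketch using the built-in mtcars data. The variable names are mine, not from the linked post, and this only scratches the surface of what the vignettes cover.

```r
library(magrittr)

# %>% : forward the left-hand side as the first argument of the right-hand side
mean_mpg <- mtcars$mpg %>% mean()

# %T>% : "tee" pipe; runs the right-hand side for its side effect (here,
# printing) and then passes the original left-hand side down the chain
tee_result <- c(1, 4, 9) %T>% print() %>% sqrt()

# %$% : exposition pipe; exposes the columns of the left-hand side by name
mpg_wt_cor <- mtcars %$% cor(mpg, wt)

# %<>% : assignment pipe; pipes x through the function and overwrites x
x <- c(9, 16, 25)
x %<>% sqrt()
```

The tee and assignment pipes both trade a little explicitness for brevity, which may be why they show up so rarely in the wild.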

Subgroup Analysis via Bayesian Hierarchical Modeling

Keith Goldfeld ponders subgroup analysis:

Which got me thinking, of course, about subgroup analyses. In the context of a null hypothesis significance testing framework, it is well known that conducting numerous post hoc analyses carries the risk of dramatically inflating the probability of a Type 1 error – concluding there is some sort of effect when in fact there is none. So, if there is no overall effect, and you decide to look at a subgroup of the sample (say patients over 50), you may find that the treatment has an effect in that group. But, if you failed to adjust for multiple tests, then that conclusion may not be warranted. And if that second subgroup analysis was not pre-specified or planned ahead of time, that conclusion may be even more dubious.

If we use a Bayesian approach, we might be able to avoid this problem, and there might be no need to adjust for multiple tests. I have started to explore this a bit using simulated data under different data generation processes and prior distribution assumptions. It might all be a bit too much for a single post, so I am planning on spreading it out a bit.

Read on for two separate Bayesian model approaches to the problem. H/T R-Bloggers.
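
To make the Type 1 error inflation concrete, here is a small frequentist sketch of my own (it is not one of the Bayesian models from the post). It assumes ten independent, non-overlapping subgroups, no true treatment effect anywhere, and a test at the 0.05 level in each subgroup; the subgroup count and sample sizes are arbitrary choices for illustration.

```r
alpha <- 0.05

# Probability of at least one false positive across k independent tests
fwer <- function(k, alpha = 0.05) 1 - (1 - alpha)^k
theoretical_fwer <- fwer(10, alpha)

# Check by simulation: both arms come from the same distribution, so every
# "significant" subgroup result is a false positive.
set.seed(2021)
n_subgroups <- 10
n_per_arm <- 30
any_false_positive <- replicate(2000, {
  p_values <- replicate(
    n_subgroups,
    t.test(rnorm(n_per_arm), rnorm(n_per_arm))$p.value
  )
  any(p_values < alpha)
})
simulated_fwer <- mean(any_false_positive)

# A Bonferroni-style threshold (alpha / k) brings the rate back near alpha
bonferroni_fwer <- fwer(10, alpha / n_subgroups)
```

With ten unadjusted tests the chance of at least one spurious finding is roughly 40%, which is the problem the hierarchical models in the post aim to sidestep.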

Using tsoutliers() to Detect Time Series Outliers

Rob J. Hyndman shows off a function in the forecast package in R:

The tsoutliers() function in the forecast package for R is useful for identifying anomalies in a time series. However, it is not properly documented anywhere. This post is intended to fill that gap.

The function began as an answer on CrossValidated and was later added to the forecast package because I thought it might be useful to other people. It has since been updated and made more reliable.

Read on to see how it works. This is one of the reasons I like the R programming language so much for data analysis and statistics: usually, somebody smarter than me has already built a solution to the problem and it’s just a matter of finding the right function. H/T R-Bloggers.
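
If you want to try the function before reading the full write-up, a minimal sketch looks something like this. The series is simulated and the spike at position 60 is planted by me; none of it comes from the linked post.

```r
library(forecast)

# Simulated random-walk series with one artificial spike (illustrative only)
set.seed(42)
y <- ts(cumsum(rnorm(120)))
y[60] <- y[60] + 50

# tsoutliers() returns a list with the positions of suspected outliers
# ($index) and suggested replacement values ($replacements)
out <- tsoutliers(y)
out$index
out$replacements

# tsclean() applies those replacements (and interpolates missing values)
y_clean <- tsclean(y)
```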

Estimating the Likelihood of an Underdog Winning at Soccer

Holger von Jouanne-Diedrich lays out the math for us:

The Bundesliga is Germany’s primary football league. It is one of the most important football leagues in the world, broadcast on television in over 200 countries.

If you want to get your hands on a tool to forecast the result of any game (and perform some more statistical analyses), read on!

What I would like is a tool which has SC Freiburg utterly dominating Bayern. Said tool may be more mythological than scientific (or at least a copy of Football Manager and a little bit of save scumming…), but I’ll take it.

From API Call to ML Services Prediction

Tomaz Kastrun continues a series:

From the previous two blog posts:

Creating REST API for reading data from Microsoft SQL Server in web browser

Writing Data to Microsoft SQL Server from web browser using REST API and node.js

We have looked into the installation process of Node.js, setup of Microsoft SQL Server and made a couple of examples on reading the data from database through REST API and how to insert data back to database.

In this post, we will be looking at the R predictions using API calls against a sample dataset.

Click through to see it in action.

A Learning Path for Data Science with R

Holger von Jouanne-Diedrich has a greatest hits album:

Over the course of the last two and a half years, I have written over one hundred posts for my blog “Learning Machines” on the topics of data science, i.e. statistics, artificial intelligence, machine learning, and deep learning.

I use many of those in my university classes and in this post, I will give you the first part of a learning path for the knowledge that has accumulated on this blog over the years to become a well-rounded data scientist, so read on!

Read on for links to dozens of posts on interesting topics.

BCP from R into SQL Server

Thomas Roh shows how you can perform bulk insert operations into SQL Server using the bcputility package in R:

Writing large datasets to SQL Server can be very slow using the DBI package with an odbc connection. The issue with writing data is that individual INSERT statements are generated for each row of data. I’ve also had issues with remote connections that can make large writes to SQL Server take a very long time. SQL Server Management Studio does provide a GUI interface to import data that is much more efficient. For those that want to include the data import in their reproducible R workflows there are a couple of options.

Read on to see how it works. It’s still calling bcp.exe under the covers, so expect similar foibles using it as you would bcp. H/T R-Bloggers.
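
To show roughly what "calling bcp.exe under the covers" amounts to, here is a hand-rolled sketch that stages a data frame as a delimited file and builds the arguments for a bcp import. This is not the bcputility API: build_bcp_import is a hypothetical helper, and the table, server, and database names are placeholders. The flags used (-S, -d, -T, -c) are standard bcp options, but terminator handling (-t, -r) may need adjusting for your data and platform.

```r
# Hypothetical helper (not part of bcputility): write the data to a temp file
# and assemble the argument vector for a bcp "in" (import) operation.
build_bcp_import <- function(df, table, server, database) {
  path <- tempfile(fileext = ".tsv")
  # bcp expects raw delimited rows: no header, no quoting
  write.table(df, path, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  args <- c(table, "in", path,
            "-S", server,    # server name
            "-d", database,  # database name
            "-T",            # trusted (Windows) authentication
            "-c")            # character mode, tab-delimited by default
  list(path = path, args = args)
}

# Placeholder connection details, for illustration only
job <- build_bcp_import(head(iris), "dbo.iris", "localhost", "scratch")

# Running the import requires bcp on the PATH and a reachable server:
# system2("bcp", job$args)
```

One bulk file load replaces thousands of single-row INSERT statements, which is where the speedup comes from.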

Performance Tips when Working with Large Datasets in R

Mira Celine Klein continues a series on performance tuning R code:

Whether your dataset is “large” not only depends on the number of rows, but also on the method you are going to use. It’s easy to compute the mean or sum of as many as 10,000 numbers, but a nonlinear regression with many variables can already take some time with a sample size of 1,000.

Sometimes it may help to parallelize (see part 3 of the series). But with large datasets, you can use parallelization only up to the point where working memory becomes the limiting factor. In addition, there may be tasks that cannot be parallelized at all. In these cases, the strategies from part 2 of this series may be helpful, and there are some more ways:

Click through for four options.

Caching Function Results in an R Package

Maëlle Salmon and Christophe Dervieux show us ways to cache results of function calls using R:

Caching means that if you call a function several times with the exact same input, the function is only actually run the first time. The result is stored in a cache of some sort (more practical details later!). Every other time the function is called with the same input, the result is retrieved from the cache unless invalidated. You will often think of caching as something valid in only one R session, but we’ll see it can be persistent across sessions via storage on disk.

As a quick note, caching makes sense when writing pure functions, that is, functions without side effects. If you have side effects, caching might not give you what you expect.
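
Here is a bare-bones, in-session sketch of the idea in base R, with a call counter standing in for a side effect. memoise_simple and slow_square are names I made up for illustration; packages such as memoise provide a far more complete version, including on-disk caches.

```r
# Wrap a one-argument function so repeated calls with the same input
# are answered from an in-memory cache.
memoise_simple <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- paste(deparse(x), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)  # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# The counter is a side effect: it only changes when the body really runs
calls <- 0
slow_square <- function(x) {
  calls <<- calls + 1
  x^2
}

fast_square <- memoise_simple(slow_square)
first <- fast_square(4)
second <- fast_square(4)  # served from the cache; slow_square is not re-run
third <- fast_square(5)   # new input, so slow_square runs again
```

After these three calls the counter sits at 2 rather than 3, which is exactly the kind of surprise side effects cause once caching is involved.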
