Category: R

Solving the Prisoner Coin Flipping Puzzle with R

Published 2020-05-05 by Kevin Feasel

David Robinson takes us through another problem-solving challenge:

You are locked in the dungeon of a faraway castle with three fellow prisoners (i.e., there are four prisoners in total), each in a separate cell with no means of communication. But it just so happens that all of you are logicians (of course)….
Each prisoner will be given a fair coin, which can either be fairly flipped one time or returned to the guards without being flipped. If all flipped coins come up heads, you will all be set free! But if any of the flipped coins comes up tails, or if no one chooses to flip a coin, you will all be doomed to spend the rest of your lives in the castle’s dungeon.
The only tools you and your fellow prisoners have to aid you are random number generators, which will give each prisoner a random number, uniformly and independently chosen between zero and one.
What are your chances of being released?
I’ll solve this with tidy simulation in R, in particular using one of my favorite functions, tidyr’s crossing(). In an appendix, I’ll show how to get a closed form solution for N = 4.
I’ve also posted a 30-minute screencast of how I first approached the simulation and visualization.

Click through for the solution and explanation.
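As a rough illustration of the kind of simulation described, here is a minimal base R sketch. It assumes one particular symmetric strategy (each prisoner flips only when their random number falls below a threshold p), which is an assumption for illustration and not necessarily the approach in the linked post; it also uses plain matrices rather than tidyr's crossing(). The function names are hypothetical.

```r
# Assumed strategy: each prisoner flips only if their uniform random
# number is below the threshold p. All names here are illustrative.
set.seed(2020)

simulate_release <- function(p, n_prisoners = 4, n_trials = 20000) {
  # TRUE where a prisoner decides to flip
  flips <- matrix(runif(n_trials * n_prisoners) < p, nrow = n_trials)
  # TRUE where the coin would land heads (only matters if it is flipped)
  heads <- matrix(runif(n_trials * n_prisoners) < 0.5, nrow = n_trials)
  any_flipped <- rowSums(flips) > 0
  any_tails <- rowSums(flips & !heads) > 0
  mean(any_flipped & !any_tails)
}

# Closed form under the same strategy: every prisoner either abstains
# (probability 1 - p) or flips heads (probability p / 2), minus the
# case where nobody flips at all.
exact_release <- function(p, n_prisoners = 4) {
  (1 - p / 2)^n_prisoners - (1 - p)^n_prisoners
}

p_grid <- seq(0.05, 0.95, by = 0.05)
simulated <- sapply(p_grid, simulate_release)
exact <- exact_release(p_grid)

# Threshold that maximizes the chance of release for four prisoners
best <- optimize(exact_release, interval = c(0, 1), maximum = TRUE)
```

Comparing `simulated` against `exact` is a quick sanity check on the simulation, and `best$maximum` and `best$objective` give the best threshold and release probability under this assumed strategy.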


R 4.0 Released

Published 2020-04-28 by Kevin Feasel

David Smith walks us through what’s new in R 4.0:

R 4.0.0 was released in source form on Friday, and binaries for Windows, Mac and Linux are available for download now.
As the version number bump suggests, this is a major update to R that makes some significant changes. Some of these changes — particularly the first one listed below — are likely to affect the results of R’s calculations, so I would not recommend running scripts written for prior versions of R without validating them first. In any case, you’ll need to reinstall any packages you were using for R 4.0.0. (You might find this R script useful for checking what packages you have installed for R 3.x.)

And I just got 3.6 into production yesterday. Them’s the breaks…


The Siren Song of High Accuracy

Published 2020-04-28 by Kevin Feasel

Holger von Jouanne-Diedrich notes that accuracy is not in itself necessarily a good thing for a machine learning model:

In one of my most popular posts, “So, what is AI really?”, I showed that Artificial Intelligence (AI) basically boils down to autonomously learned rules, i.e. conditional statements or simply, conditionals.
In this post, I create the simplest possible classifier, called ZeroR, to show that even this classifier can achieve surprisingly high values for accuracy (i.e. the ratio of correctly predicted instances)… and why this is not necessarily a good thing, so read on!

The nuanced answer here is that with classifiers, accuracy is not in itself a great measure in the case of class imbalance. The more balanced your classes are, the more likely it is that a model with high accuracy is a good model. That’s where other measures such as specificity and sensitivity, positive & negative predictive value, etc. come into play.
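To make the class imbalance point concrete, here is a small base R sketch with made-up data (roughly 5% positives). It is not the code from the linked post; it simply builds a ZeroR-style classifier that always predicts the majority class and then compares accuracy against sensitivity and specificity.

```r
# Synthetic, imbalanced labels: roughly 5% positives (made-up data).
set.seed(42)
actual <- factor(ifelse(runif(1000) < 0.05, "yes", "no"),
                 levels = c("no", "yes"))

# ZeroR: ignore all features and always predict the most common class.
majority <- names(which.max(table(actual)))
predicted <- factor(rep(majority, length(actual)), levels = levels(actual))

confusion <- table(predicted = predicted, actual = actual)

accuracy <- sum(diag(confusion)) / sum(confusion)
# True positive rate: share of actual "yes" cases predicted as "yes"
sensitivity <- confusion["yes", "yes"] / sum(confusion[, "yes"])
# True negative rate: share of actual "no" cases predicted as "no"
specificity <- confusion["no", "no"] / sum(confusion[, "no"])
```

Accuracy comes out high simply because most labels are "no", while sensitivity is zero: the classifier never finds a single positive case.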


Useful R Packages for Data Scientists

Published 2020-04-27 by Kevin Feasel

Tomaz Kastrun has a nice collection of useful R packages for data scientists:

There are thousands of R packages available on CRAN (with all the mirror sites), on GitHub, and in individual developers’ repositories.
Many useful functions are available in several different R packages, and much of the same functionality is duplicated across them, so which package one decides to use boils down to user preference and the work at hand. From the perspective of a statistician and data scientist, I will cover the essential and major packages in sections. This is by no means a definitive list, only a personal preference.

Click through for Tomaz’s recommendations.


R Checkpoint Package Update Now in Beta

Published 2020-04-22 by Kevin Feasel

Hong Ooi announces that a revamp of the checkpoint package is now in beta:

Checkpoint has been around for nearly 6 years now, helping R users solve the reproducible research puzzle. In that time, it’s seen many changes, new features, and, inevitably, bug reports. Some of these bugs have been fixed, while others remain outstanding in the too-hard basket.
Many of these issues spring from the fact that it uses only base R functions, in particular install.packages, to do its work. The problem is that install.packages is meant for interactive use, and as an API, is very limited. For starters, it doesn’t return a result to the caller—instead, checkpoint has to capture and parse the printed output to determine whether the installation succeeded. This causes a host of problems, since the printout will vary based on how R is configured. Similarly, install.packages refuses to install a package if it’s in use, which means checkpoint must unload it first—an imperfect and error-prone process at best.
In addition to these, checkpoint’s age means that it has accumulated a significant amount of technical debt over the years. For example, there is still code to handle ancient versions of R that couldn’t use HTTPS, even though the MRAN site (in line with security best practice) now accepts HTTPS connections only.

Click through to see what’s in the new checkpoint package.


Cross-Validation in R with crossval

Published 2020-04-20 by Kevin Feasel

Thierry Moudiki shows off some functionality in the crossval package:

In this post, I present some examples of use of crossval on a linear model, and on the popular xgboost and randomForest models. The error measure used is Root Mean Squared Error (RMSE), and is currently the only choice implemented.

Click through for the demonstration in notebook form. H/T R-Bloggers.
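For a sense of what k-fold cross-validation with RMSE involves, here is a hand-rolled base R sketch on the built-in mtcars data. It deliberately does not call the crossval package, whose interface is not shown here; the helper name cv_rmse and the model formula are assumptions for illustration.

```r
# Manual k-fold cross-validation of a linear model.
# This does not use the crossval package; cv_rmse is a hypothetical helper.
set.seed(123)

cv_rmse <- function(data, formula, k = 5) {
  # Randomly assign each row to one of k folds of near-equal size
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  sapply(seq_len(k), function(i) {
    train <- data[fold != i, ]
    test <- data[fold == i, ]
    fit <- lm(formula, data = train)
    pred <- predict(fit, newdata = test)
    # Root Mean Squared Error on the held-out fold
    sqrt(mean((test[[response]] - pred)^2))
  })
}

rmse_by_fold <- cv_rmse(mtcars, mpg ~ wt + hp, k = 5)
mean_rmse <- mean(rmse_by_fold)
```

The spread of `rmse_by_fold` gives a feel for how much the error estimate depends on which rows land in the held-out fold.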


Time Series Forecasting Best Practices

Published 2020-04-15 by Kevin Feasel

David Smith talks about a new GitHub repo:

The repository includes detailed examples of various time series modeling techniques, as Jupyter Notebooks for Python, and R Markdown documents for R. It also includes Python notebooks to fit time series models in the Azure Machine Learning service, and then operationalize the forecasts as a web service.
The R examples demonstrate several techniques for forecasting time series, specifically data on refrigerated orange juice sales from 83 stores (sourced from the bayesm package). The forecasting techniques vary (mean forecasting with interpolation, ARIMA, exponential smoothing, and additive models), but all make extensive use of the tidyverts suite of packages, which provides “tidy time series forecasting for R”. The forecasting methods themselves are explained in detail in the book (readable online) Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos (Monash University).

This looks really cool.
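To give a flavor of a few of the techniques named above, here is a small sketch using only base R's stats package and the built-in Nile series. This is an assumption-laden stand-in: it does not use the tidyverts packages or the orange juice data from the repository, and the ARIMA order is picked arbitrarily for illustration.

```r
# Hold out the last 10 years of the built-in Nile series for evaluation.
train <- window(Nile, end = 1960)
test <- window(Nile, start = 1961)
h <- length(test)

# 1. Mean forecast: repeat the training mean
mean_fc <- rep(mean(train), h)

# 2. Simple exponential smoothing (no trend, no seasonal component)
ses_fit <- HoltWinters(train, beta = FALSE, gamma = FALSE)
ses_fc <- as.numeric(predict(ses_fit, n.ahead = h))

# 3. ARIMA(0,1,1); the order is an arbitrary choice for this sketch
arima_fit <- arima(train, order = c(0, 1, 1))
arima_fc <- as.numeric(predict(arima_fit, n.ahead = h)$pred)

rmse <- function(actual, forecast) {
  sqrt(mean((as.numeric(actual) - forecast)^2))
}

scores <- c(
  mean = rmse(test, mean_fc),
  ses = rmse(test, ses_fc),
  arima = rmse(test, arima_fc)
)
```

Holding out the tail of the series and scoring each method on it is the simplest way to compare forecasts on equal footing.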


Understanding ROC Curves

Published 2020-04-14 by Kevin Feasel

Holger von Jouanne-Diedrich explains the concept of ROC curves:

One widely used graphical plot to assess the quality of a machine learning classifier or the accuracy of a medical test is the Receiver Operating Characteristic curve, or ROC curve. If you want to gain an intuition and see how they can be easily created with base R, read on!

I like this explanation a lot.
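As a quick sketch of the idea in base R, the snippet below builds the points of a ROC curve by sweeping a threshold over classifier scores. The scores and labels are made up for illustration and this is not the code from the linked post.

```r
# Made-up scores: positives tend to score higher than negatives.
set.seed(1)
labels <- c(rep(1, 100), rep(0, 100))
scores <- c(rnorm(100, mean = 1), rnorm(100, mean = 0))

# Sweep thresholds from high to low; at each one, predict "positive"
# when the score is at or above the threshold.
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# Area under the curve by the trapezoid rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
```

Calling `plot(fpr, tpr, type = "l")` followed by `abline(0, 1, lty = 2)` draws the curve against the diagonal that a random classifier would trace.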


Changing the Graphics Device in RMarkdown Docs

Published 2020-04-14 by Kevin Feasel

Colin Gillespie shows us how to change PDF and PNG output settings within knitr:

In many workflows, function calls to graphic devices are not explicit. Instead, the call is made by another package, such as knitr.
When knitting an R Markdown document, the default graphics device when creating PDF documents is grDevices::pdf() and for HTML documents it’s grDevices::png(). As we demonstrated, these are the worst possible choices!

Click through to see what you can do about it.
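One way this can look in practice is a setup chunk that sets knitr's `dev` chunk option. The sketch below is an assumption about a reasonable configuration, not necessarily what the linked post recommends, and the cairo-based devices only work on R builds with cairo support.

```r
# Hypothetical setup chunk for an R Markdown document.
# The device choices here are assumptions; cairo-based devices need an
# R build with cairo support (check capabilities("cairo") first).
if (knitr::is_latex_output()) {
  # PDF output: use the cairo PDF device instead of grDevices::pdf()
  knitr::opts_chunk$set(dev = "cairo_pdf")
} else {
  # HTML output: keep png, but ask for the cairo backend
  knitr::opts_chunk$set(dev = "png", dev.args = list(type = "cairo"))
}
```

Setting the option once in a setup chunk applies it to every later chunk, and individual chunks can still override `dev` in their own headers.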


Tidy Simulation of Stochastic Processes in R

Published 2020-04-14 by Kevin Feasel

David Robinson shows off my favorite distribution:

The Riddler puzzle describes a Poisson process, which is one of the most important stochastic processes. A Poisson process models the intuitive concept of “an event is equally likely to happen at any moment.” It’s named because the number of events occurring in a time interval of length t is distributed according to Poisson(λt), for some rate parameter λ (for this puzzle, the rate is described as one per day, λ = 1).
How can we simulate a Poisson process? This is an important connection between distributions. The waiting time for the next event in a Poisson process has an exponential distribution, which can be simulated with rexp().

Read on to learn about the Poisson distribution and Yule processes.
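The link between exponential waiting times and Poisson counts is easy to check in base R. The sketch below is illustrative and not the code from the linked post: the window length, number of simulations, and helper name are arbitrary assumptions, while the rate of one event per day comes from the puzzle.

```r
set.seed(2020)
rate <- 1         # one event per day, as in the puzzle
horizon <- 3      # observation window in days (arbitrary choice)
n_sims <- 10000

count_events <- function(rate, horizon) {
  # Draw far more exponential gaps than the window could plausibly need,
  # then accumulate them into arrival times.
  gaps <- rexp(50, rate = rate)
  arrivals <- cumsum(gaps)
  sum(arrivals <= horizon)
}

counts <- replicate(n_sims, count_events(rate, horizon))

# If the process is Poisson, counts should follow Poisson(rate * horizon):
mean_count <- mean(counts)   # close to rate * horizon
var_count <- var(counts)     # a Poisson variance equals its mean
p_zero_sim <- mean(counts == 0)
p_zero_exact <- dpois(0, lambda = rate * horizon)
```

Comparing `table(counts) / n_sims` against `dpois(0:10, rate * horizon)` extends the same check to the whole distribution.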

