Press "Enter" to skip to content

Category: R

Random Sequences and Probabilities

Holger von Jouanne-Diedrich explains the results of a poll:

Some time ago I conducted a poll on LinkedIn that quickly went viral. I asked which of three different coin-tossing sequences was the most likely, and I received exactly 1,592 votes! Nearly 48,000 people viewed it, and there are more than 80 comments under the post (you need a LinkedIn account to see it in full here: LinkedIn Coin Tossing Poll).

In this post I will give the solution with some background explanation, so read on!

Read on to understand why, when flipping a coin, you are just as likely to see the sequence H,H,H,H,H,H as you are to see H,T,H,T,H,T.
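
As a quick illustration (my own sketch, not code from the post), you can simulate this in R: any specific six-flip sequence occurs with probability (1/2)^6 = 1/64, so the two sequences below should show up about equally often.

# Simulate many six-flip sequences; each specific sequence
# has probability (1/2)^6 = 1/64
set.seed(123)
trials <- 100000
seqs <- replicate(trials, paste(sample(c("H", "T"), 6, replace = TRUE), collapse = ""))
mean(seqs == "HHHHHH")  # roughly 0.0156
mean(seqs == "HTHTHT")  # roughly 0.0156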


Troubleshooting Code Performance in R

Mira Celine Klein shows how to benchmark R code performance:

Let’s assume you have written some code; it works and computes the results you need, but it is really slow. If you don’t want to be slowed down in your work, you have no choice but to improve the code’s performance. But how do you start? The best approach is to find out which part of the code is worth optimizing.

It is not always obvious which part of the code makes it so slow, or which of several alternatives is fastest. There is a risk of spending a lot of time optimizing the wrong part of the code. Fortunately, there are ways to systematically test how long a computation takes. An easy one is the function system.time: just wrap your code in this function, and (in addition to the actual results of that code) you will get the time your code took to run.
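
For example (a minimal sketch of my own, not code from the post), timing a computation with system.time looks like this:

# Wrap an expression in system.time; "elapsed" is wall-clock seconds
system.time({
  x <- rnorm(1e7)
  x_sorted <- sort(x)
})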

But that’s not the only route—read on to learn about other techniques as well and see them in action.


Check Those Feature Distributions

Antoine Rebecq shares a warning:

I was recently working on a cool dataset that looked unusually friendly. It was tidy, neat, interesting… the kind of thing you rarely encounter in the wild! My goal was to build a super simple predictor for one of the features. However, I kept getting poor results and at first couldn’t figure out what was happening.
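
A first-pass distribution check in R might look like the following (a minimal sketch with a hypothetical data frame df and column y, not the author's code):

# Summary statistics and a histogram reveal skew, outliers,
# or near-constant features before you start modeling
summary(df$y)
hist(df$y, breaks = 50, main = "Distribution of y")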

There’s some good, practical advice in there, so check it out. H/T R-Bloggers


Backtesting Options Strategies in R

Holger von Jouanne-Diedrich is in the money:

Options trading strategies combine derivative instruments, often several of them, to create a certain risk-return profile (more on that here: Financial Engineering: Static Replication of any Payoff Function). Often we want to know how those strategies would fare in the real world.

The problem is that real data on derivatives are hard to come by and/or very expensive. But we can help ourselves with a very good proxy: implied volatility, which is freely available for many indices, for example. With that, we can use the good old Black-Scholes model to reasonably price options whose strikes are not too far from the current price of the underlying.
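
As a sketch of the pricing step (the standard Black-Scholes call formula, not the author's backtesting code), valuing a European call from an implied volatility looks like this:

# Black-Scholes price of a European call
# S = spot, K = strike, r = risk-free rate, sigma = implied vol, tau = years to expiry
bs_call <- function(S, K, r, sigma, tau) {
  d1 <- (log(S / K) + (r + sigma^2 / 2) * tau) / (sigma * sqrt(tau))
  d2 <- d1 - sigma * sqrt(tau)
  S * pnorm(d1) - K * exp(-r * tau) * pnorm(d2)
}
bs_call(S = 100, K = 105, r = 0.01, sigma = 0.2, tau = 0.5)  # about 3.8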

Read on to see how.


Simulating Prediction Intervals

Bryan Shalloway continues a series:

Part 1 of my series of posts on building prediction intervals used data held out from model training to evaluate the characteristics of prediction intervals. In this post I will use hold-out data to estimate the width of the prediction intervals directly. Doing so can provide more reasonable and flexible intervals compared to analytic approaches.
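
The core idea, in a minimal sketch of my own (hypothetical actual and pred vectors, not Bryan's exact code): take empirical quantiles of the hold-out residuals and use them as interval offsets.

# Empirical 90% prediction interval from hold-out residuals
resid <- actual - pred                      # residuals on the hold-out set
offsets <- quantile(resid, c(0.05, 0.95))   # 5th and 95th percentiles
new_prediction + offsets                    # approximate 90% interval for a new point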

Click through for the article, and be sure to check out part 1 if you haven’t already.


sparklyr 1.6 Released

Carly Driggers announces a new release of sparklyr:

Sparklyr, an LF AI & Data Foundation Incubation Project, has released version 1.6! Sparklyr is an R Language package that lets you analyze data in Apache Spark, the well-known engine for big data processing, while using familiar tools in R. The R Language is widely used by data scientists and statisticians around the world and is known for its advanced features in statistical computing and graphics. 
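
Getting started takes only a couple of lines (a minimal local-mode sketch; the 1.6-specific changes are in the linked post):

library(sparklyr)
sc <- spark_connect(master = "local")  # local Spark, good for experimenting
mtcars_tbl <- copy_to(sc, mtcars)      # copy an R data frame into Spark
spark_disconnect(sc)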

Click through to see the changes.


Working with Prediction Intervals

Bryan Shalloway explains how generating prediction intervals is different from making point predictions:

Before using the model for predictive inference, one should have reviewed overall performance on a holdout dataset to ensure the model is sufficiently accurate for the business context. For example, for our problem, are an average error of ~12% and 90% prediction intervals of +/- ~25% of Sale_Price useful? If the answer is “no,” that suggests the model’s accuracy needs more work (e.g., trying other transformations, features, or model types). For our examples we are assuming the answer is “yes,” our model is accurate enough (so it is appropriate to move on and focus on prediction intervals).
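
One simple way to run that review (my sketch with hypothetical hold-out columns, not the post's code) is to compute the average error and the interval coverage on the hold-out set:

# Average percentage error and 90%-interval coverage on hold-out data
mean(abs(holdout$pred - holdout$Sale_Price) / holdout$Sale_Price)
mean(holdout$Sale_Price >= holdout$lower & holdout$Sale_Price <= holdout$upper)  # should be near 0.90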

Click through for the article.


Last Observation Carried Forward in R

Nathan Eastwood shows how to perform Last Observation Carried Forward in R:

Real-life data is often riddled with missing values – or NAs – where no value is stored for a variable in an observation, for example because individuals dropped out of a study or subjects did not properly report responses. Missing data such as this can have a significant effect on the conclusions that can be drawn from the data. A common solution is to fill each NA with the most recent non-NA value before it – this is called the Last Observation Carried Forward (LOCF) method.

This blog post will look at how to implement this particular solution using a combination of {dplyr}, {dbplyr} and {sparklyr}.
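
For plain local data frames, the same idea can be expressed with tidyr::fill (the post itself shows how to get equivalent behavior on Spark via {dbplyr} and {sparklyr}); a minimal sketch:

library(dplyr)
library(tidyr)

df <- tibble(
  id  = c(1, 1, 1, 2, 2),
  val = c(10, NA, NA, 5, NA)
)

# Carry the last non-NA value forward within each id
df %>%
  group_by(id) %>%
  fill(val, .direction = "down") %>%
  ungroup()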

Click through for the solution.


Generating Random Numbers in R

Holger von Jouanne-Diedrich brings the noise:

In data science, we try to find, sometimes well-hidden, patterns (= signal) in often seemingly random data (= noise). Pseudo-Random Number Generators (PRNG) try to do the opposite: hiding a deterministic data generating process (= signal) by making it look like randomness (= noise). If you want to understand some basics behind the scenes of this fascinating topic, read on!
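
To make the idea concrete, here is a minimal linear congruential generator, one of the simplest PRNG designs (my sketch, not necessarily the generator discussed in the post):

# Linear congruential generator: x[n+1] = (a * x[n] + c) mod m
# A fully deterministic recurrence whose output looks random
lcg <- function(n, seed = 42, a = 1664525, c = 1013904223, m = 2^32) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    seed <- (a * seed + c) %% m
    out[i] <- seed / m  # scale into [0, 1)
  }
  out
}
lcg(5)  # same seed, same "random" numbers every time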

Click through for an explanation of the process.


Creating a Rose Chart in R

Neil Saunders takes a look at a classic chart:

I first heard Florence Nightingale and her Geeks Declare War on Death, an episode of the Cautionary Tales podcast, when it premiered as a special episode of 99% Invisible. It discusses Nightingale’s work as a statistician and, in particular, her visualisation of mortality causes in the Crimean War using the famous “rose chart”, or polar area diagram.

I’m sure you’re thinking: how can I explore that using R? 

Read on to find out.
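
As a taste of what is possible (a sketch with made-up data, not necessarily the author's approach), ggplot2's coord_polar can produce a basic rose chart:

library(ggplot2)

# Made-up monthly counts, purely illustrative
df <- data.frame(
  month = factor(month.abb, levels = month.abb),
  count = c(32, 28, 25, 20, 17, 15, 14, 18, 22, 27, 30, 34)
)

ggplot(df, aes(x = month, y = count)) +
  geom_col(width = 1, fill = "steelblue", colour = "white") +
  coord_polar() +
  labs(title = "A simple polar area (rose) chart")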
