Press "Enter" to skip to content

Category: R

Troubleshooting Code Performance in R

Mira Celine Klein shows how to benchmark R code performance:

Let’s assume you have written some code, it’s working, it computes the results you need, but it is really slow. If you don’t want to be slowed down in your work, you have no choice but to improve the code’s performance. But how do you start? The best approach is to first find out which parts of the code are worth optimizing.

It is not always obvious which part of the code makes it so slow, or which of multiple alternatives is fastest. There is a risk of spending a lot of time optimizing the wrong part of the code. Fortunately, there are ways to systematically test how long a computation takes. An easy way is the function system.time: just wrap your code in it, and you will get (in addition to the actual results of that code) the time your code took to run.
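To illustrate the idea (with a throwaway function of my own, not code from the article), wrapping a call in system.time reports how long it took:

# Hypothetical example: timing a deliberately slow function with system.time
slow_mean <- function(n) {
  x <- numeric(0)
  for (i in seq_len(n)) x <- c(x, sqrt(i))  # growing a vector inside a loop is slow
  mean(x)
}

system.time(slow_mean(1e5))  # prints user, system, and elapsed time in seconds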

But that’s not the only route—read on to learn about other techniques as well and see them in action.

Check Those Feature Distributions

Antoine Rebecq shares a warning:

I was recently working on a cool dataset that looked unusually friendly. It was tidy, neat, interesting… the kind of thing you rarely encounter in the wild! My goal was to build a super simple predictor for one of the features. However, I kept getting poor results and at first couldn’t figure out what was happening.
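The underlying lesson is to look at your feature distributions before you model. A quick, generic check in R (my own sketch, not code from the post) can be as simple as:

# Quick look at a feature's distribution before modeling (generic sketch)
summary(mtcars$mpg)                  # five-number summary plus the mean
hist(mtcars$mpg, breaks = 20,
     main = "Distribution of mpg")   # a heavily skewed or near-constant feature is a red flag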

There’s some good, practical advice in there, so check it out. H/T R-Bloggers

Backtesting Options Strategies in R

Holger von Jouanne-Diedrich is in the money:

Options trading strategies are strategies where you combine derivative instruments, often several of them, to create a certain risk-return profile (more on that here: Financial Engineering: Static Replication of any Payoff Function). Often we want to know how those strategies would fare in the real world.

The problem is that real data on derivatives are hard to come by and/or very expensive. But we can help ourselves with a very good proxy: implied volatility, which is freely available for many indices, for example. With that, we can use the good old Black-Scholes model to reasonably price options whose strikes are not too far away from the current price of the underlying.
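To give a flavor of that, here is a minimal Black-Scholes call pricer in R (my own sketch, not the author's code); in practice you would feed sigma from an implied volatility series:

# Black-Scholes price of a European call (generic sketch)
bs_call <- function(S, K, r, sigma, tau) {
  d1 <- (log(S / K) + (r + 0.5 * sigma^2) * tau) / (sigma * sqrt(tau))
  d2 <- d1 - sigma * sqrt(tau)
  S * pnorm(d1) - K * exp(-r * tau) * pnorm(d2)
}

# A call 5% out of the money, 30 days to expiry, 20% implied volatility
bs_call(S = 100, K = 105, r = 0.01, sigma = 0.20, tau = 30 / 365)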

Read on to see how.

Simulating Prediction Intervals

Bryan Shalloway continues a series:

Part 1 of my series of posts on building prediction intervals used data held out from model training to evaluate the characteristics of prediction intervals. In this post I will use hold-out data to estimate the width of the prediction intervals directly. Doing so can provide more reasonable and flexible intervals compared to analytic approaches.
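The gist, sketched here with simulated data rather than Bryan's code, is to take quantiles of the hold-out residuals and use them to set the interval width:

# Set prediction-interval width from hold-out residuals (simulated data for illustration)
set.seed(42)
train   <- data.frame(x = runif(200)); train$y   <- 3 * train$x + rnorm(200)
holdout <- data.frame(x = runif(100)); holdout$y <- 3 * holdout$x + rnorm(100)

fit <- lm(y ~ x, data = train)
resid_holdout <- holdout$y - predict(fit, holdout)

# 90% interval: point prediction plus the 5th and 95th percentiles of the residuals
q <- quantile(resid_holdout, probs = c(0.05, 0.95))
new_pred <- unname(predict(fit, data.frame(x = 0.5)))
c(lower = new_pred + q[[1]], fit = new_pred, upper = new_pred + q[[2]])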

Click through for the article, and be sure to check out part 1 if you haven’t already.

sparklyr 1.6 Released

Carly Driggers announces a new release of sparklyr:

Sparklyr, an LF AI & Data Foundation Incubation Project, has released version 1.6! Sparklyr is an R Language package that lets you analyze data in Apache Spark, the well-known engine for big data processing, while using familiar tools in R. The R Language is widely used by data scientists and statisticians around the world and is known for its advanced features in statistical computing and graphics. 
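If you have not tried it, a minimal session looks roughly like this (assuming a local Spark installation; this is a generic sketch, not something from the release notes):

# Minimal sparklyr session against a local Spark instance (generic sketch)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()              # bring the aggregated result back into R

spark_disconnect(sc)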

Click through to see the changes.

Working with Prediction Intervals

Bryan Shalloway explains how generating prediction intervals is different from making point predictions:

Before using the model for predictive inference, one should have reviewed overall performance on a holdout dataset to ensure the model is sufficiently accurate for the business context. For example, for our problem, is an average error of ~12% and a 90% prediction interval of +/- ~25% of Sale_Price useful? If the answer is "no," that suggests the need for more effort in improving the accuracy of the model (e.g. trying other transformations, features, or model types). For our examples we are assuming the answer is "yes," our model is accurate enough (so it is appropriate to move on and focus on prediction intervals).
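As a rough illustration of that kind of check (my own sketch with made-up numbers, not the post's code), you can compute the average percentage error and the interval coverage on a hold-out set:

# Judging accuracy and interval coverage on a hold-out set (made-up data for illustration)
set.seed(1)
holdout <- data.frame(actual = rlnorm(500, meanlog = 12, sdlog = 0.3))
holdout$pred  <- holdout$actual * exp(rnorm(500, sd = 0.12))  # pretend model predictions
holdout$lower <- holdout$pred * 0.75                          # pretend 90% interval bounds
holdout$upper <- holdout$pred * 1.25

mean(abs(holdout$pred - holdout$actual) / holdout$actual)                # average % error
mean(holdout$actual >= holdout$lower & holdout$actual <= holdout$upper)  # interval coverage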

Click through for the article.

Last Observation Carried Forward in R

Nathan Eastwood shows how to perform Last Observation Carried Forward in R:

Real life data is often riddled with missing values – or NAs – where no data value is stored for a variable in an observation. Missing data such as this can have a significant effect on the conclusions which can be drawn from the data – for example, when individuals drop out of a study or subjects do not properly report responses. A common solution to this problem is to fill those NA values with the most recent non-NA value prior to it – this is called the Last Observation Carried Forward (LOCF) method.

This blog post will look at how to implement this particular solution using a combination of {dplyr}, {dbplyr} and {sparklyr}.
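For comparison, on a local data frame the same idea can be written with {dplyr} plus {tidyr}'s fill() (the post itself works through the Spark-backed version with {dbplyr} and {sparklyr}):

# LOCF on a local data frame with dplyr + tidyr (the post targets Spark instead)
library(dplyr)
library(tidyr)

df <- tibble(
  id    = c(1, 1, 1, 2, 2, 2),
  day   = c(1, 2, 3, 1, 2, 3),
  value = c(10, NA, NA, 5, NA, 7)
)

df %>%
  group_by(id) %>%
  arrange(day, .by_group = TRUE) %>%
  fill(value, .direction = "down") %>%   # carry the last observation forward within each id
  ungroup()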

Click through for the solution.

Generating Random Numbers in R

Holger von Jouanne-Diedrich brings the noise:

In data science, we try to find, sometimes well-hidden, patterns (= signal) in often seemingly random data (= noise). Pseudo-Random Number Generators (PRNG) try to do the opposite: hiding a deterministic data generating process (= signal) by making it look like randomness (= noise). If you want to understand some basics behind the scenes of this fascinating topic, read on!
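As a taste of how simple a PRNG can be (a textbook linear congruential generator, not necessarily the one Holger discusses), consider:

# A tiny linear congruential generator (textbook constants; generic illustration)
lcg <- function(n, seed = 42, a = 1664525, c = 1013904223, m = 2^32) {
  out <- numeric(n)
  x <- seed
  for (i in seq_len(n)) {
    x <- (a * x + c) %% m   # fully deterministic update...
    out[i] <- x / m         # ...that looks like uniform noise on [0, 1)
  }
  out
}

round(lcg(5), 4)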

Click through for an explanation of the process.

Creating a Rose Chart in R

Neil Saunders takes a look at a classic chart:

I first heard Florence Nightingale and her Geeks Declare War on Death, an episode of the Cautionary Tales podcast, when it premiered as a special episode of 99% Invisible. It discusses Nightingale’s work as a statistician and, in particular, her visualisation of mortality causes in the Crimean War using the famous “rose chart”, or polar area diagram.

I’m sure you’re thinking: how can I explore that using R? 
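One common route (a generic ggplot2 sketch with invented numbers, not necessarily the approach in the post) is to draw bars and wrap them around a circle with coord_polar():

# Rose chart / polar area diagram with ggplot2 (illustrative data)
library(ggplot2)

monthly <- data.frame(
  month  = factor(month.abb, levels = month.abb),
  deaths = c(32, 28, 25, 20, 17, 15, 14, 16, 19, 23, 27, 31)
)

ggplot(monthly, aes(x = month, y = deaths)) +
  geom_col(fill = "steelblue", width = 1) +
  coord_polar() +    # wrap the bars around a circle to get the rose shape
  labs(title = "Rose chart of monthly counts (illustrative data)")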

Read on to find out.

knitr Options and Hooks

The folks at Jumping Rivers conclude a series:

As with many aspects of programming, when you are working by yourself you can be (slightly) more lax with styles and set-up. However, as you start working in a team, different styles can quickly become a hindrance and lead to errors.

Using {knitr} is no different. When you work on documents with different team members, it’s helpful to have a consistent set of settings. If the default for eval changes, this can easily waste time as you try to track down an error. At Jumping Rivers, we use {knitr} a lot: from our training courses, to providing feedback to clients, to constructing monthly reports on clients’ infrastructure. The great thing about {knitr} is that it’s really easy to customise. The bad thing is that, without some care, it’s really easy for every member of the team to have different default options. This proliferation of different defaults means that when we pick up someone else’s document, mistakes may creep in.
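A typical way to keep a team consistent is a shared setup chunk that fixes the defaults with knitr::opts_chunk$set() (generic values for illustration, not Jumping Rivers’ actual settings):

# Shared setup chunk near the top of each .Rmd (generic defaults for illustration)
knitr::opts_chunk$set(
  echo       = TRUE,    # show the code in the rendered document
  eval       = TRUE,    # actually run every chunk
  message    = FALSE,   # hide package start-up messages
  warning    = FALSE,   # hide warnings in the output
  fig.width  = 7,
  fig.height = 5
)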

Read on for different options they use to keep things consistent.
