Press "Enter" to skip to content

Category: R

Tips for Improving Code Performance in R

Mira Celine Klein continues a series on code performance in R:

This is the second part of our series on code performance in R. It contains many approaches to reducing the time your code needs to run. It's useful to know these ideas before you start writing new code, but they also help when optimizing existing code.

If you have already written some code you want to speed up but don't know which part of it is actually slow, I recommend you read the first part of this series, on profiling. That article also introduces the microbenchmark package, which we are going to use to measure code performance in this article.

Let's start with a seemingly obvious rule which is, however, not always easy to follow.

Read on for some tips. H/T R-bloggers.
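As a taste of the tooling involved, here's a minimal microbenchmark sketch (my own illustration, not code from the post) comparing an explicit loop against its vectorized equivalent:

```r
library(microbenchmark)

x <- runif(1e5)

# Compare an explicit loop against the vectorized built-in
microbenchmark(
  loop       = { s <- 0; for (v in x) s <- s + v; s },
  vectorized = sum(x),
  times      = 100
)
```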

Comments closed

Writing SQL to Query R data.frames

Tomaz Kastrun tries out a package:

There are many R packages for querying SQL databases. Recently, I was looking into the sqldf package (CRAN documentation).

There are so many great advantages (simple running of SQL statements; creating, loading, and deleting data in data.frames; connectivity to many databases; support for SQL functions, data types, and much more), but the one that was really a major win was the interaction between data frames and the SQL language.

Between sqldf and dbplyr, you get it both ways: treat a data.frame like a SQL table, or treat a SQL database like R data.frames.
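For a sense of what that looks like in practice, here's a minimal sketch (mine, not Tomaz's) of running SQL directly against a data.frame with sqldf:

```r
library(sqldf)  # uses an in-memory SQLite database behind the scenes by default

df <- data.frame(id    = 1:6,
                 grp   = rep(c("a", "b"), each = 3),
                 value = c(10, 20, 30, 5, 15, 25))

# The data.frame is queried as if it were a SQL table named df
sqldf("SELECT grp, AVG(value) AS avg_value FROM df GROUP BY grp")
```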

Comments closed

Plotting XGBoost Trees with R

Andrew Treadway shows off a method to visualize the results of training an XGBoost model:

In this post, we’re going to cover how to plot XGBoost trees in R. XGBoost is a very popular machine learning algorithm, which is frequently used in Kaggle competitions and has many practical use cases.

Let's start by loading the packages we'll need. Note that plotting XGBoost trees requires the DiagrammeR package, so even if you already have xgboost installed, you'll need to make sure you also have DiagrammeR.

Click through for the process. H/T R-Bloggers.
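If you just want to see the plotting call in action, here's a minimal sketch using the agaricus data bundled with xgboost (my own example; the post walks through its own dataset):

```r
library(xgboost)  # plotting also requires DiagrammeR to be installed

data(agaricus.train, package = "xgboost")
bst <- xgboost(data = agaricus.train$data,
               label = agaricus.train$label,
               max_depth = 3,
               nrounds = 2,
               objective = "binary:logistic",
               verbose = 0)

# Render the first boosted tree (trees are 0-indexed)
xgb.plot.tree(model = bst, trees = 0)
```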

Comments closed

Random Sequences and Probabilities

Holger von Jouanne-Diedrich explains the results of a poll:

Some time ago I conducted a poll on LinkedIn that quickly went viral. I asked which of three different coin-tossing sequences was the most likely, and I received exactly 1,592 votes! Nearly 48,000 people viewed it, and there are more than 80 comments under the post (you need a LinkedIn account to see it in full: LinkedIn Coin Tossing Poll).

In this post I will give the solution with some background explanation, so read on!

Read on to understand why, when flipping a coin, you're just as likely to see the sequence H,H,H,H,H,H as H,T,H,T,H,T.
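You can convince yourself of this with a quick simulation (a sketch of my own, not from the post): both sequences turn up in roughly 1/64 of six-flip runs.

```r
set.seed(42)

n <- 1e5  # number of simulated six-flip runs
flips <- matrix(sample(c("H", "T"), n * 6, replace = TRUE), ncol = 6)
seqs  <- apply(flips, 1, paste, collapse = "")

mean(seqs == "HHHHHH")  # ~ 1/64, i.e. about 0.0156
mean(seqs == "HTHTHT")  # ~ 1/64, i.e. about 0.0156
```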

Comments closed

Troubleshooting Code Performance in R

Mira Celine Klein shows how to benchmark R code performance:

Let's assume you have written some code; it's working and computes the results you need, but it is really slow. If you don't want to be slowed down in your work, you have no choice but to improve the code's performance. But how to start? The best approach is to find out where to begin optimizing.

It is not always obvious which part of the code makes it so slow, or which of multiple alternatives is fastest. There is a risk of spending a lot of time optimizing the wrong part of the code. Fortunately, there are ways to systematically test how long a computation takes. An easy way is the function system.time: just wrap your code in this function and you will (in addition to the actual results of that code) get the time your code took to run.

But that’s not the only route—read on to learn about other techniques as well and see them in action.
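To make the system.time idea concrete, a minimal sketch (my own, using nothing beyond base R):

```r
slow_sum <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total
}

x <- runif(1e7)
system.time(slow_sum(x))  # "elapsed" is the wall-clock time in seconds
system.time(sum(x))       # the vectorized built-in, for comparison
```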

Comments closed

Check Those Feature Distributions

Antoine Rebecq shares a warning:

I was recently working on a cool dataset that looked unusually friendly. It was tidy, neat, interesting... the kind of thing you rarely encounter in the wild! My goal was to build a super simple predictor for one of the features. However, I kept getting poor results and at first couldn't figure out what was happening.

There's some good, practical advice in there, so check it out. H/T R-Bloggers.
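The habit the post encourages, eyeballing each feature's distribution before modeling, is cheap to pick up in base R (a trivial sketch of my own):

```r
# Quick distribution checks on a candidate feature before modeling
summary(iris$Sepal.Width)
hist(iris$Sepal.Width, breaks = 20, main = "Distribution of Sepal.Width")
```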

Comments closed

Backtesting Options Strategies in R

Holger von Jouanne-Diedrich is in the money:

Options trading strategies are strategies in which you combine derivatives instruments, often several of them, to create a certain risk-return profile (more on that here: Financial Engineering: Static Replication of any Payoff Function). Often we want to know how those strategies would fare in the real world.

The problem is that real data on derivatives are hard to come by and/or very expensive. But we can help ourselves with a very good proxy: implied volatility, which is freely available for many indices, for example. With that, we can use the good old Black-Scholes model to reasonably price options whose strikes are not too far from the current price of the underlying.

Read on to see how.
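For reference, the standard Black-Scholes formula mentioned in the excerpt fits in a few lines of R (this is the textbook formula, not the code from the post):

```r
# Black-Scholes price of a European option
# S: spot, K: strike, r: risk-free rate, sigma: implied volatility,
# T: time to expiry in years
black_scholes <- function(S, K, r, sigma, T, type = c("call", "put")) {
  type <- match.arg(type)
  d1 <- (log(S / K) + (r + sigma^2 / 2) * T) / (sigma * sqrt(T))
  d2 <- d1 - sigma * sqrt(T)
  if (type == "call") {
    S * pnorm(d1) - K * exp(-r * T) * pnorm(d2)
  } else {
    K * exp(-r * T) * pnorm(-d2) - S * pnorm(-d1)
  }
}

# A call struck 5% above spot, one month out, at 20% implied volatility
black_scholes(S = 100, K = 105, r = 0.01, sigma = 0.20, T = 1 / 12)
```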

Comments closed

Simulating Prediction Intervals

Bryan Shalloway continues a series:

Part 1 of my series of posts on building prediction intervals used data held out from model training to evaluate the characteristics of prediction intervals. In this post I will use hold-out data to estimate the width of the prediction intervals directly. Doing so can provide more reasonable and flexible intervals compared to analytic approaches.

Click through for the article, and be sure to check out part 1 if you haven’t already.
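To sketch the general idea in miniature (this is my own toy illustration, not Bryan's implementation): fit on a training set, then use quantiles of the hold-out residuals as empirical interval offsets.

```r
set.seed(1)
train   <- mtcars[1:22, ]
holdout <- mtcars[23:32, ]

fit <- lm(mpg ~ wt + hp, data = train)
resid_holdout <- holdout$mpg - predict(fit, holdout)

# Hold-out residual quantiles become empirical interval offsets
q <- quantile(resid_holdout, c(0.05, 0.95))
preds <- predict(fit, holdout)
cbind(lower = preds + q[["5%"]], upper = preds + q[["95%"]])
```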

Comments closed

sparklyr 1.6 Released

Carly Driggers announces a new release of sparklyr:

Sparklyr, an LF AI & Data Foundation Incubation Project, has released version 1.6! Sparklyr is an R package that lets you analyze data in Apache Spark, the well-known engine for big data processing, while using familiar tools in R. The R language is widely used by data scientists and statisticians around the world and is known for its advanced features in statistical computing and graphics.

Click through to see the changes.
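If you haven't used sparklyr before, the basic workflow looks roughly like this (a minimal local-mode sketch of my own, not tied to anything new in 1.6):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance and copy an R data frame into it
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars)

# Familiar dplyr verbs are translated to Spark SQL behind the scenes
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

spark_disconnect(sc)
```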

Comments closed

Working with Prediction Intervals

Bryan Shalloway explains how generating prediction intervals is different from making point predictions:

Before using the model for predictive inference, one should have reviewed overall performance on a holdout dataset to ensure the model is sufficiently accurate for the business context. For example, for our problem, is an average error of ~12% and 90% prediction intervals of +/- ~25% of Sale_Price useful? If the answer is "no," that suggests the need for more effort in improving the accuracy of the model (e.g., trying other transformations, features, or model types). For our examples we are assuming the answer is "yes," our model is accurate enough, so it is appropriate to move on and focus on prediction intervals.

Click through for the article.
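As a quick reminder of the mechanics (my own base-R sketch, unrelated to the post's models), predict() on a linear model can return either a point prediction or an analytic prediction interval:

```r
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
new_obs <- iris[1:3, ]

predict(fit, new_obs)  # point predictions only

# 90% prediction intervals: fit, lwr, and upr columns
predict(fit, new_obs, interval = "prediction", level = 0.90)
```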

Comments closed