Press "Enter" to skip to content

Category: R

Simulating Prediction Intervals

Bryan Shalloway continues a series:

Part 1 of my series of posts on building prediction intervals used data held out from model training to evaluate the characteristics of prediction intervals. In this post I will use hold-out data to estimate the width of the prediction intervals directly. Doing so can provide more reasonable and flexible intervals compared to analytic approaches.

Click through for the article, and be sure to check out part 1 if you haven’t already.
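If you want a flavor of the idea before clicking through, here is a minimal sketch (not Bryan's code, and using made-up data rather than Sale_Price): fit a model on a training split, take quantiles of the hold-out residuals, and add them to new point predictions.

```r
# Minimal sketch of a hold-out-based 90% prediction interval (toy data).
set.seed(42)
n  <- 500
df <- data.frame(x = runif(n, 0, 10))
df$y <- 2 * df$x + rnorm(n, sd = 3)

train   <- df[1:350, ]
holdout <- df[351:500, ]

fit    <- lm(y ~ x, data = train)
resids <- holdout$y - predict(fit, holdout)

# Empirical 5th and 95th percentiles of the hold-out residuals give a 90% band
q <- quantile(resids, probs = c(0.05, 0.95))

new_data <- data.frame(x = c(2, 5, 8))
point    <- predict(fit, new_data)
data.frame(new_data, lower = point + q[1], fit = point, upper = point + q[2])
```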


sparklyr 1.6 Released

Carly Driggers announces a new release of sparklyr:

Sparklyr, an LF AI & Data Foundation Incubation Project, has released version 1.6! Sparklyr is an R Language package that lets you analyze data in Apache Spark, the well-known engine for big data processing, while using familiar tools in R. The R Language is widely used by data scientists and statisticians around the world and is known for its advanced features in statistical computing and graphics. 

Click through to see the changes.
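If you haven't tried sparklyr before, the basic workflow looks something like this. It's a minimal sketch against a local Spark instance and isn't tied to anything new in 1.6.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance, copy an R data frame in, and use familiar
# dplyr verbs; Spark does the work until collect() brings results back to R.
sc <- spark_connect(master = "local")

mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```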


Working with Prediction Intervals

Bryan Shalloway explains how generating prediction intervals is different from making point predictions:

Before using the model for predictive inference, one should have reviewed overall performance on a holdout dataset to ensure the model is sufficiently accurate for the business context. For example, for our problem, is an average error of ~12% and 90% prediction intervals of +/- ~25% of Sale_Price useful? If the answer is “no,” that suggests the need for more effort in improving the accuracy of the model (e.g. trying other transformations, features, model types). For our examples we are assuming the answer is “yes,” our model is accurate enough (so it is appropriate to move on and focus on prediction intervals).

Click through for the article.
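For contrast with the analytic approach Bryan discusses, here is a minimal sketch of a textbook 90% prediction interval from a linear model, using the built-in cars dataset rather than the post's Sale_Price data.

```r
# Analytic 90% prediction interval from a linear model (toy example).
fit <- lm(dist ~ speed, data = cars)

predict(fit,
        newdata  = data.frame(speed = c(10, 15, 20)),
        interval = "prediction",
        level    = 0.90)
```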


Last Observation Carried Forward in R

Nathan Eastwood shows how to perform Last Observation Carried Forward in R:

Real life data is often riddled with missing values – or NAs – where no data value is stored for the variable in an observation. Missing data such as this can have a significant effect on the conclusions which can be drawn from the data; for example, individuals dropping out of a study or subjects not properly reporting responses. A common solution to this problem is to fill those NA values with the most recent non-NA value prior to it – this is called the Last Observation Carried Forward (LOCF) method.

This blog post will look at how to implement this particular solution using a combination of {dplyr}, {dbplyr} and {sparklyr}.

Click through for the solution.
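If you only need this locally rather than against Spark, a minimal sketch with tidyr::fill() gets you LOCF in a couple of lines. This is not the {dplyr}/{dbplyr}/{sparklyr} approach the post develops, and the data here is made up.

```r
library(dplyr)
library(tidyr)

# Made-up readings with gaps; carry the last observed value forward within each id
readings <- tibble(
  id    = c(1, 1, 1, 2, 2, 2),
  day   = c(1, 2, 3, 1, 2, 3),
  value = c(10, NA, NA, 5, NA, 7)
)

readings %>%
  group_by(id) %>%
  arrange(day, .by_group = TRUE) %>%
  fill(value, .direction = "down") %>%   # LOCF: replace NAs with the previous value
  ungroup()
```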


Generating Random Numbers in R

Holger von Jouanne-Diedrich brings the noise:

In data science, we try to find, sometimes well-hidden, patterns (= signal) in often seemingly random data (= noise). Pseudo-Random Number Generators (PRNG) try to do the opposite: hiding a deterministic data generating process (= signal) by making it look like randomness (= noise). If you want to understand some basics behind the scenes of this fascinating topic, read on!

Click through for an explanation of the process.
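As a taste of the deterministic signal hiding behind the noise, here is a minimal sketch of the Park-Miller linear congruential generator. This is purely illustrative and is not what R uses by default (R's default generator is the Mersenne Twister).

```r
# Park-Miller "minimal standard" LCG: x[n+1] = (a * x[n]) mod m.
# Seed and sample size are arbitrary choices for illustration.
lcg <- function(n, seed = 42) {
  a <- 16807
  m <- 2^31 - 1
  state <- seed
  out <- numeric(n)
  for (i in seq_len(n)) {
    state  <- (a * state) %% m   # deterministic update
    out[i] <- state / m          # scale to (0, 1)
  }
  out
}

lcg(5)
```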


Creating a Rose Chart in R

Neil Saunders takes a look at a classic chart:

I first heard Florence Nightingale and her Geeks Declare War on Death, an episode of the Cautionary Tales podcast, when it premiered as a special episode of 99% Invisible. It discusses Nightingale’s work as a statistician and in particular, her visualisation of mortality causes in the Crimean War using the famous “rose chart”, or polar area diagram.

I’m sure you’re thinking: how can I explore that using R? 

Read on to find out.
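A minimal ggplot2 sketch of the idea (using made-up monthly counts, not Nightingale's data) is just a bar chart wrapped onto polar coordinates.

```r
library(ggplot2)

# Made-up monthly counts standing in for Nightingale's mortality data
df <- data.frame(
  month  = factor(month.abb, levels = month.abb),
  deaths = c(320, 280, 310, 250, 200, 150, 120, 110, 140, 180, 240, 300)
)

# Plot sqrt(deaths) so wedge *area*, not radius, is proportional to the count,
# which is how a polar area diagram is meant to be read.
ggplot(df, aes(x = month, y = sqrt(deaths))) +
  geom_col(width = 1, fill = "steelblue", colour = "white") +
  coord_polar() +
  labs(title = "Polar area ('rose') chart", x = NULL, y = NULL)
```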


knitr Options and Hooks

The folks at Jumping Rivers conclude a series:

As with many aspects of programming, when you are working by yourself you can be (slightly) more lax with styles and set-up. However, as you start working in a team, different styles can quickly become a hindrance and lead to errors.

Using {knitr} is no different. When you work on documents with different team members, it’s helpful to have a consistent set of settings. If the default for eval changes, this can easily waste time as you try to track down an error. At Jumping Rivers, we use {knitr} a lot: from our training courses, to providing feedback to clients, to constructing monthly reports on clients’ infrastructure. The great thing about {knitr} is it’s really easy to customise. The bad thing is that without some care, it’s really easy for every member of the team to have different default options. This proliferation of different default options means that when we pick up someone else’s document, mistakes may creep in.

Read on for different options they use to keep things consistent.
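The core trick is pinning defaults in a shared setup chunk. Here is a minimal sketch; the specific option values and the timing hook are my own assumptions, not Jumping Rivers' house settings.

```r
# Typically the first chunk of the .Rmd: set shared defaults once so every
# team member renders with the same options.
knitr::opts_chunk$set(
  echo       = TRUE,
  eval       = TRUE,
  warning    = FALSE,
  message    = FALSE,
  fig.width  = 7,
  fig.height = 5
)

# Chunk hooks run before/after chunks; this one times any chunk that sets
# timeit = TRUE and writes the elapsed time into the output.
knitr::knit_hooks$set(timeit = local({
  start <- NULL
  function(before, options, envir) {
    if (before) {
      start <<- Sys.time()
    } else {
      paste("Chunk took",
            round(difftime(Sys.time(), start, units = "secs"), 2),
            "seconds.")
    }
  }
}))
```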


Fuzzy Joins in SQL Server using R

Rajendra Gupta shows how you can use R in SQL Server Machine Learning Services to perform fuzzy joins:

Suppose you have a web page where users write comments in a text box, and you are performing data analysis on those comments. There are a few spelling mistakes, and you want to perform an approximate match or fuzzy lookup against another dataset. Similarly, suppose you have a product catalog database. Your users search for a product; however, they might not type the exact keyword for the product name. Using fuzzy joins, we can return products that approximately match the product names users typed.

SQL Server Machine Learning Services with R enables you to execute R language queries inside SQL Server. In previous articles, we explored a few use cases of machine learning and the R scripts for several related topics.

It’s R, so there’s already a package in CRAN for that.
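One such option is the fuzzyjoin package. A minimal sketch of an approximate join on misspelled product names (made-up data) looks like this:

```r
library(dplyr)
library(fuzzyjoin)

# Made-up catalog and user searches containing typos
catalog  <- tibble(product_name = c("laptop", "keyboard", "monitor"))
searches <- tibble(search_term  = c("labtop", "keybord", "moniter"))

# Join on Levenshtein edit distance instead of exact equality
stringdist_inner_join(
  searches, catalog,
  by       = c("search_term" = "product_name"),
  method   = "lv",
  max_dist = 2
)
```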


k-gram Language Models in R

Valerio Gherardi takes us through the concept of k-grams:

The post is structured as follows: we start by giving a succinct theoretical introduction to k-gram models. Subsequently, we illustrate how to train a k-gram model in R using kgrams, and explain how to use the standard perplexity metric for model evaluation or tuning. Finally, we use our trained model to generate some random text at different temperatures.

This goes into some depth on the topic and is worth giving a careful read.
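If the underlying idea is new to you, a k-gram model is built on counts of length-k token sequences. Here is a minimal base-R sketch of just the counting step on a toy corpus; the kgrams package handles training, smoothing, and text generation properly.

```r
# Count k-grams (here trigrams) within each sentence of a toy corpus.
kgram_counts <- function(sentences, k = 3) {
  grams <- unlist(lapply(strsplit(tolower(sentences), "\\s+"), function(tokens) {
    if (length(tokens) < k) return(character(0))
    vapply(seq_len(length(tokens) - k + 1), function(i) {
      paste(tokens[i:(i + k - 1)], collapse = " ")
    }, character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

kgram_counts(c("the cat sat on the mat", "the cat ate the fish"))
```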


The Basics of k-Means Clustering

Nathaniel Schmucker explains some of the principles of k-means clustering:

k-Means is easy to implement. In R, you can use the function kmeans() to quickly deploy an efficient k-Means algorithm. On datasets of reasonable size (thousands of rows), the kmeans function runs in fractions of a second.

k-Means is easy to interpret (in 2 dimensions). If you have two features of your k-Means analysis (e.g., you are grouping by length and width), the result of the k-Means algorithm can be plotted on an xy-coordinate system to show the extent of each cluster. It’s easy to visually inspect the assignment to see if the k-Means analysis returned a meaningful insight. In more dimensions (e.g., length, width, and height) you will need to either create a 3D plot, summarize your features in a table, or find another alternative to describing your analysis. This loses the intuitive power that a 2D k-Means analysis has in convincing you or your audience that your analysis should be trusted. It’s not to say that your analysis is wrong; it simply takes more mental focus to understand what your analysis says.

The k-Means analysis, however, is not always the best choice. k-Means does well on data that naturally falls into spherical clusters. If your data has a different shape (linear, spiral, etc.), k-Means will force clustering into circles, which can result in outputs that defy human expectations. The algorithm is not wrong; we have fed the algorithm data it was never intended to understand.

There’s a lot of depth in this article, which makes it really interesting.
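If you want to see the two-feature case he describes, here is a quick sketch with base R's kmeans(); the feature choice and k = 3 are arbitrary picks for illustration.

```r
set.seed(42)

# Two features so the clusters can be inspected on an xy plot
dat <- iris[, c("Petal.Length", "Petal.Width")]

fit <- kmeans(dat, centers = 3, nstart = 25)

plot(dat, col = fit$cluster, pch = 19, main = "k-means clusters (k = 3)")
points(fit$centers, col = seq_len(nrow(fit$centers)), pch = 8, cex = 2)
```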
