
Category: R

Working with Trees of Data in R

Martin Stingl shows off the data.tree package:

Lately I tried to visualize a hierarchy with Tableau Desktop. The problem was that the hierarchy had a variable depth because it was tree-based. Each row had an id and a parent_id. Normally hierarchies in Tableau are defined by pulling some fields together, such as product category, product group and product id.

Handling tree-based hierarchies seems to be a lot more complex. I found a plugin at https://github.com/tableau/extension-hierarchy-navigator-sandboxed but this only works online.

So I asked myself how I can handle this using R. I found the R-package data.tree at https://github.com/gluc/data.tree. I want to describe how I use this package to preprocess my data.

Read on to see how this works and how you can turn a classical data representation of a tree (ID and parent ID) into a flattened structure with a fixed number of levels. H/T R-Bloggers.
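
To make the idea concrete, here is a minimal base R sketch of flattening an id/parent_id table into fixed level columns. It deliberately does not use data.tree, so it is not the article's method; the `nodes` data and the `path_to_root` helper are hypothetical, and the sketch assumes a single root marked by NA and no cycles.

```r
# Hypothetical example data: each row has an id and a parent_id (NA marks the root).
nodes <- data.frame(
  id        = c(1, 2, 3, 4, 5),
  parent_id = c(NA, 1, 1, 2, 4)
)

# Walk from a node up to the root and return the path ordered root -> node.
# Assumes the hierarchy contains no cycles.
path_to_root <- function(id, nodes) {
  path <- c()
  while (!is.na(id)) {
    path <- c(id, path)
    id <- nodes$parent_id[match(id, nodes$id)]
  }
  path
}

paths <- lapply(nodes$id, path_to_root, nodes = nodes)
max_depth <- max(lengths(paths))

# Pad each path with NA so every row has the same number of level columns.
flat <- t(sapply(paths, function(p) c(p, rep(NA, max_depth - length(p)))))
colnames(flat) <- paste0("level_", seq_len(max_depth))
flat <- data.frame(id = nodes$id, flat)
```

Each row of `flat` then holds one node together with its ancestors in `level_1` through `level_4`, which is the kind of fixed-depth shape that a tool expecting one field per hierarchy level can consume.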


Inferring Median from a Few Values

Holger von Jouanne-Diedrich is stuck in the middle with you:

Let us dive directly into the matter, the Small Data Rule states:

In a sample of five numerical values from any unknown population, the median of this population lies between the smallest and the largest sample value with 94 percent certainty.

The “population” can be anything, like data about age in a population, income in a country, television consumption, donation amounts, body sizes, temperatures and so on.

This is a very interesting concept. Five values won’t give you the median, but they will give you a bounded expectation with high likelihood. And check out the comments: adding a few more data points increases the expected likelihood even further.
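
The 94 percent figure can be checked with a short sketch. Assuming independent draws from a continuous population, each value falls below the median with probability 1/2, so the median is missed only when all n values land on the same side. The `coverage` helper and the choice of an exponential population for the simulation are illustrative assumptions, not taken from the article.

```r
# Probability that the population median lies between the sample min and max,
# assuming n independent draws from a continuous distribution.
coverage <- function(n) 1 - 2 * 0.5^n

coverage(5)     # 0.9375, i.e. roughly 94 percent
coverage(5:8)   # coverage for slightly larger samples

# Simulation check with a skewed population: the median of Exp(1) is log(2).
set.seed(42)
hits <- replicate(10000, {
  x <- rexp(5)
  min(x) < log(2) && log(2) < max(x)
})
mean(hits)      # should be close to coverage(5)
```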


Two Ways to Access Kafka Topics from R

Patrick Neff shows us a couple of ways to build a Kafka-to-R pipeline:

In Data Science projects, we distinguish between descriptive analytics and statistical models running in production. Overall, these can be seen as one process. You start with analyzing historical data to gain insights, find correlations, and finally develop and optimize your model. Then you transfer it and use it in your running system. A key point for every data scientist is not just the mathematical skills themselves, but also how to get the data into your analytics program.

In this blog post, we focus exactly on this crucial step: retrieving the data. In a second article, we’ll talk about running your model on real-time data.

Click through for the techniques.


Font Choices with ggplot2

Kenneth Tay takes us through font options in R’s ggplot2 package:

I was recently asked to convert all the fonts in my ggplot2-generated figures for a paper to Times New Roman. It turns out that this is easy, but it brought up a whole host of questions that I don’t have the full answer to.

If you want to go all out with using custom fonts, I suggest looking into the extrafont and showtext packages. This post will focus on what you can do without importing additional packages.

A quick word of warning: R’s behavior with respect to fonts differs quite a bit between Windows and Mac/Linux. This becomes especially apparent if you do end up installing something like extrafont. H/T R-Bloggers.
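
As a rough illustration of the no-extra-packages route, the sketch below switches a plot's text to the generic "serif" family through `theme()`. It assumes ggplot2 is installed and uses the built-in `mtcars` data purely as a placeholder. Which actual typeface "serif" resolves to (often a Times-like font) depends on the graphics device and operating system, so treat the result as device-dependent.

```r
library(ggplot2)

# Placeholder plot on the built-in mtcars data.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel economy vs. weight", x = "Weight", y = "Miles per gallon") +
  # Set the font family for all text elements; "serif", "sans" and "mono"
  # are the generic families that R graphics devices understand.
  theme(text = element_text(family = "serif"))
```

Printing `p` (or saving it with `ggsave()`) renders the figure with the device's serif font.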


Reasons to Use Tidymodels

Roel Hogervorst explains when we may or may not want to use tidymodels versus rolling our own models in R:

When not

– If you are always using GLM models (they are very flexible!). It makes no sense to me to go for the extra {parsnip} layer if you are always using the same models. You could still consider using recipes for feature engineering.

– If you are familiar with the kind of data and what models will work on that data. Basically you are an expert on this field and have worked on it for many years. There is no need to experiment.

Read on for concrete examples of when it does make sense. H/T R-Bloggers.


Parallelizing R Code

Mira Celine Klein walks us through some of the basics of parallel code execution in R:

In many cases, your code fulfills multiple independent tasks, for example, if you do a simulation with five different parameter sets. The five processes don’t need to communicate with each other, and they don’t need any result from any other process. They could even be run simultaneously on five different computers… or processor cores. This is called parallelization. Modern desktop computers usually have 16 or more processor cores. To find out how many cores you have on your PC, use the function detectCores(). By default, R uses only one core, but this article tells you how to use multiple cores. If your simulation needs 20 hours to complete with one core, you may get your results within four hours thanks to parallelization!

Read on to see how you can accomplish this, but note that it is operating system-dependent.
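
For a feel of what this looks like, here is a minimal sketch using the `parallel` package that ships with R. The `simulate` function and `param_sets` are hypothetical stand-ins for a real simulation, and the sketch uses a socket cluster because that approach works on Windows as well as macOS and Linux; fork-based helpers such as `mclapply()` do not run in parallel on Windows.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to one core.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1
n_workers <- max(1, min(2, n_cores))   # keep the example small: at most 2 workers

# Hypothetical simulation: five independent parameter sets.
param_sets <- c(1, 2, 3, 4, 5)
simulate <- function(rate) mean(rexp(1000, rate = rate))

cl <- makeCluster(n_workers)           # start the worker processes
clusterSetRNGStream(cl, 123)           # reproducible random numbers on the workers
results <- parLapply(cl, param_sets, simulate)
stopCluster(cl)                        # always release the workers
```

Because the five tasks share nothing, `parLapply()` can hand them to whichever worker is free, and `results` comes back as a list in the same order as `param_sets`.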


Random Forest Feature Importance

Selcuk Disci takes us through an important concept with random forest models:

The random forest algorithm averages these results; that is, it reduces the variance by training on different parts of the training set. This increases the performance of the final model, although it creates a small increase in bias.

Random forest uses the bootstrap aggregating (bagging) algorithm. We take a training sample X = x1, …, xn with outputs Y = y1, …, yn. The bagging process is repeated B times: each time it selects a random sample (with replacement) from the training set and fits a tree to that sample. The fitted function is denoted fb in the article's formula.
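
The bagging step described above can be sketched in a few lines of R. This assumes the `rpart` package (a recommended package bundled with most R installations) as the tree learner and uses made-up data; it is an illustration of plain bagging, not the article's code. A full random forest additionally samples a random subset of features at each split, which this sketch leaves out.

```r
library(rpart)

# Hypothetical training data: a noisy sine curve.
set.seed(1)
n <- 200
train <- data.frame(x = runif(n, 0, 10))
train$y <- sin(train$x) + rnorm(n, sd = 0.3)

B <- 25                                    # number of bootstrap rounds
new_data <- data.frame(x = c(2, 5, 8))     # points to predict

# Fit one tree f_b per bootstrap sample and collect its predictions.
preds <- sapply(seq_len(B), function(b) {
  idx <- sample(n, replace = TRUE)         # bootstrap: sample rows with replacement
  f_b <- rpart(y ~ x, data = train[idx, ])
  predict(f_b, newdata = new_data)
})

# Bagged prediction: the average of the B individual tree predictions.
bagged <- rowMeans(preds)
```

Each column of `preds` holds one tree's predictions, so comparing a single column against `bagged` shows how averaging smooths out the variation between individual trees.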

As far as the article goes, inflation is always and everywhere a monetary phenomenon. H/T R-Bloggers.
