R – Page 52 – Curated SQL

BCP from R into SQL Server

Writing large datasets to SQL Server can be very slow using the DBI package with an odbc connection. The issue with writing data is that an individual INSERT statement is generated for each row of data. I’ve also had issues with remote connections that can make large writes to SQL Server take a very long time. SQL Server Management Studio does provide a GUI for importing data that is much more efficient. For those who want to include the data import in their reproducible R workflows, there are a couple of options.

Read on to see how it works. It’s still calling bcp.exe under the covers, so expect the same foibles you would run into with bcp itself. H/T R-Bloggers.

Performance Tips when Working with Large Datasets in R

Published 2021-08-06 by Kevin Feasel

Mira Celine Klein continues a series on performance tuning R code:

Whether your dataset is “large” depends not only on the number of rows, but also on the method you are going to use. It’s easy to compute the mean or sum of as many as 10,000 numbers, but a nonlinear regression with many variables can already take some time with a sample size of 1,000.
Sometimes it may help to parallelize (see part 3 of the series). But with large datasets, you can use parallelization only up to the point where working memory becomes the limiting factor. In addition, there may be tasks that cannot be parallelized at all. In these cases, the strategies from part 2 of this series may be helpful, and there are some more ways:

Click through for four options.

Caching Function Results in an R Package

Published 2021-08-03 by Kevin Feasel

Maëlle Salmon and Christophe Dervieux show us ways to cache results of function calls using R:

Caching means that if you call a function several times with the exact same input, the function is only actually run the first time. The result is stored in a cache of some sort (more practical details later!). Every other time the function is called with the same input, the result is retrieved from the cache unless invalidated. You will often think of caching as something valid in only one R session, but we’ll see it can be persistent across sessions via storage on disk.

As a quick note, this makes sense when writing functions in the strict sense: expressions without side effects. If you have side effects, caching might not give you what you expect.
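
To make the idea concrete, here is a minimal in-memory sketch. It is written in Python rather than R purely for illustration and is not the approach from the linked post; `slow_square` and `call_count` are hypothetical names.

```python
from functools import lru_cache

call_count = 0  # hypothetical counter, only here to show when the body really runs


@lru_cache(maxsize=None)
def slow_square(x):
    """The return value depends only on x, so caching the result is safe."""
    global call_count
    call_count += 1  # a side effect: it happens only on a cache miss
    return x * x


first = slow_square(12)   # computed, then stored in the cache
second = slow_square(12)  # same input: served from the cache, body not re-run
third = slow_square(13)   # new input: computed again
```

After these three calls the counter has been bumped only twice. The return values are what you expect, but the side effect was silently skipped on the cached call, which is exactly the surprise described above.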

Working with Trees of Data in R

Published 2021-08-03 by Kevin Feasel

Martin Stingl shows off the data.tree package:

Lately I tried to visualize a hierarchy with Tableau Desktop. The problem was that the hierarchy had a variable depth because it was tree-based. Each row had an id and a parent_id. Normally hierarchies in Tableau are defined by pulling some fields together, such as product category, product group and product id.
Handling tree-based hierarchies seems to be a lot more complex. I found a plugin at https://github.com/tableau/extension-hierarchy-navigator-sandboxed but this only works online.
So I asked myself how I can handle this using R. I found the R-package data.tree at https://github.com/gluc/data.tree. I want to describe how I use this package to preprocess my data.

Read on to see how this works and how you can turn a classical data representation of a tree (ID and parent ID) into a flattened structure with a fixed number of levels. H/T R-Bloggers.
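
As a rough sketch of what that flattening involves (in Python for illustration, not the data.tree code from the post), the function below walks from each node up to the root and pads the resulting path to a fixed width. The name `flatten_tree`, the `max_levels` parameter, and the sample rows are all assumptions made for this example.

```python
def flatten_tree(rows, max_levels):
    """Turn (id, parent_id) pairs into fixed-width root-to-node paths.

    rows: iterable of (node_id, parent_id); parent_id is None for a root.
    Returns {node_id: [level_1, ..., level_max]}, padded with None.
    """
    parent_of = dict(rows)
    flattened = {}
    for node_id in parent_of:
        path = []
        current = node_id
        # Walk up towards the root, collecting ancestors along the way.
        while current is not None:
            if current in path:
                raise ValueError(f"cycle detected at node {current!r}")
            path.append(current)
            current = parent_of.get(current)
        path.reverse()  # root first, the node itself last
        if len(path) > max_levels:
            raise ValueError(f"node {node_id!r} is deeper than {max_levels} levels")
        flattened[node_id] = path + [None] * (max_levels - len(path))
    return flattened


# Hypothetical sample data: node 1 is the root, node 4 sits two levels below it.
rows = [(1, None), (2, 1), (3, 1), (4, 2)]
table = flatten_tree(rows, max_levels=3)
```

Each entry in `table` is one row of the flattened structure, with one column per level, so a tool that expects fixed hierarchy fields can consume it directly.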

Inferring Median from a Few Values

Published 2021-07-28 by Kevin Feasel

Holger von Jouanne-Diedrich is stuck in the middle with you:

Let us dive directly into the matter, the Small Data Rule states:
In a sample of five numerical values from any unknown population, the median of this population lies between the smallest and the largest sample value with 94 percent certainty.
The “population” can be anything, like data about age in a population, income in a country, television consumption, donation amounts, body sizes, temperatures and so on.

This is a very interesting concept. Five values won’t give you the median, but they will give you bounds that contain it with high probability. And check out the comments: adding a few more data points increases that probability even further.
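
The 94 percent figure can be checked with a short calculation, assuming the five values are drawn independently from a continuous distribution, so that each value falls below the population median m with probability 1/2. The median lies outside the sample range only if all five values land on the same side of it:

```latex
\begin{align*}
P(\text{all five below } m) &= \left(\tfrac{1}{2}\right)^{5} = \tfrac{1}{32}, \\
P(\text{all five above } m) &= \left(\tfrac{1}{2}\right)^{5} = \tfrac{1}{32}, \\
P(x_{\min} \le m \le x_{\max}) &= 1 - 2 \cdot \tfrac{1}{32} = \tfrac{15}{16} = 0.9375 \approx 94\%.
\end{align*}
```

Under the same assumptions, a sample of n values gives 1 - 2^(1-n), so each extra data point halves the chance of missing the median.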

Two Ways to Access Kafka Topics from R

Published 2021-07-21 by Kevin Feasel

Patrick Neff shows us a couple of ways to build a Kafka-to-R pipeline:

In Data Science projects, we distinguish between descriptive analytics and statistical models running in production. Overall, these can be seen as one process. You start with analyzing historical data to gain insights, find correlations, and finally develop and optimize your model. Then you transfer it and use it in your running system. A key point for every data scientist is not just the mathematical skills themselves, but also how to get the data into your analytics program.
In this blog post, we focus exactly on this crucial step: retrieving the data. In a second article, we’ll talk about running your model on real-time data.

Click through for the techniques.

Euler’s Equation in R

Published 2021-07-16 by Kevin Feasel

Holger von Jouanne-Diedrich takes us through Euler’s equation:

In this post, we will first give some intuition for and then demonstrate what is often called the most beautiful formula in mathematics, Euler’s identity, in R – first numerically with base R and then also symbolically, so read on!

Do check it out, even if the term “Euler’s equation” means nothing to you.
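
For a quick numerical taste of the identity, here is a small check. It is a sketch in Python rather than the R code from the linked post, and the variable names are made up for this example.

```python
import cmath
import math

# Euler's identity states that e^(i*pi) + 1 = 0.
residual = cmath.exp(1j * math.pi) + 1

# math.pi is only a floating-point approximation of pi, so the residual is
# tiny rather than exactly zero; compare against a small tolerance instead.
is_zero = abs(residual) < 1e-12
```

The leftover value is pure floating-point rounding error, which is why a symbolic treatment is needed to get an exact zero.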

Designing Colorful Line Graphs in ggplot2

Published 2021-07-16 by Kevin Feasel

Tomaz Kastrun has fun with colors:

How about some colours in a line graph?
Or even more wacky? Nevertheless, let’s create a function that generates some sample “mocked” data and draws a line chart:

The outputs look a lot like waveforms of spoken language, so you’ve got that going for you.

Checking the Weather with R

Published 2021-07-12 by Kevin Feasel

Tomaz Kastrun wants to know if it’s raining outside:

Besides looking at your phone, checking the web, or sticking your head out of the window to check the weather, a useless way to do it is to write a function that will tell you just that. I know, oh… the absurdity. Nevertheless, the function is just one of many possibilities:

Click through for the code.

Font Choices with ggplot2

Published 2021-07-09 by Kevin Feasel

Kenneth Tay takes us through font options in R’s ggplot2 package:

I was recently asked to convert all the fonts in my ggplot2-generated figures for a paper to Times New Roman. It turns out that this is easy, but it brought up a whole host of questions that I don’t have the full answer to.
If you want to go all out with using custom fonts, I suggest looking into the extrafont and showtext packages. This post will focus on what you can do without importing additional packages.

A quick word of warning: R’s behavior with respect to fonts differs quite a bit between Windows and Mac/Linux. This becomes especially apparent if you do end up installing something like extrafont. H/T R-Bloggers.
