Category: R

Two Ways to Access Kafka Topics from R

Patrick Neff shows us a couple of ways to build a Kafka-to-R pipeline:

In Data Science projects, we distinguish between descriptive analytics and statistical models running in production. Overall, these can be seen as one process. You start with analyzing historical data to gain insights, find correlations, and finally develop and optimize your model. Then you transfer it and use it in your running system. A key point for every data scientist is not just the mathematical skills themselves, but also how to get the data into your analytics program.

In this blog post, we focus exactly on this crucial step: retrieving the data. In a second article, we’ll talk about running your model on real-time data.

Click through for the techniques.

Font Choices with ggplot2

Kenneth Tay takes us through font options in R’s ggplot2 package:

I was recently asked to convert all the fonts in my ggplot2-generated figures for a paper to Times New Roman. It turns out that this is easy, but it brought up a whole host of questions that I don’t have the full answer to.

If you want to go all out with using custom fonts, I suggest looking into the extrafont and showtext packages. This post will focus on what you can do without importing additional packages.

A quick word of warning: R’s behavior with respect to fonts differs quite a bit between Windows and Mac/Linux. This becomes especially apparent if you do end up installing something like extrafont.

H/T R-Bloggers.
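If all you need is a Times-like face, the no-extra-packages route the post describes boils down to setting the font family in the theme. A minimal sketch (mtcars is a stand-in dataset; "serif" is one of R’s built-in family aliases and usually resolves to Times New Roman, but as the quote notes, the mapping is platform-dependent):

```r
library(ggplot2)

# Set the font family for every text element in the plot.
# "serif" is a built-in alias; what it maps to depends on the OS.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Weight vs. fuel economy") +
  theme(text = element_text(family = "serif"))
```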

Reasons to Use Tidymodels

Roel Hogervorst explains when we may or may not want to use tidymodels versus rolling our own models in R:

When not

– If you are always using GLM models (they are very flexible!), it makes no sense to me to go for the extra {parsnip} layer if you are always using the same models. You could still consider using {recipes} for feature engineering.

– If you are familiar with the kind of data and what models will work on that data. Basically, you are an expert in this field and have worked on it for many years. There is no need to experiment.

Read on for concrete examples of when it does make sense. H/T R-Bloggers.
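To make the “extra {parsnip} layer” point concrete, here is a minimal sketch (mtcars as stand-in data) of the same logistic regression fit both ways; the abstraction only starts paying off once you want to swap models or engines later:

```r
library(parsnip)

# parsnip's classification models expect a factor outcome
dat <- transform(mtcars, am = factor(am))

# Base R: if a GLM is all you ever fit, this is all you need
fit_base <- glm(am ~ wt + hp, data = dat, family = binomial)

# The same model through {parsnip}: an extra layer now, but the
# model specification stays put if you change the engine later
fit_parsnip <- fit(
  set_engine(logistic_reg(), "glm"),
  am ~ wt + hp,
  data = dat
)
```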

Parallelizing R Code

Mira Celine Klein walks us through some of the basics of parallel code execution in R:

In many cases, your code fulfills multiple independent tasks, for example, if you do a simulation with five different parameter sets. The five processes don’t need to communicate with each other, and they don’t need any result from any other process. They could even be run simultaneously on five different computers… or processor cores. This is called parallelization. Modern desktop computers usually have 16 or more processor cores. To find out how many cores you have on your PC, use the function detectCores(). By default, R uses only one core, but this article tells you how to use multiple cores. If your simulation needs 20 hours to complete with one core, you may get your results within four hours thanks to parallelization!

Read on to see how you can accomplish this, but note that it is operating system-dependent.
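As a concrete starting point, here is a minimal sketch using the {parallel} package that ships with base R; the cluster-based approach works on every operating system, and the simulation function is a hypothetical placeholder:

```r
library(parallel)

detectCores()  # how many cores this machine has

# A PSOCK cluster works on Windows as well as Mac/Linux;
# leave one core free for the rest of the system
cl <- makeCluster(detectCores() - 1)

# Run one (hypothetical) simulation per parameter set, in parallel
results <- parLapply(cl, 1:5, function(param) {
  mean(rnorm(1e6, mean = param))
})

stopCluster(cl)

# On Mac/Linux only, forking is a lighter-weight alternative:
# results <- mclapply(1:5, function(param) mean(rnorm(1e6, mean = param)),
#                     mc.cores = detectCores() - 1)
```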

Random Forest Feature Importance

Selcuk Disci takes us through an important concept with random forest models:

The random forest algorithm averages these results; that is, it reduces the variance by training on different parts of the training set. This increases the performance of the final model, although it comes at the cost of a small increase in bias.

The random forest uses the bootstrap aggregating (bagging) algorithm. We take a training sample X = x1, …, xn with outputs Y = y1, …, yn. The bagging process is repeated B times, each time drawing a random sample with replacement from the training set and fitting a tree to it. This fitted function is denoted f_b in the formula below.
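The formula itself didn’t survive the excerpt; what it refers to is the standard bagging average, in which the predictions of the B individual trees are averaged at a new point x′:

$$\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$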

As far as the article goes, inflation is always and everywhere a monetary phenomenon. H/T R-Bloggers.
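If you want to poke at the feature importance piece without following along with the article’s inflation data, a minimal sketch using the {randomForest} package (iris as a stand-in dataset; the article may well use different tooling):

```r
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500, importance = TRUE)

importance(rf)  # per-feature MeanDecreaseAccuracy / MeanDecreaseGini
varImpPlot(rf)  # quick visual ranking of the features
```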

The Future of R with SQL Server

James Rowland-Jones has an update for us:

The importance of R was first recognized by the SQL Server team back in 2016 with the launch of SQL ML Services and R Server. Over the years we have added Python to SQL ML Services in 2017 and Java support through our language extensions in 2019. Earlier this year we also announced the general availability of SQL ML Services into Azure SQL Managed Instance. SparkR, sparklyr, and PySpark are also available as part of SQL Server Big Data Clusters. We remain committed to R.

With that said, much has changed in the world of data science and analytics since 2016. Microsoft’s approach to open-source software has undergone a similar transformation in the same period. It is therefore time for us to share how we, in Azure SQL and SQL Server, are changing to meet the needs of our users and the R community moving forward.

I never used ML Server (but have used SQL Server ML Services a lot), so that part of the announcement doesn’t affect me, and I’m not sure how many organizations it does affect. Switching to CRAN R is a good idea, and I appreciate that they’re open-sourcing the RevoScaleR and revoscalepy code bases. The one thing I’d really like to see in vNext’s Machine Learning Services is an easy way to update the version of R.

Using ggplot2 to Create a Faceted Histogram plus Curve

Sebastian Sauer builds a combo chart:

Overlaying a histogram (possibly facetted) with a curve is not something far-fetched when analyzing data. Surprisingly, it appears (to the best of my knowledge) that there’s no comfortable out-of-the-box solution in ggplot2, although it can of course be achieved with a few lines of code. Here’s my take.
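For context, the baseline most people reach for looks something like the following sketch, which overlays a kernel density curve rather than the fitted normal curve Sebastian builds (mtcars as a stand-in dataset):

```r
library(ggplot2)

# Put the histogram on the density scale so the bars and the
# curve share a y-axis, then facet by cylinder count
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10) +
  geom_density(color = "red") +
  facet_wrap(~ cyl)
```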

Click through for Sebastian’s version, as well as information on the ggh4x library.
