Category: R

Let’s work an example in R. Suppose we are working with a linear regression model and from our donor system we have extracted the following representation of the model as “intercept” and “betas”.
intercept <- 3 betas <- c(weight = 2, height = 4)
Our goal is to build a linear regression model that has the above coefficients. The way we are going to do this is by building our own synthetic data set such that the regression fit through this data set yields these coefficients.

It’s fairly straightforward to do this for linear models; as things get more complicated, however, the difficulty level spikes.

Comments closed

Microsoft’s R Roadmap

Published 2019-07-05 by Kevin Feasel

David Smith has a review of Microsoft’s R roadmap, focusing on Azure:

The post references this guide to the machine learning services in Azure, along with their supported languages. Services that currently support R include Azure Machine Learning Studio, SQL Server Microsoft Machine Learning Service, Microsoft Machine Learning Server, Azure Data Science Virtual Machine, Azure Databricks, and more.

David links to this strategy post:

The R and Python programming languages are primary citizens for data science on the Azure AI Platform. These are the most common languages for performing data preparation, transformation, training and operationalization of machine learning models; the core components for one’s digital transformation leveraging AI. Yet they are fundamentally different in many aspects, directly affecting not only deployed solutions IT architectures but also but also corporate strategies for developer skills and product supportability.

This series of articles is designed help you understand the options your company and customers have to support and evolve their R strategy.

It’s good to see some of this out in the open for planning purposes.

Comments closed

Using data.table to Add Aggregate Values to Data Frames

Published 2019-07-03 by Kevin Feasel

John Mount shows how you can combine := and by in the data.table package to add a new column with the results of an aggregation in R:

The “by” signals we are doing a per-group calculation, and the “:=” signals to land the results in the original data.table. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row.

It’s worth reading up on data.table if you aren’t familiar with the great things it can do.

Comments closed

Performing Row-Wise Operations with pmap

Published 2019-07-03 by Kevin Feasel

Sebastian Sauer shows how you can use pmap in the purrr library to perform row-wise aggregations:

Rowwwise operations are a quite frequent operations in data analysis. The R language environment is particularly strong in column wise operations. This is due to technical reasons, as data frames are internally built as column-by-column structures, hence column wise operations are simple, rowwise more difficult.
This post looks at some rather general way to comput rowwise statistics. Of course, numerous ways exist and there are quite a few tutorials around, notably by Jenny Bryant, and by Emil Hvitfeldt to name a few.

The ideal solution is to have your data be properly columnar, but if you’re in a pinch, it’s good to know that you can do this.

Comments closed

Random Forest on Small Numbers of Observations

Published 2019-06-27 by Kevin Feasel

Neil Saunders takes us through an interesting problem:

A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient?
The random forest was not working as expected: not one of the nutrients was correctly classified. I pointed out that with only one case for each outcome, this was to be expected – as the random forest algorithm samples a proportion of the rows, no correct predictions are likely in this case. As sometimes happens the question was promptly deleted, which was unfortunate as we could have further explored the problem.

Neil decided to explore the problem further regardless and came to some interesting conclusions.

Comments closed

Using Cohen’s D for Experiments

Published 2019-06-21 by Kevin Feasel

Nina Zumel takes us through Cohen’s D, a useful tool for determining effect sizes in experiments:

Cohen’s d is a measure of effect size for the difference of two means that takes the variance of the population into account. It’s defined as
d = | μ₁ – μ₂ | / σ_pooled
where σ_pooled is the pooled standard deviation over both cohorts.

Read the whole thing.

Comments closed

Comparing Iterator Performance in R

Published 2019-06-21 by Kevin Feasel

Ulrik Stervbo has a performance comparison for for, apply, and map functions in R:

It is usually said, that for– and while-loops should be avoided in R. I was curious about just how the different alternatives compare in terms of speed.
The first loop is perhaps the worst I can think of – the return vector is initialized without type and length so that the memory is constantly being allocated.

The performance of map isn’t great, though the benefits to me are less about performance and more about readability. H/T R-bloggers

Comments closed

Visualizing with Heatmaps in R

Published 2019-06-17 by Kevin Feasel

Anisa Dhana shows how you can create a quick heatmap plot in R:

To give your own colors use the scale_fill_gradientn function.
ggplot(dat, aes(Age, Race)) +
geom_raster(aes(fill = BMI)) +
scale_fill_gradientn(colours=c("white", "red"))

This is a quick example using ggplot2 but there are other heatmap libraries available too.

Comments closed

Predicting Intermittent Demand

Published 2019-06-13 by Kevin Feasel

Bruno Rodrigues shows one technique for forecasting intermittent data:

Now, it is clear that this will be tricky to forecast. There is no discernible pattern, no trend, no seasonality… nothing that would make it “easy” for a model to learn how to forecast such data.
This is typical intermittent demand data. Specific methods have been developed to forecast such data, the most well-known being Croston, as detailed in this paper. A function to estimate such models is available in the {tsintermittent} package, written by Nikolaos Kourentzes who also wrote another package, {nnfor}, which uses Neural Networks to forecast time series data. I am going to use both to try to forecast the intermittent demand for the {RDieHarder} package for the year 2019.

Read the whole thing. H/T R-Bloggers

Comments closed

Learning R Versus Python

Published 2019-06-12 by Kevin Feasel

Andy Kirk shares the results of a rather informal Twitter poll:

Yesterday I ran a simple Twitter poll about the relative ease of learning R vs. Python. Although a correct answer to this query will ALWAYS have to be based on nuances like pre-existing skills and the scope of need, this originates from people telling me they encounter job or career profiles that list a need for R and/or Python. If they don’t have either, if they prioritised the pursuit of just one, which would be possible to develop a degree of competency more easily, more quickly and more efficiently?

Andy has also created a Twitter moment from the responses.

My thought, based only on the question itself, is that R would be better than Python because the hypothetical person has no additional programming skills. For someone with additional programming skills, the breakdown for me starts with, if your background is statistics, database development, or functional programming, you probably want R; if your background is object-oriented development or imperative programming, you probably want Python. And then it gets nuanced.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31