RevoScaleR

Kevin Feasel

2017-02-07

R

Tomaz Kastrun explains how the RevoScaleR package is useful:

The RevoScaleR package and its computational functions were designed for parallel computation with no memory limitation, mainly because this package introduced its own file format, called XDF. The eXternal Data Frame was designed for fast processing of smaller chunks of data, and it gains its efficiency when reading and writing XDF data by loading chunks of data into RAM one at a time and only what is needed. Done this way, the size of RAM is no longer a limitation, and computations run much faster (because the algorithms are written in C++, which is faster than the originals, which were written in an interpreted language). Data scientists still make a single R call, but R will use the DistributedR component to determine how many cores, sockets and threads are available and then launch a smaller portion of the load into each thread, analyzing the data a bit at a time. With XDF, data is retrieved many times, but since it is 5-10 times smaller (as I have already shown in previous blog posts when compared to *.txt or *.csv files), and it is written and stored in the XDF file the same way as it was extracted from memory, it enables faster computations, because no parsing of data chunks is required and because the way the data is stored minimizes its retrieval time.

If you’re using SQL Server R Services, these rx functions will become very important to you.
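
The chunk-at-a-time idea in the excerpt can be sketched in a few lines. This is only an illustration of the general technique, written in plain Python rather than with RevoScaleR itself: the helper names `iter_chunks` and `chunked_mean`, the `delay` column, and the tiny in-memory sample are all hypothetical, and nothing here reflects how XDF is actually implemented.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(handle, column, chunk_size=2):
    """Mean of one numeric column, holding only one chunk in memory at a time."""
    reader = csv.DictReader(handle)
    total = 0.0
    count = 0
    for chunk in iter_chunks(reader, chunk_size):
        # Only this chunk's values are materialized; earlier chunks are gone.
        values = [float(row[column]) for row in chunk]
        total += sum(values)
        count += len(values)
    if count == 0:
        raise ValueError("no rows to summarize")
    return total / count


# Hypothetical in-memory "file" standing in for a data set too large for RAM.
sample = io.StringIO("id,delay\n1,10\n2,20\n3,30\n4,40\n5,50\n")
mean_delay = chunked_mean(sample, "delay", chunk_size=2)
print(mean_delay)
```

Because the running total and count are the only state carried between chunks, memory use depends on the chunk size rather than on the size of the file. The same partial results could also be computed per chunk on separate cores and combined at the end, which is the parallel side of what the excerpt describes.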

Related Posts

Using Cohen’s D for Experiments

Nina Zumel takes us through Cohen’s D, a useful tool for determining effect sizes in experiments: Cohen’s d is a measure of effect size for the difference of two means that takes the variance of the population into account. It’s defined as d = |μ1 – μ2| / σpooled, where σpooled is the pooled standard deviation over both cohorts. […]


Comparing Iterator Performance in R

Ulrik Stervbo has a performance comparison of for, apply, and map functions in R: It is usually said that for- and while-loops should be avoided in R. I was curious about just how the different alternatives compare in terms of speed. The first loop is perhaps the worst I can think of – the return vector is […]

