Press "Enter" to skip to content

Category: R

Faster User-Defined Functions In SparkR

Liang Zhang and Hossein Falaki note a major performance improvement for functions in SparkR using the latest version of the Databricks Runtime:

SparkR offers four APIs that run a user-defined function in R to a SparkDataFrame

  • dapply()
  • dapplyCollect()
  • gapply()
  • gapplyCollect()

dapply() allows you to run an R function on each partition of the SparkDataFrame and returns the result as a new SparkDataFrame, on which you may apply other transformations or actions. gapply() allows you to apply a function to each grouped partition consisting of a key and the corresponding rows in a SparkDataFrame. dapplyCollect() and gapplyCollect()are shortcuts if you want to call collect() on the result.

The following diagram illustrates the serialization and deserialization performed during the execution of the UDF. The data gets serialized twice and deserialized twice in total, all of which are row-wise.

By vectorizing data serialization and deserialization in Databricks Runtime 4.3, we encode and decode all the values of a column at once. This eliminates the primary bottleneck which row-wise serialization, and significantly improves SparkR’s UDF performance. Also, the benefit from the vectorization is more drastic for larger datasets.

It looks like they get some pretty serious gains from this change.

Comments closed

Subsetting Matrices In R

Dave Mason continues his look at matrices in R:

We can extract an entire row from a matrix. To do this, specify the desired row only within the square brackets [ ]. The placeholder where you would otherwise specify the column is left empty.

> #Points scored by Kendrick Perkins.
> points_scored_by_quarter[1,]
1st 2nd 3rd 4th 2 2 6 0
> points_scored_by_quarter["Perkins",]
1st 2nd 3rd 4th 2 2 6 0 

Conversely, we can extract a column from a matrix. Specify the column within the square brackets [ ]and omit the row. The result is a vector, thus the pivot effect–the row names are displayed in the output (not the column name).

Dave points out that working with matrices is basically an extension of working with vectors.

Comments closed

Microsoft R Open 3.5.1

David Smith announces Microsoft R Open 3.5.1:

Microsoft R Open 3.5.1 has been released, combining the latest R language engine with multi-processor performance and tools for managing R packages reproducibly. You can download Microsoft R Open 3.5.1 for Windows, Mac and Linux from MRAN now. Microsoft R Open is 100% compatible with all R scripts and packages, and works with all your favorite R interfaces and development environments.

This update brings a number of minor fixes to the R language engine from the R core team. It also makes available a host of new R packages contributed by the community, including packages for downloading financial data, connecting with analytics systems, applying machine learning algorithms and statistical models, and many more. New R packages are released every day, and you can access packages released after the 1 August 2018 CRAN snapshot used by MRO 3.5.1 using the checkpoint package.

Read on for more and check out the updates.

Comments closed

Performing Linear Regression With Power BI

Jason Cantrell shows how to create a simple linear regression in Power BI:

Linear Regression is a very useful statistical tool that helps us understand the relationship between variables and the effects they have on each other. It can be used across many industries in a variety of ways – from spurring value to gaining customer insight – to benefit business.

The Simple Linear Regression model allows us to summarize and examine relationships between two variables. It uses a single independent variable and a single dependent variable and finds a linear function that predicts the dependent variable values as a function of the independent variables.

If you want real linear regression, drop in an R or Python script.

Comments closed

Principal Component Analysis With Faces

Mic at The Beginner Programmer shows us how to creepy PCA diagrams with human faces:

PCA looks for a new the reference system to describe your data. This new reference system is designed in such a way to maximize the variance of the data across the new axis. The first principal component accounts for as much variance as possible, as does the second and so on. PCA transforms a set of (tipically) correlated variables into a set of uncorrelated variables called principal components. By design, each principal component will account for as much variance as possible. The hope is that a fewer number of PCs can be used to summarise the whole dataset. Note that PCs are a linear combination of the original data.

The procedure simply boils down to the following steps

  1. Scale (normalize) the data (not necessary but suggested especially when variables are not homogeneous).

  2. Calculate the covariance matrix of the data.

  3. Calculate eigenvectors (also, perhaps confusingly, called “loadings”) and eigenvalues of the covariance matrix.

  4. Choose only the first N biggest eigenvalues according to one of the many criteria available in the literature.

  5. Project your data in the new frame of reference by multipliying your data matrix by a matrix whose columns are the N eigenvectors associated with the N biggest eigenvalues.

  6. Use the projected data (very confusingly called “scores”) as your new variables for further analysis.

I like the explanations provided, and the data set is definitely something I’m not used to seeing with PCA.  H/T R-bloggers

Comments closed

Sorting With data.table Versus dplyr

John Mount shows us that data.table is way faster for sorting than dplyr‘s arrange function:

Notice on the above semi-log plot the run time ratio is growing roughly linearly. This makes sense: data.table uses a radix sort which has the potential to perform in near linear time (faster than the n log(n) lower bound known comparison sorting) for a range of problems (also we are only showing example sorting times, not worst-case sorting times).

In fact, if we divide the y in the above graph by log(rows) we get something approaching a constant.

John has also provided us with a markdown document for comparison.

Comments closed

Matrices In R

Dave Mason continues his perusal of R data types, this time looking at the matrix:

All of the examples so far have consisted of matrices with data elements of the same class. And for good reason: it’s a requirement for a matrix. R will coerce elements with mismatched classes to the same class. Here are two vectors, one of class integer and the other of class character. After combining them into a matrix via rbind(), we see the first row of data elements are of the character class (enclosed in double quotes):

> row1 <- c(1L, 2L, 3L, 4L)
> row2 <- c("a", "b", "c", "d")
> new_matrix <- rbind(row1, row2)
> new_matrix [,1] [,2] [,3] [,4]
row1 "1" "2" "3" "4"
row2 "a" "b" "c" "d"

Matrices drive a large number of statistical techniques, though I tend to deal with them less directly than I would have imagined.

Comments closed

Binning And Recoding In R

Sebastian Sauer shows a few methods of practical data reshaping in R:

Recoding means changing the levels of a variable, for instance changing “1” to “woman” and “2” to “man”. Binning means aggregating several variable levels to one, for instance aggregating the values From “1.00 meter” to “1.60 meter” to “small_size”.

Both operations are frequently necessary in practical data analysis. In this post, we review some methods to accomplish these two tasks.

Click through for examples of techniques you can use.

Comments closed

Working With Vectors In R

Dave Mason continues his quest to learn R, focusing on vectors.  First, he looks at vector-based mathematical operations:

Now we can determine the number of customers gained vs number of customers lost (plus/minus) for each month of the quarter by subtracting one vector from another. Each vector has the same number of elements (three), and the result is also a vector of three elements:

> net_customer_gain <- new_customers - customers_lost
> net_customer_gain
Jan Feb Mar
-15 30 3 

The sum() function can be used to add up all the elements of a vector. Below, we get the total number of new customers and lost customers for the first quarter:

> sum(new_customers)
[1] 270
> sum(customers_lost)
[1] 252

Then he shows off subsetting in vectors:

To extract multiple elements from a vector, pass in an integer class vector to the square brackets. The values of the integer vector correspond to the elements to be extracted. Here we will extract the first, third, and fourth elements of the jersey_numbers vector:

> jersey_numbers[c(1,3,4)]
Pierce Rondo Allen 34 9 20 

The values of the integer vector can be in any order:

> jersey_numbers[c(4,1,3)] Allen Pierce Rondo 20 34 9

Vectors are a critical part of understanding R.

Comments closed

The Problem With Meta-Packages

John Mount has a critique of meta-packages:

Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking on the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.

I’m not really sold on Jones’s argument, but I do think Mount has a good point.

Comments closed