Press "Enter" to skip to content

Category: R

Sympathy For The Part-Timer

John Mount wants us to think about part-time users:

The second point, I think, is particularly interesting. It means:

An R user who does not consider themselves an expert programmer could be maintaining code that they understand, but could not be expected to create from scratch.

Or:

Let’s have some sympathy for the part-time R user.

This is the point we will emphasize in our new example.

Read on for a particular example. I think this is good advice to generalize: write your code to make it as easy as possible for “part-time” users. This applies to custom code you write as well: unless you work in a particular part of the code base constantly, you’ll forget the details later and have the same problems a part-timer has when working with an unfamiliar language.


Explaining Singular Value Decomposition

Tim Bock explains how Singular Value Decomposition works:

The table above is a matrix of numbers. I am going to call it Z. The singular value decomposition is computed using the svd function. The following code computes the singular value decomposition of the matrix Z, and assigns it to a new object called SVD, which contains one vector, d, and two matrices, u and v. The vector, d, contains the singular values. The first matrix, u, contains the left singular vectors, and v contains the right singular vectors. The left singular vectors represent the rows of the input table, and the right singular vectors represent their columns.

Tim includes R scripts to follow along, and for this topic I definitely recommend following along.
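In the meantime, here is a minimal sketch of the decomposition described above, using a small hypothetical matrix in place of Tim's table:

# Z stands in for the table of numbers from the post
Z <- matrix(c(4, 0, 3, -5), nrow = 2)

SVD <- svd(Z)
SVD$d   # the singular values
SVD$u   # left singular vectors (represent the rows of Z)
SVD$v   # right singular vectors (represent the columns of Z)

# Sanity check: Z = U D V', so the pieces rebuild the original matrix
all.equal(Z, SVD$u %*% diag(SVD$d) %*% t(SVD$v))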


A New ODBC Package For R

David Smith looks at the odbc package in R:

The odbc package is a from-the-ground-up implementation of an ODBC interface for R that provides native support for additional data types (including dates, timestamps, raw binary, and 64-bit integers) and parameterized queries. The odbc package provides connections with any ODBC-compliant database, and has been comprehensively tested on SQL Server, PostgreSQL and MySQL. Benchmarks show that it’s also somewhat faster than RODBC: 3.2 times faster for reads, and 1.9 times faster for writes.

Sounds like odbc lets you run ad hoc queries and also lets you use dplyr as an ORM, similar to using LINQ in C#.
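For flavor, here is a hedged sketch of a connection plus a parameterized query through the standard DBI interface; the driver name, server, database, and table are assumptions, so adjust them for your environment:

library(DBI)

# Connection details below are hypothetical
con <- dbConnect(odbc::odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",
                 Server   = "localhost",
                 Database = "TestDB",
                 Trusted_Connection = "Yes")

# Parameterized query: the value is bound separately from the SQL text
res <- dbSendQuery(con, "SELECT * FROM dbo.Sales WHERE Region = ?")
dbBind(res, list("West"))
sales <- dbFetch(res)
dbClearResult(res)

dbDisconnect(con)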


sparklyr 0.6 Released

Javier Luraschi announces sparklyr 0.6:

We’re excited to announce a new release of the sparklyr package, available in CRAN today! sparklyr 0.6 introduces new features to:

  • Distribute R computations using spark_apply() to execute arbitrary R code across your Spark cluster. You can now use all of your favorite R packages and functions in a distributed context.

  • Connect to External Data Sources using spark_read_source(), spark_write_source(), spark_read_jdbc(), and spark_write_jdbc().

  • Use the Latest Frameworks including dplyr 0.7, DBI 0.7, RStudio 1.1, and Spark 2.2.

I’ve been impressed with sparklyr so far.
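The spark_apply() piece is the headliner. As a rough sketch (assuming a local Spark installation; this is not the announcement's exact code):

library(sparklyr)

sc <- spark_connect(master = "local")

# Copy a local data frame to Spark; sparklyr converts dots in column
# names to underscores, hence Sepal_Length below
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)

# spark_apply() runs the supplied R function on each partition, which
# arrives as an ordinary data frame; here we center a column within
# each partition as a toy transformation
centered <- spark_apply(iris_tbl, function(df) {
  df$Sepal_Length <- df$Sepal_Length - mean(df$Sepal_Length)
  df
})

spark_disconnect(sc)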


R6 Classes In R

David Smith explains what R6 classes are in R:

The big advantage of R6 is that it makes it much easier to implement some common data structures in a user-friendly manner. For example, to implement a stack “pop” operation in S3 or S4 you have to do something like this:

x <- topval(mystack)
mystack <- remove_top(mystack)

In R6, the implementation is much simpler to use:

x <- mystack$pop()
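For comparison, here is a minimal sketch of what such a stack might look like as an R6 class (not David's exact code):

library(R6)

Stack <- R6Class("Stack",
  public = list(
    items = list(),
    push = function(x) {
      self$items[[length(self$items) + 1]] <- x
      invisible(self)
    },
    pop = function() {
      top <- self$items[[length(self$items)]]
      self$items[[length(self$items)]] <- NULL
      top
    }
  )
)

mystack <- Stack$new()
mystack$push(42)
x <- mystack$pop()   # mutates mystack in place; no reassignment needed

Because R6 objects have reference semantics, pop() can both return the top value and remove it, which is exactly what the S3/S4 version above cannot do in one step.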

David links to some good resources on the topic, so check those out as well.


R Services Internals

Niels Berglund has an excellent series on R Services internals.  Here’s the latest post:

This post is the ninth post about Microsoft SQL Server R Services, and the eighth post that drills down into the internals of how it works.

So far in this series we have been looking at what happens in SQL Server as well as the launchpad service when we execute sp_execute_external_script, and we still have no real “clue” as to where the R engine comes into play.

Well, hopefully that will change (at least a little bit) with this post, as here we will look at what happens when we leave the launchpad service.

This series is like candy to me.  It’s the best write-up I’ve seen so far about what’s really happening when you run SQL Server R Services.


Multiple Data Sets In External Scripts

Tomaz Kastrun shows a workaround to the “one data set” limit in sp_execute_external_script:

Some of the arguments of the procedure sp_execute_external_script are enumerated. This is valid for the input dataset, and as the name of the argument @input_data_1 suggests, one can easily (and reasonably) think there could also be an @input_data_2 argument, and so on. Unfortunately, this is not true. The external procedure can hold only one T-SQL dataset, inserted through this parameter.

There are many reasons for that; one would be the cost of sending several datasets to the external process and back. This forces the user to rethink and pre-prepare the dataset (meaning, do all the data munging beforehand) prior to sending it into the external procedure.

But there are workarounds for how to pass an additional query or queries to sp_execute_external_script. I am not advocating this, and I strongly disagree with such usage, but here it is.

It does feel like a hinky solution, but sometimes you just need to get two data sets in.
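The gist of the workaround is to open a second connection back to SQL Server from inside the R script itself. Here is a hedged sketch of what that inner R code might look like; the connection string and query are assumptions, and this is not necessarily Tomaz's exact approach:

# This would run inside the @script block of sp_execute_external_script;
# InputDataSet holds the one data set passed via @input_data_1
library(RODBC)

ch <- odbcDriverConnect("Driver=SQL Server;Server=localhost;Database=TestDB;Trusted_Connection=Yes")
second_ds <- sqlQuery(ch, "SELECT CustomerID, Total FROM dbo.Orders")
odbcClose(ch)

# Both InputDataSet and second_ds are now available for munging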


Random Forests In R

Anish Sing Walia explains the basics of random forests and provides sample code in R:

Random Forests are similar to a famous ensemble technique called bagging, but have a different tweak to them. In random forests the idea is to decorrelate the several trees which are generated on the different bootstrapped samples from the training data, and then we simply reduce the variance in the trees by averaging them. Averaging the trees helps us to reduce the variance and also improve the performance of decision trees on the test set and eventually avoid overfitting.

The idea is to build lots of trees in such a way as to make the correlation between the trees smaller.

Random forests frequently give a good answer to classification problems, enough so as to make them a nice starting point.
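A minimal sketch with the randomForest package (not Anish's exact code):

library(randomForest)

set.seed(42)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))

# mtry is the decorrelation tweak: only a random subset of predictors
# is considered at each split of each tree
fit <- randomForest(Species ~ ., data = iris[train_idx, ],
                    ntree = 500, mtry = 2)

# Confusion matrix on the held-out rows
preds <- predict(fit, iris[-train_idx, ])
table(predicted = preds, actual = iris[-train_idx, "Species"])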


Using seplyr Instead Of dplyr

John Mount explains seplyr and why it can be better for certain use cases than dplyr:

seplyr is a dplyr adapter layer that prefers “slightly clunkier” standard interfaces (or referentially transparent interfaces), which are actually very powerful and can be used to some advantage.

The above description and comparisons can come off as needlessly broad and painfully abstract. Things are much clearer if we move away from theory and return to our practical example.

Click through for a great example, and also read John’s comment on the Pascal-style assignment operator he uses.
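For flavor, here is a small sketch of the standard-evaluation style, where grouping columns and summary expressions are passed as character vectors. The := below is the Pascal-style named-map-builder operator John mentions, and the *_se verb names are as I recall them from seplyr's documentation:

library("seplyr")

# Grouping columns and summary expressions are plain character data,
# so they can be built up programmatically
grouped <- group_by_se(datasets::mtcars, "cyl")
summarize_se(grouped, c("mean_hp"  := "mean(hp)",
                        "mean_mpg" := "mean(mpg)"))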


R For Apache Impala

Ian Cook describes implyr, an R interface for Apache Impala:

dplyr provides a grammar of data manipulation, consisting of a set of verbs (including mutate(), select(), filter(), summarise(), and arrange()) that can be used together to perform common data manipulation tasks. The implyr package helps dplyr translate this grammar into Impala-compatible SQL commands. This gives R users access to Impala’s scale and speed on large distributed datasets while using the same familiar dplyr syntax that they are accustomed to using on local data frames and other data sources. R users can also choose to directly write SQL commands and execute them on Impala using implyr.

implyr builds upon recent work from RStudio and other contributors, including major updates to the packages dplyr and DBI, and new packages dbplyr and odbc. implyr together with these packages enables data scientists and data engineers to more easily interact with Impala through self-service data science tools like Cloudera Data Science Workbench.
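A rough sketch of the connection pattern (the driver name and host details are assumptions; implyr also supports JDBC connections):

library(implyr)
library(dplyr)

# Connection parameters below are hypothetical -- adjust for your cluster
impala <- src_impala(drv = odbc::odbc(),
                     driver   = "Cloudera ODBC Driver for Impala",
                     host     = "impala-host.example.com",
                     port     = 21050,
                     database = "default")

# From here, ordinary dplyr verbs against this tbl are translated into
# Impala-compatible SQL
flights <- tbl(impala, "flights")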

It looks like this project is off to a good start already.
