
Category: R

R6 Classes In R

David Smith explains what R6 classes are in R:

The big advantage of R6 is that it makes it much easier to implement some common data structures in a user-friendly manner. For example, to implement a stack “pop” operation in S3 or S4 you have to do something like this:

x <- topval(mystack)
mystack <- remove_top(mystack)

In R6, the implementation is much simpler to use:

x <- mystack$pop()

David links to some good resources on the topic, so check those out as well.
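
To make the contrast concrete, here is a minimal R6 stack sketch; the class and method names are my own invention for illustration and not from David's post:

library(R6)

Stack <- R6Class("Stack",
  public = list(
    items = list(),
    push = function(x) {
      self$items[[length(self$items) + 1]] <- x
      invisible(self)
    },
    pop = function() {
      n <- length(self$items)
      if (n == 0) stop("stack is empty")
      top <- self$items[[n]]
      self$items[[n]] <- NULL  # drop the top element
      top
    }
  )
)

mystack <- Stack$new()
mystack$push(42)
x <- mystack$pop()  # returns 42 and shrinks mystack in place

Because R6 objects have reference semantics, pop() can both return the top value and mutate the stack, which is what the S3/S4 version needs two statements to accomplish.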


R Services Internals

Niels Berglund has an excellent series on R Services internals.  Here’s the latest post:

This post is the ninth post about Microsoft SQL Server R Services, and the eighth post that drills down into the internals of how it works.

So far in this series we have been looking at what happens in SQL Server as well as in the launchpad service when we execute sp_execute_external_script, and we still have no real "clue" as to where the R engine comes into play.

Well, hopefully that will change (at least a little bit) with this post, as here we will look at what happens when we leave the launchpad service.

This series is like candy to me.  It’s the best write-up I’ve seen so far about what’s really happening when you run SQL Server R Services.


Multiple Data Sets In External Scripts

Tomaz Kastrun shows a workaround to the “one data set” limit in sp_execute_external_script:

Some of the arguments of the procedure sp_execute_external_script are enumerated. This is valid for the input dataset, and as the name of the argument @input_data_1 suggests, one could easily (and quite reasonably) think that there might also be an @input_data_2 argument, and so on. Unfortunately, this is not true. The external procedure can hold only one T-SQL dataset, inserted through this parameter.

There are many reasons for that; one would be the cost of sending several datasets to the external process and back. In effect, this forces the user to rethink and pre-prepare the dataset (meaning, do all the data munging beforehand) prior to sending it into the external procedure.

But there are workarounds for passing an additional query or queries to sp_execute_external_script. I am not advocating this, and I strongly disagree with such usage, but here it is.

It does feel like a hinky solution, but sometimes you just need to get two data sets in.
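
As an illustration of the general shape such a workaround can take, here is a sketch that passes a second query in as an ordinary string parameter and runs it from inside the R script over an ODBC connection back to the server. The connection string, table names, and the RODBC approach are my assumptions and not necessarily the method from Tomaz's post:

-- Sketch only: a second "dataset" smuggled in as a query string parameter
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        library(RODBC)
        # Open a connection back to the database and run the second query ourselves
        ch <- odbcDriverConnect("Driver=SQL Server;Server=localhost;Database=MyDb;Trusted_Connection=yes;")
        second_ds <- sqlQuery(ch, second_query)
        odbcClose(ch)
        # Combine it with the one dataset that came in through @input_data_1
        OutputDataSet <- merge(InputDataSet, second_ds, by = "ProductID")
    ',
    @input_data_1 = N'SELECT ProductID, Quantity FROM dbo.Sales;',
    @params = N'@second_query nvarchar(max)',
    @second_query = N'SELECT ProductID, ProductName FROM dbo.Products;';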


Random Forests In R

Anish Sing Walia explains the basics of random forests and provides sample code in R:

Random forests are similar to a famous ensemble technique called bagging, but with a different tweak. In random forests, the idea is to decorrelate the several trees which are generated on different bootstrapped samples from the training data, and then we simply reduce the variance in the trees by averaging them.
Averaging the trees helps us to reduce the variance and also improve the performance of decision trees on the test set and eventually avoid overfitting.

The idea is to build lots of trees in such a way as to make the correlation between the trees smaller.

Random forests frequently give a good answer to classification problems, enough so that they make a nice starting point.
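
If you just want to see the mechanics in R, a minimal sketch with the randomForest package looks like this; the iris split and the tuning values are arbitrary choices for illustration, not from the linked article:

library(randomForest)

set.seed(42)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# ntree bootstrapped trees; each split considers only mtry randomly chosen predictors,
# which is the decorrelation trick that separates random forests from plain bagging
rf <- randomForest(Species ~ ., data = train, ntree = 500, mtry = 2, importance = TRUE)

pred <- predict(rf, newdata = test)
mean(pred == test$Species)   # test-set accuracy
importance(rf)               # variable importance measures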


Using seplyr Instead Of dplyr

John Mount explains seplyr and why it can be better for certain use cases than dplyr:

seplyr is a dplyr adapter layer that prefers “slightly clunkier” standard interfaces (or referentially transparent interfaces), which are actually very powerful and can be used to some advantage.

The above description and comparisons can come off as needlessly broad and painfully abstract. Things are much clearer if we move away from theory and return to our practical example.

Click through for a great example, and also read John’s comment on the Pascal-style assignment operator he uses.
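
To give a flavor of the difference, here is a small sketch assuming seplyr's group_by_se(); the mtcars summary is my own example:

library(dplyr)
library(seplyr)

# dplyr: the grouping column is captured by non-standard evaluation
mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

# seplyr: the grouping columns are plain character values, so the same
# pipeline is easy to parameterize over column names
grouping_cols <- c("cyl")
mtcars %>%
  group_by_se(grouping_cols) %>%
  summarize(mean_mpg = mean(mpg))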


R For Apache Impala

Ian Cook describes implyr, an R interface for Apache Impala:

dplyr provides a grammar of data manipulation, consisting of a set of verbs (including mutate(), select(), filter(), summarise(), and arrange()) that can be used together to perform common data manipulation tasks. The implyr package helps dplyr translate this grammar into Impala-compatible SQL commands. This gives R users access to Impala’s scale and speed on large distributed datasets while using the same familiar dplyr syntax that they are accustomed to using on local data frames and other data sources. R users can also choose to directly write SQL commands and execute them on Impala using implyr.

implyr builds upon recent work from RStudio and other contributors, including major updates to the packages dplyr and DBI, and new packages dbplyr and odbc. implyr together with these packages enables data scientists and data engineers to more easily interact with Impala through self-service data science tools like Cloudera Data Science Workbench.

It looks like this project is off to a good start already.
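
A minimal usage sketch might look like the following; the connection details and the table name are assumptions for illustration:

library(implyr)
library(odbc)
library(dplyr)

# Connect to Impala through an ODBC driver
impala <- src_impala(
  drv      = odbc::odbc(),
  driver   = "Cloudera ODBC Driver for Impala",
  host     = "impala-host.example.com",
  port     = 21050,
  database = "default"
)

flights <- tbl(impala, "flights")

# dplyr verbs are translated to Impala SQL and executed remotely
flights %>%
  filter(dep_delay > 0) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay)) %>%
  arrange(desc(mean_delay)) %>%
  collect()   # pull the (small) result into a local data frame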


R Model Compression

I have a post showing off some of the value of compressing R models:

So right now, we’re burning roughly 200K per model.  My stated goal is to be able to store several years’ worth of data for 10 million products.  Let’s say that I need 10 million products in ProductModel and 1 billion rows in ProductModelHistory.  That means that we’d end up with 1.86 TB of data in the ProductModel table and 186 TB in ProductModelHistory.  This seems…excessive.

As a result, I decided to try using the COMPRESS() function in SQL Server 2016.  The COMPRESS function simply uses GZip compression.  Yeah, there are compression algorithms which tend to be more compact (e.g., bz2 or 7z), but GZip is relatively CPU efficient and I can wrap my SQL statements with COMPRESS() and DECOMPRESS() and not have to change any calling code; I just need to update the two stored procedures I use to insert and then retrieve product models.

Most of the time, it’s not a big deal.  But once you start talking hundreds of gigabytes, or in my case a couple hundred terabytes, it’s definitely worth compressing this data.
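
For a rough sense of what the wrapping looks like, here is a sketch of the insert and retrieval side; the table, column, and procedure names are made up for illustration:

-- Store the serialized model GZip-compressed
CREATE PROCEDURE dbo.SaveProductModel
    @ProductID INT,
    @Model VARBINARY(MAX)
AS
BEGIN
    INSERT INTO dbo.ProductModel (ProductID, Model)
    VALUES (@ProductID, COMPRESS(@Model));
END;
GO

-- Decompress on the way out; calling code never sees the difference
CREATE PROCEDURE dbo.GetProductModel
    @ProductID INT
AS
BEGIN
    SELECT ProductID, DECOMPRESS(Model) AS Model
    FROM dbo.ProductModel
    WHERE ProductID = @ProductID;
END;
GO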


Neural Networks From Scratch

Ilia Karmanov explains neural nets and shows how to build one in R:

Hence, my motivation for this post is two-fold:

  1. Understanding (by writing from scratch) the leaky abstractions behind neural-networks dramatically shifted my focus to elements whose importance I initially overlooked. If my model is not learning I have a better idea of what to address rather than blindly wasting time switching optimisers (or even frameworks).
  2. A deep-neural-network (DNN), once taken apart into lego blocks, is no longer a black-box that is inaccessible to other disciplines outside of AI. It’s a combination of many topics that are very familiar to most people with a basic knowledge of statistics. I believe they need to cover very little (just the glue that holds the blocks together) to get an insight into a whole new realm.

Starting from a linear regression, we will work through the maths and the code all the way to a deep neural network (DNN) in the accompanying R notebooks, hopefully showing that very little of it is actually new information.

This is pretty detailed.  Karmanov mentions Andrej Karpathy, whose Hacker’s guide to Neural Networks is also a must-read on the topic.
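
As a taste of the from-scratch approach, here is a tiny one-hidden-layer network trained with plain gradient descent on an XOR-style toy problem; this is my own minimal sketch, not the code from the linked notebooks:

set.seed(1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Toy data: XOR
X <- matrix(c(0, 0, 0, 1, 1, 0, 1, 1), ncol = 2, byrow = TRUE)
y <- matrix(c(0, 1, 1, 0), ncol = 1)

n_hidden <- 4
W1 <- matrix(rnorm(2 * n_hidden, sd = 0.5), nrow = 2)
b1 <- matrix(0, nrow = 1, ncol = n_hidden)
W2 <- matrix(rnorm(n_hidden, sd = 0.5), nrow = n_hidden)
b2 <- 0
lr <- 0.5

for (i in 1:5000) {
  # Forward pass
  h     <- sigmoid(X %*% W1 + matrix(b1, nrow(X), n_hidden, byrow = TRUE))
  y_hat <- sigmoid(h %*% W2 + b2)

  # Backward pass (squared-error loss)
  d_out <- (y_hat - y) * y_hat * (1 - y_hat)
  d_hid <- (d_out %*% t(W2)) * h * (1 - h)

  # Gradient descent updates
  W2 <- W2 - lr * t(h) %*% d_out
  b2 <- b2 - lr * sum(d_out)
  W1 <- W1 - lr * t(X) %*% d_hid
  b1 <- b1 - lr * colSums(d_hid)
}

round(y_hat, 3)  # predictions should approach 0, 1, 1, 0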


Apache Drill Interface For R

Bob Rudis announces a new package on CRAN:

I’m extremely pleased to announce that the sergeant package is now on CRAN or will be hitting your local CRAN mirror soon.

sergeant provides JDBC, DBI and dplyr/dbplyr interfaces to Apache Drill. I’ve also wrapped a few goodies into the dplyr custom functions that work with Drill and if you have Drill UDFs that don’t work “out of the box” with sergeant’s dplyr interface, file an issue and I’ll make a special one for it in the package.

Seems quite useful if you’re working with MapR.  H/T R-bloggers
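
A minimal usage sketch, assuming a drillbit running locally and Drill's bundled employee.json sample; the aggregation itself is my own example:

library(sergeant)
library(dplyr)

db  <- src_drill("localhost")            # connect to the drillbit
emp <- tbl(db, "cp.`employee.json`")     # sample dataset shipped with Drill

emp %>%
  group_by(position_title) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  collect()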


Better Grouping With dplyr

John Mount builds a function to improve upon the group-by to mutate model in dplyr:

The advantages of the shorthand are:

  • The analyst only has to specify the grouping column once.
  • The data (mtcars) enters the pipeline only once.
  • The analyst doesn’t have to start thinking about joins immediately.

Frankly I’ve never liked the shorthand. I feel it is a “magic extra” that a new user would have no way of anticipating from common use of group_by() and summarize(). I very much like the idea of wrapping this important common use case into a single verb. Adjoining “windowed” or group-calculated columns is a common and important step in analysis, and well worth having its own verb.

Below is our attempt at elevating this pattern into a packaged verb.

Click through for the script.  I’d like to see something like this make its way into dplyr.
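
For context, this is the base group_by()/mutate() pattern the new verb wraps up, shown here as a plain dplyr sketch rather than John's packaged function:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  mutate(mean_mpg_by_cyl = mean(mpg)) %>%  # group summary adjoined to every row
  ungroup() %>%
  head()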
