Press "Enter" to skip to content

Category: R

Sparklyr On EMR

Tom Zeng shows how to use sparklyr on Amazon ElasticMapReduce:

The recently released sparklyr package by RStudio has made processing big data in R a lot easier. sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API.

You can also install sparklyr locally and point to a Spark cluster.

Comments closed

R Services Resource Utilization

Ginger Grant shows off some R Services reports to see how hard the developers are battering your poor servers with their R scripts:

R Services – Extended Events is also not a report but a list of all the extended events that are available for R Services. This is a handy bit of information, which can be a great reference tool for extended events monitoring. R Services – Packages lists the packages which are currently installed on SQL Server. When people write R, many lot of different packages are used within the script. Prior to running a package, check the information on this report to ensure the libraries used are installed on SQL Server. If the library is missing the code will not work. R Services – Resource Usage is a great way to see how R has been configured to run on the server. Notice I have created an External Pool for R. This is a configuration recommended by Microsoft to better monitor your R Services.

Click through for more information, and grab the reports from Microsoft’s Github repo.

Comments closed

Linear Models

Andrea Spano, et al, are starting a new book:

This chapter is an introduction to the first section of the book, Linear Models, and contain some theoretical explanation and lots of examples. At the end of the chapter you will find two summary tables with Linear model formulae and functions in R and Common R functions for inference.

The book is just getting started, but you can get it from the Quantide website.  In the meantime, there are two other books on learning R and developing in R.  These books are licensed Creative Commons, so they’re free to read and share.

Comments closed

Machine Learning Algorithms In R

Ginger Grant has a list of machine learning algorithms and their implementations in R:

Often times determining which algorithm to use can take a while.  Here is a pretty good flowchart for determining which algorithm should be used given some examples of what the desired outcomes and data contain. The diagram lists the algorithms, which are implemented in Azure ML.  The same algorithms can be implemented in R.  In R there are libraries to help with nearly every task.  Here’s a list of libraries and their accompanying links which can be used in Machine Learning.  This list is no means comprehensive as there are libraries and functions other than the ones listed here, but if you are trying to write a Machine Learning Experiment in R, and are looking at the flowchart, these R functions and Libraries will provide the tools to do the types of Machine Learning Analysis listed.

I think algorithm determination is one of the most difficult parts of machine learning.  Even if you don’t mean to go there, the garden of forking paths is dangerous.

Comments closed

Association Rules

Tomaz Kastrun discusses product variants:

To sum up, association rules is a great and powerful algorithm for finding the correlations between items and the fact that you can use this straight from SSMS, it just gives me goosebumps. Currently just the performance is a bit of a drawback. Also comparing this algorithm to Analysis services (SSAS) association rules, there are many advantages on R side, because of maneuverability and extracting the data to T-SQL, but keep in mind, SSAS is still very awesome and powerful tool for statistical analysis and data predictions.

Figuring out variations after the fact is an all-too-common task, and this is a good way of getting some ideas on how to do that.

Comments closed

SQL Server R Services Memory Usage

Ginger Grant looks at how SQL Server R Services handles memory allocation:

While R is an open source language, there are a number of different versions of R and each handles memory a little differently. Knowing which version is being used is important, especially when the code is going to be migrated to a server. As part of a SQL Server implementation, there are three different versions of R which come into play. The first is standard open source R, commonly known as CRAN R. This is the standard open source version of R which runs code in memory and is single threaded. The next version which will be installed as part of a SQL Server Installation is Microsoft R Open. This version of R was written to take advantage of the Intel Math Kernel Libraries [MLK]. Using the libraries speeds up many statistical calculations which use matrix operations. It also adds multi-threading capability to R as the rewrite provides the ability to use all available cores and processors and process in parallel. More information on how it works and how much faster Microsoft R Open is compared to standard R is available here. To use Microsoft R Open, once it is installed, in Rstudio should automatically start using it. To check out what version of R that is in use, within R Studio, go to Tools->Global Options and look at the R version.

If you’re concerned about R Services taking up too much server memory, you should look at Resource Governor.

Comments closed

Interactive Graphics With ggiraph

David Smith sheds some light on the ggiraph project:

R’s ggplot2 package is a well-known tool for producing beautiful static data visualizations that you can include in a printed report. But what if you want to include a ggplot2 graphic on a webpage and provide the ability for the user to interact with the data? The ggiraph package by David Gohel  (available for installation via CRAN). WIth ggiraph, you can take an existing ggplot2 bar chart, scatterplot, boxplot, map, or many other types of chart and add one or both of the following iteractions:

  • Display a tooltip of your choice (e.g. data values or labels) when the cursor hovers over sections of the chart

  • Perform an action (a javascript function you provide: jump to another page, for example) when the viewer clicks on an element of the chart

I like it.

Comments closed

T-SQL And R Performance Comparisons

Tomaz Kastrun does several performance comparisons between various R packages and T-SQL constructs:

Couple of packages I will mention for data manipulations are plyr, dplyr and data.table and compare the execution time, simplicity and ease of writing with general T-SQL code and RevoScaleR package. For this blog post I will use R packagedplyr and T-SQL with possibilites of RevoScaleR computation functions.

My initial query will be. Available in WideWorldImportersDW database. No other alterations have been done to underlying tables (fact.sale or dimension.city).

Read on for code and conclusions.  I don’t think there are any shocking conclusions:  the upshot is to filter data as early as possible.

Comments closed

NetworkD3

Vessy combines Javascript and R to visualize networks:

The networkD3 package provides a function called igraph_to_networkD3, that uses an igraph object to convert it into a format that networkD3 uses to create a network representation. As I used igraph object to store my network, including node and edge properties, I was hoping that I may only need to use this function to create a visualization of my network. However, this function does not work exactly like that (which is not that surprising, given the differences in how D3.js works and how igraph object is defined). Instead, it extracts lists of nodes and edges from the igraph object, but not the information about all node and edges properties (the exception is a priori specified information about nodes membership groups/clusters, which can be derived from one or more network properties, e.g., node degree). Additionally, the igraph_to_networkD3 function does not plot the network itself, but only extracts parameters that are later used in theforceNetwork function that plots the network.

This is the kind of thing I want to see when working with network data.  It doesn’t necessarily scale, but given how well the human eye tracks relationships, this is very useful.

Comments closed

Sparklyr

RStudio has announced an interface between R and Apache Spark, named sparklyr:

Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:

  • Interactively manipulate Spark data using both dplyr and SQL (via DBI).

  • Filter and aggregate Spark datasets then bring them into R for analysis and visualization.

  • Orchestrate distributed machine learning from R using either Spark MLlib or H2O SparkingWater.

  • Create extensions that call the full Spark API and provide interfaces to Spark packages.

  • Integrated support for establishing Spark connections and browsing Spark DataFrames within the RStudio IDE.

So what’s the difference between sparklyr and SparkR?

This might be the package I’ve been awaiting.

Comments closed