Press "Enter" to skip to content

Category: R

Enterprise R Security

Ramkumar Chandrasekeran discusses DeployR, an enterprise security model for R:

DeployR Enterprise is designed to deliver analytics solutions at scale to whoever needs them, inside or outside the enterprise. It also guarantees secure delivery of your analytics via DeployR web services. These secure web services integrate seamlessly with existing enterprise security solutions: Single Sign-On, LDAP, Active Directory, PAM, and Basic Authentication. They can enforce access privileges already defined by your IT department for existing enterprise users and can also safely support anonymous users when needed.

There’s nothing groundbreaking here:  it’s TLS (to encrypt network transmissions) and LDAPS (to control authentication and authorization).  That there’s nothing groundbreaking is a good thing—that means companies will have most of the infrastructure in place to support this.


Range And Variance

Mala Mahadevan looks at calculating range, variance, and standard deviation in R and T-SQL:

The first and most common measure of dispersion is called the ‘Range’. The range is just the difference between the maximum and minimum values in the dataset. It tells you how much gap there is between the two and therefore how wide the dataset is in terms of its values. It is, however, quite misleading when you have outliers in the data. A single value that is very large or very small can skew the range, even though the rest of the values do not really span the minimum to the maximum.

To reduce this kind of issue with outliers, a second variation of the range, called the Inter-Quartile Range, or IQR, is used. The IQR is calculated by sorting the values in ascending order and dividing the dataset into four equal parts. The maximum values of the first and third parts (the first and third quartiles) are then subtracted from each other. Because the IQR looks at the near-bottom and near-top of the bulk of the data rather than the extremes, the value it gives is a better reflection of the dataset’s true spread.
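
As a quick illustration in plain R (not Mala's code; the values are made up):

# Illustrative values with one large outlier
ages <- c(23, 25, 28, 31, 34, 38, 41, 45, 97)

# Range: the gap between the maximum and minimum values
max(ages) - min(ages)   # 74, inflated by the single outlier

# Inter-quartile range: the spread of the middle 50% of the data
IQR(ages)               # quantile(ages, 0.75) - quantile(ages, 0.25)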

Just like her previous post, this one also includes an example built for SQL Server R Services.


SparkR + Zeppelin

I take a look at using SparkR and Zeppelin:

My goal is to do some of the things that I did in my Touching on Advanced Topics post.  Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below.  As a result, I was only able to do some—but not all—of the anticipated work.  I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.

With that in mind, let’s start messing around.

SparkR is a bit of a mindset change from traditional R.
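
As a small taste of that mindset change, here is a sketch using the SparkR 2.x API (in Zeppelin the Spark session is normally created for you, so the explicit setup below is an assumption):

library(SparkR)
sparkR.session()

# A SparkR DataFrame is distributed and lazily evaluated, unlike a local data.frame
df <- as.DataFrame(faithful)
waits <- summarize(groupBy(df, df$waiting), count = n(df$waiting))

# Nothing is computed on the cluster until you ask for results
head(arrange(waits, desc(waits$count)))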


Missing Values In R

David Smith explains NA values in R:

Here’s a little puzzle that might shed some light on some apparently confusing behaviour by missing values (NAs) in R:

What is NA^0 in R?

You can get the answer easily by typing at the R command line:

> NA^0
[1] 1

But the interesting question that arises is: why is it 1? Most people might expect that the answer would be NA, like most expressions that include NA. But here’s the trick to understanding this outcome: think of NA not as a number, but as a placeholder for a number that exists, but whose value we don’t know.
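
A few more expressions make the “unknown but existing value” idea concrete:

NA^0        # [1] 1     - x^0 is 1 no matter what x is
TRUE | NA   # [1] TRUE  - TRUE or anything is TRUE
FALSE & NA  # [1] FALSE - FALSE and anything is FALSE
NA + 1      # [1] NA    - here the result genuinely depends on the unknown value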

Definitely read the comments on this one.


Descriptive Statistics With SQL Server And R

Mala Mahadevan digs into descriptive statistics:

With R integration into SQL Server 2016 we can pull an R script and integrate it rather easily. I will be covering all 3 approaches. I am using a small dataset – a single table with 915 rows, with a SQL Server 2016 installation and R Studio. The complexities of doing this type of analysis in the real world with bigger datasets involve setting various options for performance and dealing with memory issues – because R is very memory intensive and single threaded.

My table and the data it contains can be created with scripts here. For this specific post I used just one column in the table – age. For further posts I will be using the other fields such as country and gender.

Mala compares T-SQL versus R for calculating minimum, maximum, mean, and mode.  She wraps the post up by showing how to call her R code via T-SQL using SQL Server R Services.
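
For reference, those four statistics in plain R look something like this (illustrative values; R has no built-in mode function, so the last line is a common idiom rather than anything from Mala's post):

ages <- c(23, 25, 25, 31, 34, 38, 41, 45, 52)

min(ages)    # minimum
max(ages)    # maximum
mean(ages)   # arithmetic mean

# Most frequent value: tabulate and take the largest count
as.numeric(names(which.max(table(ages))))   # 25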


The Basics Of Notebooks

I have a quick walkthrough of notebooks:

Remember chemistry class in high school or college?  You might remember having to keep a lab notebook for your experiments.  The purpose of this notebook was two-fold:  first, so you could remember what you did and why you did each step; second, so others could repeat what you did.  A well-done lab notebook has all you need to replicate an experiment, and independent replication is a huge part of what makes hard sciences “hard.”

Take that concept and apply it to statistical analysis of data, and you get the type of notebook I’m talking about here.  You start with a data set, perform cleansing activities, potentially prune elements (e.g., getting rid of rows with missing values), calculate descriptive statistics, and apply models to the data set.
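
A toy version of that workflow, using a built-in R dataset rather than anything from my post:

# Start with a data set
data(airquality)

# Cleansing: prune rows with missing values
aq <- na.omit(airquality)

# Descriptive statistics on one variable
summary(aq$Ozone)

# Apply a simple model to the cleaned data
fit <- lm(Ozone ~ Temp + Wind, data = aq)
summary(fit)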

I didn’t realize just how useful notebooks were until I started using them regularly.


Shortest Paths

Sanjay Mishra and Arvind Shaymsundar tie R to SQL Server to solve a pathfinding problem faster:

In such a case, if a developer were to implement Dijkstra’s algorithm to compute the shortest path within the database using T-SQL, then they could use approaches like the one at Hans Oslov’s blog. Hans offers a clever implementation using recursive CTEs, which functionally does the job well. This is a fairly complex problem for the T-SQL language, and Hans’ implementation does a great job of modelling a graph data structure in T-SQL. However, given that T-SQL is mostly a transaction and query processing language, this implementation isn’t very performant, as you can see below.
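
To give a sense of the R side of this, here is a hedged sketch using the igraph package (the edge data are made up, and this may not be the exact package the authors used):

library(igraph)

# A tiny weighted, directed graph
edges <- data.frame(
  from   = c("A", "A", "B", "C"),
  to     = c("B", "C", "D", "D"),
  weight = c(1, 4, 2, 1)
)
g <- graph_from_data_frame(edges, directed = TRUE)

# Dijkstra-style shortest path from A to D using the edge weights
shortest_paths(g, from = "A", to = "D", weights = E(g)$weight)$vpath   # A -> B -> D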

The important thing to remember is that these technologies tend to complement, rather than supplant, each other.


Introduction To R

Paul Hernandez has an introduction to using the R client and RODBC to connect to SQL Server:

The first step is to load the RevoScaleR library. This is an amazing library that allows you to create scalable and performant applications with R.

Then a connection string is defined, in my case using Windows Authentication. If you want to use SQL Server authentication, the user name and password are needed.

We define a local folder as the compute context.

RxInSqlServer: generates a SQL Server compute context using SQL Server R Services (see the documentation).

Sample query: I already prepared the dataset in the view. This is a best practice in order to reduce the size of the query in the R code, and for me it is also easier to maintain.
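
A condensed sketch of those steps (server, database, and view names are placeholders, not Paul's actual code):

library(RevoScaleR)

# Connection string using Windows Authentication
connStr <- "Driver=SQL Server;Server=MYSERVER;Database=MyDB;Trusted_Connection=True"

# Local folder used to exchange data with the compute context
shareDir <- "C:\\Temp\\RWorkspace"

# Generate a SQL Server compute context and make it active
cc <- RxInSqlServer(connectionString = connStr, shareDir = shareDir)
rxSetComputeContext(cc)

# The view already shapes the data, which keeps the query in the R code small
sqlData <- RxSqlServerData(sqlQuery = "SELECT * FROM dbo.vw_SampleData",
                           connectionString = connStr)
rxSummary(~ ., data = sqlData)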

I think there’s a lot of value in learning R, regardless of whether you have “data analyst” in your role or job title.


Microsoft R Client

Buck Woody discusses whether Microsoft R Client really is a client:

Enter the Microsoft R Client. It includes Microsoft R Open, and adds in some of the ScaleR functions, which makes processing data faster and more efficient. And again, it’s a full R environment – you can write and run code, right there on your desktop. But the important bit is that it can connect to a Microsoft R Server (MRS) by setting something called the “Compute Context”, which tells the R environment to run on a more powerful, scalable server environment, like you may be used to with SQL Server.
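
A minimal illustration of that switch (the connection string is a placeholder):

library(RevoScaleR)

# By default, code runs locally on the desktop
rxSetComputeContext(RxLocalSeq())

# Point the same code at a remote server instead, here SQL Server R Services
remote <- RxInSqlServer(
  connectionString = "Driver=SQL Server;Server=MRSERVER;Database=MyDB;Trusted_Connection=True")
rxSetComputeContext(remote)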

The naming is a bit of a head-scratcher, to be honest.


Writing Good Tests In R

Brian Rowe discusses testing strategy in R:

It’s not uncommon for tests to be written at the get-go and then forgotten about. Remember that as code changes or incorrect behavior is found, new tests need to be written or existing tests need to be modified. Possibly worse than having no tests is having a bunch of tests spitting out false positives. This is because humans are prone to habituation and desensitization. It’s easy to become habituated to false positives to the point where we no longer pay attention to them.

Temporarily disabling tests may be acceptable in the short term. A more strategic solution is to optimize your test writing. The easier it is to create and modify tests, the more likely they will be correct and continue to provide value. For my testing, I generally write code to automate a lot of wiring to verify results programmatically.

I started this article with almost no idea how to test R code.  I still don’t…but this article does help.  I recommend reading it if you want to write production-quality R code.
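
For anyone in the same boat, the bare mechanics look something like this (using the testthat package, which is my assumption; Brian's post is about strategy rather than any specific framework):

library(testthat)

square <- function(x) x^2

test_that("square handles typical and edge-case inputs", {
  expect_equal(square(3), 9)
  expect_equal(square(-2), 4)
  expect_equal(square(0), 0)
})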
