# Category: R

My goal is to do some of the things that I did in my Touching on Advanced Topics post.  Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below.  As a result, I was only able to do some—but not all—of the anticipated work.  I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.

With that in mind, let’s start messing around.

SparkR is a bit of a mindset change from traditional R.
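As a rough illustration of that mindset change, here's a minimal sketch (assuming a Spark 2.x session, which the post itself doesn't specify). SparkR DataFrames are distributed and lazily evaluated, so familiar-looking verbs behave differently than they do against a local data.frame:

```r
library(SparkR)
sparkR.session()

# createDataFrame() turns a local data.frame into a distributed SparkR DataFrame
df <- createDataFrame(faithful)

# filter() here is SparkR's lazy, distributed filter, not base R subsetting;
# nothing is computed until we ask for results with head() or collect()
head(filter(df, df$waiting > 70))
```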

Here’s a little puzzle that might shed some light on some apparently confusing behaviour by missing values (NAs) in R:

What is NA^0 in R?

You can get the answer easily by typing at the R command line:

```r
> NA^0
[1] 1
```

But the interesting question that arises is: why is it 1? Most people would expect the answer to be NA, as with most expressions that include NA. Here's the trick to understanding this outcome: think of NA not as a number, but as a placeholder for a number that exists, but whose value we don't know. Since any number raised to the power of zero is 1, the result is 1 no matter which number NA stands for.
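The same placeholder logic explains a couple of related results, which follow from R's arithmetic rules (1^y and y^0 are always 1):

```r
NA^1  # NA -- an unknown number to the first power is still unknown
1^NA  # 1  -- one raised to any power is 1, whether we know the power or not
0^NA  # NA -- 0^y is 1, 0, or Inf depending on y, so the result is unknown
```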

With R integration in SQL Server 2016, we can pull in an R script and integrate it rather easily. I will be covering all 3 approaches. I am using a small dataset (a single table with 915 rows), a SQL Server 2016 installation, and RStudio. The complexities of doing this type of analysis in the real world with bigger datasets involve setting various options for performance and dealing with memory issues, because R is very memory-intensive and single-threaded.

My table and the data it contains can be created with scripts here. For this specific post I used just one column in the table: age. In further posts I will be using the other fields, such as country and gender.

Mala compares T-SQL versus R for calculating minimum, maximum, mean, and mode.  She wraps the post up by showing how to call her R code via T-SQL using SQL Server R Services.
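Of the four, the statistical mode is the interesting one on the R side, since base R has no built-in function for it. A minimal sketch of the R half (with a made-up ages vector standing in for the post's age column) might look like this:

```r
ages <- c(34, 27, 27, 45, 51, 27, 34)   # hypothetical stand-in for the age column

min(ages)
max(ages)
mean(ages)

# Base R's mode() reports an object's storage mode, not the statistical mode,
# so a small helper is needed
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat_mode(ages)   # 27
```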

Remember chemistry class in high school or college?  You might remember having to keep a lab notebook for your experiments.  The purpose of this notebook was two-fold:  first, so you could remember what you did and why you did each step; second, so others could repeat what you did.  A well-done lab notebook has all you need to replicate an experiment, and independent replication is a huge part of what makes hard sciences “hard.”

Take that concept and apply it to statistical analysis of data, and you get the type of notebook I’m talking about here.  You start with a data set, perform cleansing activities, potentially prune elements (e.g., getting rid of rows with missing values), calculate descriptive statistics, and apply models to the data set.
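In R notebook terms, that workflow can be as simple as this sketch against the built-in airquality dataset (the dataset and model here are just placeholders for whatever you're actually analyzing):

```r
data(airquality)

clean <- na.omit(airquality)                     # prune rows with missing values
summary(clean)                                   # descriptive statistics
model <- lm(Ozone ~ Temp + Wind, data = clean)   # apply a simple model
summary(model)
```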

I didn’t realize just how useful notebooks were until I started using them regularly.

In such a case, a developer implementing Dijkstra's algorithm to compute shortest paths within the database using T-SQL could use an approach like the one on Hans Oslov's blog. Hans offers a clever implementation using recursive CTEs, which functionally does the job well. This is a fairly complex problem for the T-SQL language, and Hans' implementation does a great job of modelling a graph data structure in T-SQL. However, given that T-SQL is mostly a transaction and query processing language, this implementation isn't very performant, as you can see below.
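For contrast, the equivalent computation in R is a few lines with the igraph package (the edge list below is invented purely for illustration):

```r
library(igraph)

# A tiny weighted graph made up for the example
edges <- data.frame(from   = c("A", "A", "B", "C"),
                    to     = c("B", "C", "D", "D"),
                    weight = c(1, 4, 2, 1))
g <- graph_from_data_frame(edges, directed = FALSE)

# Weighted shortest path from A to D (Dijkstra under the hood for
# non-negative weights)
shortest_paths(g, from = "A", to = "D", weights = E(g)$weight)$vpath
```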

The important thing to remember is that these technologies tend to complement rather than supplant each other.

The first step is to load the RevoScaleR library. This is an amazing library that allows you to create scalable and performant applications with R.

Then a connection string is defined, in my case using Windows Authentication. If you want to use SQL Server authentication, the user name and password are needed.
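A minimal sketch of those first two steps might look like this (the server and database names are placeholders, and the SQL Server authentication variant is shown commented out):

```r
library(RevoScaleR)

# Windows Authentication connection string (server and database are placeholders)
conn_str <- "Driver=SQL Server;Server=MYSERVER;Database=MyDatabase;Trusted_Connection=True"

# For SQL Server authentication, supply the user name and password instead:
# conn_str <- "Driver=SQL Server;Server=MYSERVER;Database=MyDatabase;Uid=myuser;Pwd=mypassword"
```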

We define a local folder for the compute context to use.

RxInSqlServer generates a SQL Server compute context using SQL Server R Services (see the documentation).
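Putting those two steps together, something along these lines (reusing conn_str from the sketch above; the share folder path is a placeholder):

```r
# Local folder the compute context uses for exchanging intermediate files
share_dir <- "C:\\RWorkDir"

# SQL Server compute context: the R code runs inside SQL Server R Services
sql_cc <- RxInSqlServer(connectionString = conn_str,
                        shareDir = share_dir,
                        wait = TRUE,
                        consoleOutput = FALSE)

rxSetComputeContext(sql_cc)
```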

Sample query: I already prepared the dataset in a view. This is a best practice, as it reduces the size of the query in the R code, and for me it is also easier to maintain.
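With the compute context set, the view can then be exposed to R as an RxSqlServerData source. The view and column names below are placeholders:

```r
# Point RevoScaleR at the prepared view; keeping the query this small is
# exactly why the dataset is prepared in a view on the SQL Server side
sql_data <- RxSqlServerData(
  sqlQuery = "SELECT age, country, gender FROM dbo.vw_MyPreparedView",
  connectionString = conn_str
)

rxGetVarInfo(sql_data)               # inspect the columns R will see
rxSummary(~ age, data = sql_data)    # descriptive statistics, computed in SQL Server
```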

I think there’s a lot of value in learning R, regardless of whether you have “data analyst” in your role or job title.

Enter the Microsoft R Client. It includes Microsoft R Open and adds in some of the ScaleR functions, which makes processing data faster and more efficient. And again, it's a full R environment – you can write and run code, right there on your desktop. But the important bit is that it can connect to a Microsoft R Server (MRS) by setting something called the "Compute Context", which tells the R environment to run on a more powerful, scalable server environment, like you may be used to with SQL Server.

The naming is a bit of a head-scratcher, to be honest.

It’s not uncommon for tests to be written at the get-go and then forgotten about. Remember that as code changes or incorrect behavior is found, new tests need to be written or existing tests need to be modified. Possibly worse than having no tests is having a bunch of tests spitting out false positives. This is because humans are prone to habituation and desensitization. It’s easy to become habituated to false positives to the point where we no longer pay attention to them.

Temporarily disabling tests may be acceptable in the short term. A more strategic solution is to optimize your test writing. The easier it is to create and modify tests, the more likely they will be correct and continue to provide value. For my testing, I generally write code to automate a lot of wiring to verify results programmatically.
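In R, that usually means leaning on the testthat package. A minimal sketch (double_values() is a made-up function standing in for whatever is actually under test):

```r
library(testthat)

# Hypothetical function under test
double_values <- function(x) x * 2

test_that("double_values doubles numeric input", {
  expect_equal(double_values(c(1, 2, 3)), c(2, 4, 6))
  expect_true(is.na(double_values(NA)))
})
```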

I went into this article with almost no idea how to test R code.  I still don't… but this article does help.  I recommend reading it if you want to write production-quality R code.

In this post we'll try to replicate some of the charts created by the Federal Reserve, which visualize some well-known macroeconomic indicators. We'll also showcase the new Plotly 4.0 syntax.

This is a very code-heavy blog post and is a good way to learn about plotly.
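If you just want a feel for the Plotly 4.0 formula syntax before diving in, a minimal sketch looks like this (using ggplot2's economics dataset rather than the Fed data from the post):

```r
library(plotly)

# The economics dataset ships with ggplot2, which plotly imports
data(economics, package = "ggplot2")

# Plotly 4.0 style: map columns with ~ formulas and build up layers with %>%
plot_ly(economics, x = ~date, y = ~unemploy) %>%
  add_lines() %>%
  layout(title = "US unemployment (thousands)",
         yaxis = list(title = "Unemployed"))
```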