This time, amit suggested I do some hierarchical clustering of the votes. So here goes a very dirty first attempt…
Check this out as a case study in data analysis.
Monte Carlo analysis is a great way to explore the impact of input variable uncertainty on the results of engineering equations, and with vector variables and distribution and sampling functions at its core, R is a natural platform for this analysis.
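The idea can be sketched in a few lines. The example below is a rough illustration in Python rather than R, and everything in it is assumed for the sake of the example: the equation (stress = force / area), the input distributions, and their parameters do not come from the linked post.

```python
import random
import statistics

def simulate_stress(n_samples=10_000, seed=42):
    """Propagate input uncertainty through stress = force / area."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = []
    for _ in range(n_samples):
        force = rng.gauss(1000.0, 50.0)    # hypothetical load in newtons
        area = rng.uniform(0.009, 0.011)   # hypothetical cross-section in m^2
        results.append(force / area)
    return results

stresses = simulate_stress()
mean_stress = statistics.mean(stresses)
sd_stress = statistics.stdev(stresses)
print(f"mean = {mean_stress:.0f} Pa, sd = {sd_stress:.0f} Pa")
```

The spread of the simulated results gives a sense of how much the input uncertainty matters for the output.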
Check out his app, which has a link to the code. Amazingly, this is only 107 lines of code.
DeployR Enterprise is designed to deliver analytics solutions at scale to whoever needs them, inside or outside the enterprise. It also guarantees secure delivery of your analytics via DeployR web services. These secure web services integrate seamlessly with existing enterprise security solutions: Single Sign-On, LDAP, Active Directory, PAM, and Basic Authentication. They can enforce access privileges already defined by your IT department for existing enterprise users, and they can safely support anonymous users when needed.
There’s nothing groundbreaking here: it’s TLS (to encrypt network transmissions) and LDAPS (to control authentication and authorization). That there’s nothing groundbreaking is a good thing—that means companies will have most of the infrastructure in place to support this.
The first and most common measure of dispersion is called ‘Range’. The range is just the difference between the maximum and minimum values in the dataset. It tells you how much gap there is between the two and therefore how wide the dataset is in terms of its values. It is, however, quite misleading when you have outliers in the data. A single value that is very large or very small can skew the range, so a wide range does not necessarily mean that your values are spread all the way from the minimum to the maximum.
To reduce this problem with outliers, a second variation of the range called the Inter-Quartile Range, or IQR, is used. The IQR is calculated by sorting the values in ascending order and dividing the dataset into 4 equal parts. The maximum value of the first part (the first quartile) is then subtracted from the maximum value of the third part (the third quartile). Because the IQR ignores the lowest and highest quarters of the data, it describes the spread of the middle half of the values and is much less sensitive to outliers than the range.
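A small sketch shows the difference between the two measures. It is written in Python rather than R or T-SQL, and the data is made up for illustration, with one deliberate outlier. Quartile conventions vary between tools, so the exact IQR may differ slightly from what another package reports.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical data; 100 is an outlier

# Range: difference between the largest and smallest value.
data_range = max(values) - min(values)

# statistics.quantiles with n=4 returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print("range:", data_range)
print("IQR:", iqr)
```

The range is dominated by the single outlier, while the IQR reflects only the middle half of the values.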
Just like her previous post, this one also includes an example built for SQL Server R Services.
My goal is to do some of the things that I did in my Touching on Advanced Topics post. Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below. As a result, I was only able to do some—but not all—of the anticipated work. I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.
With that in mind, let’s start messing around.
SparkR is a bit of a mindset change from traditional R.
Here’s a little puzzle that might shed some light on some apparently confusing behaviour by missing values (NAs) in R:
What is NA^0 in R?
You can get the answer easily by typing NA^0 at the R command line: the result is 1.
But the interesting question that arises is: why is it 1? Most people might expect that the answer would be NA, like most expressions that include NA. But here’s the trick to understanding this outcome: think of NA not as a number, but as a placeholder for a number that exists, but whose value we don’t know.
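The same reasoning can be sketched outside R. Python has no direct equivalent of R's NA, so the example below uses a floating-point NaN as a loose stand-in (an assumption for illustration only; NaN and NA are not the same thing). The point is that every candidate number gives 1 when raised to the power 0, so the unknown value does not matter.

```python
import math

# Whatever number the placeholder stands for, raising it to the power 0 gives 1.
candidates = [-3.5, 0, 1, 42, 1e308, math.inf, -math.inf]
all_ones = all(x ** 0 == 1 for x in candidates)

unknown = math.nan  # stand-in for "a number we do not know"; not R's NA
power_result = unknown ** 0    # the answer does not depend on the unknown value
product_result = unknown * 0   # depends on the unknown value (inf * 0 is undefined)

print(all_ones)
print(power_result)
print(math.isnan(product_result))
```

Multiplying by zero stays unknown because the answer would differ if the hidden value were infinite, whereas raising to the power zero gives the same answer in every case.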
Definitely read the comments on this one.
With R integration into SQL Server 2016 we can pull an R script and integrate it rather easily. I will be covering all 3 approaches. I am using a small dataset – a single table with 915 rows, with a SQL Server 2016 installation and RStudio. The complexities of doing this type of analysis in the real world with bigger datasets involve setting various options for performance and dealing with memory issues – because R is very memory intensive and single threaded.
My table and the data it contains can be created with scripts here. For this specific post I used just one column in the table – age. For further posts I will be using the other fields such as country and gender.
Mala compares T-SQL versus R for calculating minimum, maximum, mean, and mode. She wraps the post up by showing how to call her R code via T-SQL using SQL Server R Services.
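For readers who want a feel for the four measures before reading the post, here is a minimal sketch in Python (not Mala's T-SQL or R code). The ages list is a hypothetical sample, not the data from her table.

```python
import statistics

ages = [23, 35, 35, 41, 29, 35, 52, 29]  # hypothetical sample of ages

summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "mode": statistics.mode(ages),  # most frequently occurring value
}

for name, value in summary.items():
    print(f"{name}: {value}")
```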
Remember chemistry class in high school or college? You might remember having to keep a lab notebook for your experiments. The purpose of this notebook was two-fold: first, so you could remember what you did and why you did each step; second, so others could repeat what you did. A well-done lab notebook has all you need to replicate an experiment, and independent replication is a huge part of what makes hard sciences “hard.”
Take that concept and apply it to statistical analysis of data, and you get the type of notebook I’m talking about here. You start with a data set, perform cleansing activities, potentially prune elements (e.g., getting rid of rows with missing values), calculate descriptive statistics, and apply models to the data set.
I didn’t realize just how useful notebooks were until I started using them regularly.
In such a case, if a developer were to implement Dijkstra’s algorithm to compute the shortest path within the database using T-SQL, then they could use approaches like the one at Hans Oslov’s blog. Hans offers a clever implementation using recursive CTEs, which functionally does the job well. This is a fairly complex problem for the T-SQL language, and Hans’ implementation does a great job of modelling a graph data structure in T-SQL. However, given that T-SQL is mostly a transaction and query processing language, this implementation isn’t very performant, as you can see below.
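For reference, Dijkstra's algorithm itself is short in a general-purpose language. The sketch below is a plain Python version using a priority queue. It is not the implementation from either post, and the graph and its weights are made up for illustration.

```python
import heapq

def dijkstra(graph, source):
    """Return the shortest distance from source to every reachable node."""
    distances = {source: 0}
    queue = [(0, source)]  # min-heap of (distance so far, node)
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances.get(node, float("inf")):
            continue  # stale entry; a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return distances

# Hypothetical directed graph: node -> {neighbor: edge weight}
graph = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 3},
    "C": {"B": 2, "D": 7},
    "D": {},
}
shortest = dijkstra(graph, "A")
print(shortest)
```

The priority queue is what keeps the algorithm efficient, and it is the kind of structure that is awkward to express with recursive CTEs.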
The important thing to remember is that these technologies tend to complement each other rather than supplant one another.
The first step is to load the RevoScaleR library. This is an amazing library that lets you create scalable and performant applications with R.
Then a connection string is defined, in my case using Windows Authentication. If you want to use SQL Server authentication, the user name and password are needed.
We define a local folder as the compute context.
RxInSQLServer: generates a SQL Server compute context using SQL Server R Services (documentation)
Sample query: I already prepared the dataset in the view. This is a best practice because it reduces the size of the query in the R code, and for me it is also easier to maintain.
I think there’s a lot of value in learning R, regardless of whether you have “data analyst” in your role or job title.