R – Page 145 – Curated SQL

Installing R Packages In SQL Server

Published 2016-08-04 by Kevin Feasel

Tomaz Kastrun shows how to install packages in SQL Server R Services:

Julie Koesmarno made a great post on installing R packages. Please follow this post. Also Microsoft suggests the following way to install R packages on MSDN.

Since I wanted to be able to have packages installed directly from SQL Server Management Studio (SSMS), here is yet another way to do it. I have used xp_cmdshell to install any additional package for my R (optionally you can set EXECUTE AS USER).

This is a bit of a backdoor method, but it does work.

Understanding ROC Curves

Published 2016-08-03 by Kevin Feasel

Bob Horton explains ROC curves and shows how to create them in R:

ROC curves are commonly used to characterize the sensitivity/specificity tradeoffs for a binary classifier. Most machine learning classifiers produce real-valued scores that correspond with the strength of the prediction that a given case is positive. Turning these real-valued scores into yes or no predictions requires setting a threshold; cases with scores above the threshold are classified as positive, and cases with scores below the threshold are predicted to be negative. Different threshold values give different levels of sensitivity and specificity. A high threshold is more conservative about labelling a case as positive; this makes it less likely to produce false positive results but more likely to miss cases that are in fact positive (lower rate of true positives). A low threshold produces positive labels more liberally, so it is less specific (more false positives) but also more sensitive (more true positives). The ROC curve plots true positive rate against false positive rate, giving a picture of the whole spectrum of such tradeoffs.

ROC curves are one of the primary techniques for figuring out if a binary classifier “works.”
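
As a rough illustration of the threshold sweep described above, here is a minimal base R sketch. The scores and labels are made up for the example, and it treats scores at or above the threshold as positive (an assumption; the quoted post says "above"). It is not Bob Horton's code.

```r
# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores <- c(0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10)
labels <- c(1, 1, 0, 1, 1, 0, 1, 0, 0, 0)

# Every observed score is a candidate threshold, strictest first.
# Inf is added so the curve starts at (0, 0): nothing is labelled positive.
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))

# For each threshold, compute the false positive and true positive rates.
roc <- as.data.frame(t(sapply(thresholds, function(th) {
  predicted <- scores >= th
  c(threshold = th,
    fpr = sum(predicted & labels == 0) / sum(labels == 0),
    tpr = sum(predicted & labels == 1) / sum(labels == 1))
})))

# Area under the curve via the trapezoidal rule.
auc <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
```

Calling plot(roc$fpr, roc$tpr, type = "b") draws the curve; lowering the threshold moves you up and to the right, trading specificity for sensitivity.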

Azure ML Updates

Published 2016-08-03 by Kevin Feasel

David Smith walks us through new language engines supported in Azure ML:

ML studio now gives you even more flexibility, with new language engines supported in the language modules. Within the Execute Python Script module, you can now choose to use Python 2.7.11 or Python 3.5, both of which run within the Anaconda 4.0 distribution. And within the Execute R Script module, you can now choose Microsoft R Open 3.2.2 as your R engine, in addition to the existing CRAN R 3.1.0 engine. Microsoft R Open 3.2.2 not only gives you a newer R language engine, it also gives you access to a wealth of new R packages for use within ML Studio. Over 400 packages are pre-installed for use with the R Script module, and you can install and use any other R package (including CRAN packages and your own R packages) via the Script Bundle input port.

I’m interested in the Microsoft R Open language support, as Azure ML has otherwise been using a relatively old version of R (3.1.0).

Plotting Variables Against One Another

Published 2016-08-02 by Kevin Feasel

Simon Jackson shows how to plot multiple variables against one another using R:

This post is an extension of a previous one that appears here: https://drsimonj.svbtle.com/quick-plot-of-all-variables.

In that prior post, I explained a method for plotting the univariate distributions of many numeric variables in a data frame. This post does something very similar, but with a few tweaks that produce a very useful result. So, in general, I’ll skip over a few minor parts that appear in the previous post (e.g., how to use purrr::keep() if you want only variables of a particular type).

Read on for code, including a good bit of tidyr.

UN Voting Clusters

Published 2016-07-27 by Kevin Feasel

En El Margen charts voting clusters in UN data:

After some more digging, and a suggestion by @theMexIndian, I decided to look in more depth at the unvotes database that I wrote about some weeks ago.

This time, Amit suggested I do some hierarchical clustering of the votes. So here goes a very dirty first attempt…

Check this out as a case study in data analysis.
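
For a sense of what hierarchical clustering of votes looks like, here is a small base R sketch. The vote matrix is entirely hypothetical (five made-up countries, six made-up resolutions), and the distance and linkage choices are assumptions rather than what the linked post uses.

```r
# Hypothetical vote matrix: rows are countries, columns are resolutions
# (1 = yes, 0 = abstain, -1 = no). Names and values are invented.
votes <- rbind(
  A = c( 1,  1, -1,  1,  0,  1),
  B = c( 1,  1, -1,  1,  1,  1),
  C = c(-1, -1,  1, -1,  0, -1),
  D = c(-1,  0,  1, -1, -1, -1),
  E = c( 1,  0, -1,  1,  0,  1)
)

# Pairwise Euclidean distance between voting records.
d <- dist(votes, method = "euclidean")

# Agglomerative clustering with complete linkage.
hc <- hclust(d, method = "complete")

# Cut the tree into two blocs; returns a named vector of cluster ids.
groups <- cutree(hc, k = 2)
```

plot(hc) draws the dendrogram. In this toy data, A, B, and E vote alike and land in one bloc, with C and D in the other.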

Manning’s Equation

Published 2016-07-27 by Kevin Feasel

John Yagecic has a Shiny app which gives a Monte Carlo analysis of Manning’s Equation:

Monte Carlo analysis is a great way to explore the impact of input variable uncertainty on the results of engineering equations, and with vector variables and distribution and sampling functions at its core, R is a natural platform for this analysis.

Check out his app, which has a link to the code. Amazingly, this is only 107 lines of code.
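
To show why vectorized sampling makes R a good fit here, the following is a minimal sketch of a Monte Carlo run over Manning's equation in SI units, V = (1/n) R^(2/3) S^(1/2). The rectangular channel, the input distributions, and every parameter value are assumptions for illustration; this is not the code behind the Shiny app.

```r
set.seed(42)      # fixed seed so the sketch is repeatable
n_sims <- 10000

# Hypothetical rectangular channel; all values are illustrative.
width   <- 3.0                                   # channel width (m)
depth   <- rnorm(n_sims, mean = 1.2, sd = 0.1)   # flow depth (m)
slope   <- runif(n_sims, 0.0008, 0.0012)         # channel slope (m/m)
n_rough <- runif(n_sims, 0.025, 0.035)           # Manning's roughness n

# Geometry: flow area and hydraulic radius (area / wetted perimeter).
area   <- width * depth
radius <- area / (width + 2 * depth)

# Manning's equation, evaluated for all draws at once.
velocity  <- (1 / n_rough) * radius^(2 / 3) * sqrt(slope)   # m/s
discharge <- velocity * area                                # m^3/s

# Summarize the uncertainty in discharge.
discharge_summary <- quantile(discharge, c(0.05, 0.5, 0.95))
```

Each input is a vector of draws, so the equation itself stays a single line; the spread of discharge_summary reflects the combined input uncertainty.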

Enterprise R Security

Published 2016-07-27 by Kevin Feasel

Ramkumar Chandrasekeran discusses DeployR, an enterprise security model for R:

DeployR Enterprise is designed to deliver analytics solutions at scale to whomever needs it: inside or outside the enterprise. It also guarantees secure delivery of your analytics via DeployR web services. These secure web services integrate seamlessly with existing enterprise security solutions: Single Sign-On, LDAP, Active Directory, PAM, and Basic Authentication, can enforce access privileges already defined by your IT department for existing enterprise users and also have the capability to safely support anonymous users when needed.

There’s nothing groundbreaking here: it’s TLS (to encrypt network transmissions) and LDAPS (to control authentication and authorization). That there’s nothing groundbreaking is a good thing—that means companies will have most of the infrastructure in place to support this.

Range And Variance

Published 2016-07-25 by Kevin Feasel

Mala Mahadevan looks at calculating range, variance, and standard deviation in R and T-SQL:

The first and most common measure of dispersion is called ‘Range’. The range is just the difference between the maximum and minimum values in the dataset. It tells you how much gap there is between the two and therefore how wide the dataset is in terms of its values. It is, however, quite misleading when you have outliers in the data. If you have one value that is very large or very small, that can skew the Range and does not really mean you have values spanning the minimum to the maximum.

To reduce this kind of issue with outliers, a second variation of the range called the Inter-Quartile Range, or IQR, is used. The IQR is calculated by dividing the dataset into 4 equal parts after sorting the values in ascending order. For the first and third part, the maximum values are taken and then subtracted from each other. The IQR ensures that you are looking at near-top and near-bottom ranges and therefore the value it gives is probably spanning the range.

Just like her previous post, this one also includes an example built for SQL Server R Services.
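
Here is a quick base R sketch of these measures on a made-up sample, with and without an outlier. The data and the helper name spread are hypothetical. Note that R's IQR() interpolates quantiles by default, so it can differ slightly from the "maximum of each quarter" description above, and var() and sd() use the sample (n - 1) denominator.

```r
# Hypothetical sample, then the same sample with one extreme outlier.
x     <- c(2, 4, 4, 4, 5, 5, 7, 9)
x_out <- c(x, 100)

# Collect the dispersion measures discussed above in one named vector.
spread <- function(v) {
  c(range    = diff(range(v)),  # max minus min
    iqr      = IQR(v),          # third quartile minus first quartile
    variance = var(v),          # sample variance
    sd       = sd(v))           # sample standard deviation
}

without_outlier <- spread(x)
with_outlier    <- spread(x_out)
```

Comparing the two results, the range jumps from 7 to 98 once the outlier is added, while the IQR only moves from 1.5 to 3.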

SparkR + Zeppelin

Published 2016-07-21 by Kevin Feasel

I take a look at using SparkR and Zeppelin:

My goal is to do some of the things that I did in my Touching on Advanced Topics post. Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below. As a result, I was only able to do some—but not all—of the anticipated work. I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.

With that in mind, let’s start messing around.

SparkR is a bit of a mindset change from traditional R.

Missing Values In R

Published 2016-07-20 by Kevin Feasel

David Smith explains NA values in R:

Here’s a little puzzle that might shed some light on some apparently confusing behaviour by missing values (NAs) in R:

What is NA^0 in R?

You can get the answer easily by typing at the R command line:

> NA^0
[1] 1

But the interesting question that arises is: why is it 1? Most people might expect that the answer would be NA, like most expressions that include NA. But here’s the trick to understanding this outcome: think of NA not as a number, but as a placeholder for a number that exists, but whose value we don’t know.

Definitely read the comments on this one.
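
The "placeholder for an unknown value" reading can be checked with a few more expressions. This is a small sketch of standard R behaviour; the variable names are just for illustration.

```r
# If the answer is the same for every possible value, R returns that answer.
power_zero <- NA^0          # 1: anything raised to the power 0 is 1
or_true    <- NA || TRUE    # TRUE: TRUE OR anything is TRUE
and_false  <- NA && FALSE   # FALSE: FALSE AND anything is FALSE

# If the answer depends on the unknown value, the result stays NA.
times_zero <- NA * 0        # NA: the unknown could be Inf, and Inf * 0 is NaN
plus_one   <- NA + 1        # NA
or_false   <- NA || FALSE   # NA: depends on the unknown value
```

The first group is determined regardless of what the missing value is, so R can answer; the second group is not, so NA propagates.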

Category: R