ROC curves are commonly used to characterize the sensitivity/specificity tradeoffs for a binary classifier. Most machine learning classifiers produce real-valued scores that correspond with the strength of the prediction that a given case is positive. Turning these real-valued scores into yes or no predictions requires setting a threshold; cases with scores above the threshold are classified as positive, and cases with scores below the threshold are predicted to be negative. Different threshold values give different levels of sensitivity and specificity. A high threshold is more conservative about labelling a case as positive; this makes it less likely to produce false positive results but more likely to miss cases that are in fact positive (lower rate of true positives). A low threshold produces positive labels more liberally, so it is less specific (more false positives) but also more sensitive (more true positives). The ROC curve plots true positive rate against false positive rate, giving a picture of the whole spectrum of such tradeoffs.
ROC curves are one of the primary techniques for figuring out if a binary classifier “works.”
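The thresholding described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the linked post: the `roc_points` helper, the scores, and the labels are all made up for the example, and it treats a score at or above the threshold as a positive prediction.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs,
    one per candidate threshold, sweeping from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # A threshold above every score predicts nothing as positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted, labels))
        fp = sum(p and y == 0 for p, y in zip(predicted, labels))
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical classifier scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

roc = roc_points(scores, labels)
for fpr, tpr in roc:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add positive predictions, so both rates move from (0, 0) toward (1, 1) as the sweep proceeds; plotting the pairs gives the ROC curve.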
ML Studio now gives you even more flexibility, with new language engines supported in the language modules. Within the Execute Python Script module, you can now choose to use Python 2.7.11 or Python 3.5, both of which run within the Anaconda 4.0 distribution. And within the Execute R Script module, you can now choose Microsoft R Open 3.2.2 as your R engine, in addition to the existing CRAN R 3.1.0 engine. Microsoft R Open 3.2.2 not only gives you a newer R language engine, it also gives you access to a wealth of new R packages for use within ML Studio. Over 400 packages are pre-installed for use with the R Script module, and you can install and use any other R package (including CRAN packages and your own R packages) via the Script Bundle input port.
I’m interested in the Microsoft R Open language support, as Azure ML’s still using a relatively older version of R (3.1.0).
This post is an extension of a previous one that appears here: https://drsimonj.svbtle.com/quick-plot-of-all-variables.
In that prior post, I explained a method for plotting the univariate distributions of many numeric variables in a data frame. This post does something very similar, but with a few tweaks that produce a very useful result. So, in general, I’ll skip over a few minor parts that appear in the previous post (e.g., how to use purrr::keep() if you want only variables of a particular type).
Read on for code, including a good bit of tidyr.
This time, Amit suggested I do some hierarchical clustering of the votes. So here goes a very dirty first attempt…
Check this out as a case study in data analysis.
Monte Carlo analysis is a great way to explore the impact of input variable uncertainty on the results of engineering equations, and with vector variables and distribution and sampling functions at its core, R is a natural platform for this analysis.
Check out his app, which has a link to the code. Amazingly, this is only 107 lines of code.
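To make the idea concrete, here is a minimal Python sketch of Monte Carlo uncertainty propagation. It is not the linked app's code: the equation (stress = force / area), the distributions, and every parameter value are assumptions chosen purely for illustration.

```python
import random
import statistics


def simulate_stress(n_draws, seed=42):
    """Draw uncertain inputs and push each draw through stress = force / area."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = []
    for _ in range(n_draws):
        force = rng.gauss(1000.0, 50.0)   # newtons; hypothetical mean and sd
        area = rng.uniform(0.009, 0.011)  # square metres; hypothetical tolerance band
        results.append(force / area)
    return results


stresses = simulate_stress(10_000)
mean_stress = statistics.mean(stresses)
sd_stress = statistics.stdev(stresses)
cuts = statistics.quantiles(stresses, n=20)  # 5%, 10%, ..., 95% cut points
p05, p95 = cuts[0], cuts[-1]

print(f"mean={mean_stress:.0f} Pa, sd={sd_stress:.0f} Pa")
print(f"90% of draws fall between {p05:.0f} and {p95:.0f} Pa")
```

The spread of the simulated outputs, rather than a single point estimate, is what the analysis is after: it shows how much the input uncertainty matters for the result.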
DeployR Enterprise is designed to deliver analytics solutions at scale to whoever needs them, inside or outside the enterprise. It also guarantees secure delivery of your analytics via DeployR web services. These secure web services integrate seamlessly with existing enterprise security solutions (Single Sign-On, LDAP, Active Directory, PAM, and Basic Authentication), can enforce access privileges already defined by your IT department for existing enterprise users, and can safely support anonymous users when needed.
There’s nothing groundbreaking here: it’s TLS (to encrypt network transmissions) and LDAPS (to control authentication and authorization). That there’s nothing groundbreaking is a good thing—that means companies will have most of the infrastructure in place to support this.
The first and most common measure of dispersion is called the ‘Range’. The range is just the difference between the maximum and minimum values in the dataset. It tells you how much gap there is between the two and therefore how wide the dataset is in terms of its values. It is, however, quite misleading when you have outliers in the data. A single value that is very large or very small can skew the range, so a wide range does not necessarily mean your values are spread across the whole span from minimum to maximum.
To reduce the impact of outliers, a second variation of the range, called the Inter-Quartile Range or IQR, is used. The IQR is calculated by sorting the values in ascending order and dividing the dataset into 4 equal parts. The maximum values of the first and third parts are taken, and the former is subtracted from the latter. Because the IQR leaves out the top and bottom quarters of the data, the value it gives better reflects the spread of the bulk of the values.
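A short Python sketch shows how differently the two measures react to a single outlier. The numbers are invented for the example, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which interpolates between values; that is one of several quartile conventions and may give slightly different figures from the "maximum of each part" description above.

```python
import statistics


def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


def iqr(values):
    """Third quartile minus first quartile (interpolated quartiles)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical data, with and without one extreme value.
data = [21, 23, 24, 25, 26, 27, 29, 31]
with_outlier = data + [95]

print("range:", value_range(data), "->", value_range(with_outlier))
print("IQR:  ", iqr(data), "->", iqr(with_outlier))
```

One extreme value stretches the range several times over, while the IQR barely moves, which is why it is the more robust of the two.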
Just like her previous post, this one also includes an example built for SQL Server R Services.
My goal is to do some of the things that I did in my Touching on Advanced Topics post. Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below. As a result, I was only able to do some—but not all—of the anticipated work. I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.
With that in mind, let’s start messing around.
SparkR is a bit of a mindset change from traditional R.
Here’s a little puzzle that might shed some light on some apparently confusing behaviour by missing values (NAs) in R:
What is NA^0 in R?
You can get the answer easily by typing NA^0 at the R command line; it returns 1.
But the interesting question that arises is: why is it 1? Most people might expect that the answer would be NA, like most expressions that include NA. But here’s the trick to understanding this outcome: think of NA not as a number, but as a placeholder for a number that exists, but whose value we don’t know.
Definitely read the comments on this one.
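Python has no direct equivalent of R's NA, but its floating-point NaN shows a loosely analogous pattern under the IEEE 754 conventions Python follows. The sketch below is an analogy only, not a description of how R implements NA: when the result is the same for every possible number, the unknown value does not matter.

```python
import math

unknown = float("nan")  # stand-in for "some number we don't know"

# Anything raised to the power 0 is 1, so the missing value is irrelevant.
zero_power = unknown ** 0

# The result of these depends on the missing value, so they stay unknown.
product = unknown * 0
shifted = unknown + 1

print(zero_power, product, shifted)
```

The product stays unknown because the hidden number could be infinite, and infinity times zero is undefined; the power case has no such exception.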
With R integration into SQL Server 2016 we can pull an R script and integrate it rather easily. I will be covering all 3 approaches. I am using a small dataset – a single table with 915 rows, with a SQL Server 2016 installation and RStudio. The complexities of doing this type of analysis in the real world with bigger datasets involve setting various options for performance and dealing with memory issues – because R is very memory-intensive and single-threaded.
My table and the data it contains can be created with scripts here. For this specific post I used just one column in the table – age. For further posts I will be using the other fields such as country and gender.
Mala compares T-SQL versus R for calculating minimum, maximum, mean, and mode. She wraps the post up by showing how to call her R code via T-SQL using SQL Server R Services.
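For comparison, the same four statistics take only a few lines in Python's standard library. This is a sketch for orientation, not the T-SQL or R code from the linked post; the `ages` list is hypothetical and merely stands in for the post's age column.

```python
import statistics

# Hypothetical ages standing in for the post's age column.
ages = [34, 27, 45, 27, 52, 38, 27, 45]

summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    # multimode returns every most-frequent value, so ties are not hidden.
    "mode": statistics.multimode(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Mode is the odd one out in all three languages: minimum, maximum, and mean are built-in aggregates, while the mode needs a frequency count and a decision about how to handle ties.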