Understanding ROC Curves

Bob Horton explains ROC curves and shows how to create them in R:

ROC curves are commonly used to characterize the sensitivity/specificity tradeoffs for a binary classifier. Most machine learning classifiers produce real-valued scores that correspond with the strength of the prediction that a given case is positive. Turning these real-valued scores into yes or no predictions requires setting a threshold; cases with scores above the threshold are classified as positive, and cases with scores below the threshold are predicted to be negative. Different threshold values give different levels of sensitivity and specificity. A high threshold is more conservative about labelling a case as positive; this makes it less likely to produce false positive results but more likely to miss cases that are in fact positive (lower rate of true positives). A low threshold produces positive labels more liberally, so it is less specific (more false positives) but also more sensitive (more true positives). The ROC curve plots true positive rate against false positive rate, giving a picture of the whole spectrum of such tradeoffs.

ROC curves are one of the primary techniques for figuring out if a binary classifier “works.”

Structured Streaming

Matei Zaharia, et al discuss how to use structured streaming in Apache Spark 2.0:

In Structured Streaming, we tackle the issue of semantics head-on by making a strong guarantee about the system: at any time, the output of the application is equivalent to executing a batch job on a prefix of the data. For example, in our monitoring application, the result table in MySQL will always be equivalent to taking a prefix of each phone’s update stream (whatever data made it to the system so far) and running the SQL query we showed above. There will never be “open” events counted faster than “close” events, duplicate updates on failure, etc. Structured Streaming automatically handles consistency and reliability both within the engine and in interactions with external systems (e.g. updating MySQL transactionally).

If you want to learn more about streaming data using Spark, check this out.

Solving The Cube Processing Mystery

SQL Sasquatch has a follow-up on his last post and answers the question of why there was a gap:

Last week I posted some of my perfmon graphs from an SSAS server.  I want to model the work happening on an SSAS server during multidimensional cubes processing – both dimension and fact processing.

That’s here:
SSAS Multidimensional Cube Processing – What Kind of Work is Happening?

There was a window at the end of the observation period where data was still flowing across the network, CPU utilization was still high enough to indicate activity, but the counters for rows read/converted/written/created per second were all zero.  The index rows/sec counter was also zero.

Check it out.


Aaron Bertrand discusses two Unicode schools of thought:

One of the more common dilemmas schema designers face, and I’m realizing it’s something that I should probably spend more time on in my presentations, is whether to use varchar or nvarchar for columns that will store string data. As part of my #EntryLevel challenge, I thought I’d start by writing a bit about this here.

In general, I come across two schools of thought on this:

  1. Use varchar unless you know you need to support Unicode.

  2. Use nvarchar unless you know you don’t.

My preference is to start with Unicode.  15 years ago, you could easily get away with using ASCII for most US-developed systems, but the likelihood that you will need to store data in multiple, varied languages is significantly higher today.  And having already refactored one application to support Unicode after it became fairly large, I’d rather not do that again…

Human-Readable Ranges

Kevin Feasel



Daniel Hutmacher shows us how to build human-readable ranges of integers and dates:

This is a real-world problem that I came across the other day. In a reporting scenario, I wanted to output a number of values in an easy, human-readable way for a report. But just making a long, comma-separated string of numbers doesn’t really make it very readable. This is particularly true when there are hundreds of values.

So here’s a powerful pattern to solve that task.

I really like this.  It takes the gaps & islands problem and goes one step further.

Azure ML Updates

David Smith walks us through new language engines supported in Azure ML:

ML studio now gives you even more flexibility, with new language engines supported in the language modules. Within the Execute Python Script module, you can now choose to use Python 2.7.11 or Python 3.5, both of which run within the Acaconda 4.0 distribution. And within the Execute R Script module, you can now choose Microsoft R Open 3.2.2 as your R engine, in addition to the existing CRAN R 3.1.0 engine. Microsoft R Open 3.2.2 not only gives you a newer R language engine, it also gives you access to a wealth of new R packages for use within ML Studio. Over 400 packages are pre-installed for use with the R Script module, and you can install and use any other R package (including CRAN packages and your own R packages) via the Script Bundle input port.

I’m interested in the Microsoft R Open language support, as Azure ML’s still using a relatively older version of R (3.1.0).

vNet Peering Within An Azure Region

Denny Cherry reports that there is a public preview of a feature to allow vNet peering without setting up a site-to-site VPN connection:

Up until August 1st if you had 2 vNets in the same Azure region (USWest for example) you needed to create a site to site VPN between them in order for the VMs within each vNet to be able to see each other.  I’m happy to report that this is no longer the case (it is still the default configuration).  On August 1st, 2016 Microsoft released a new version of the Azure portal which allows you to enable vNet peering between vNets within an account.

Now this feature is in public preview (aka. Beta) so you have to turn it on, which is done through Azure PowerShell. Thankfully it uses the Register-AzureRmProviderFeature cmdlet so you don’t need to have the newest Azure PowerShell installed, just something fairly recent (I have 1.0.7 installed). To enable the feature just request to be included in the beta like so (don’t forget to login with add-AzureRmAccount and then select-AzureRmSubscription).

Read the whole thing for details on how to enroll in this feature and how to set it up.

Thinking About Index Design

Jeremiah Peschka looks at a scenario in which a heap might be superior to a clustered index:

In this case, we have to assume that Event IDs may be coming from anywhere and, as such, may not arrive in order. Even though we’re largely appending to the table, we may not be appending in a strict order. Using a clustered index to support the table isn’t the best option in this case – data will be inserted somewhat randomly. We’ll spend maintenance cycles defragmenting this data.

Another downside to this approach is that data is largely queried by Owner ID. These aren’t unique, and one Owner IDcould have many events or only a few events. To support our querying pattern we need to create a multi-column clustering key or create an index to support querying patterns.

This result is not intuitive to me, and I recommend reading the whole thing.

Don’t Use Cron For Scheduling Hadoop Jobs

Matthew Rathbone explains why cron is not a great choice for scheduling Hadoop and Spark jobs:

Reason 3: Poor transparency for teammates

Which jobs are running right now? Which are going to run today? How long do these jobs take? How do I schedule my job? What machine should I schedule it on? These are all questions that are impossible to answer without building custom orchestration around your Cron process – time you’d be better off spending on building a better system.

Matthew then gives us four alternative products.

Plotting Variables Against One Another

Kevin Feasel



Simon Jackson shows how to plot multiple variables against one another using R:

This post is an extension of a previous one that appears here:https://drsimonj.svbtle.com/quick-plot-of-all-variables.

In that prior post, I explained a method for plotting the univariate distributions of many numeric variables in a data frame. This post does something very similar, but with a few tweaks that produce a very useful result. So, in general, I’ll skip over a few minor parts that appear in the previous post (e.g., how to use purrr::keep() if you want only variables of a particular type).

Read on for code, including a good bit of tidyr.


May 2018
« Apr