Area Under The ROC Is Not Accuracy

Stephen Chen debunks bad journalistic summaries of a Google research paper:

Journalists latched onto Google’s NN 0.95 score vs. the comparison 0.86 (see EWS Strawman below), as the accuracy of determining mortality. However, the actual metric the researchers used is AUROC (Area Under Receiver Operating Characteristic Curve), which is not a measure of predictive accuracy that indexes the difference between predicted and actual values like RMSE (Root Mean Squared Error) or MAPE (Mean Absolute Percentage Error). Some articles even erroneously try to explain the 0.95 as the odds ratio.

Just as the concept of significance has different meanings to statisticians and laypersons, AUROC as a measure of model accuracy does not mean the probability of Google’s NN predicting mortality accurately, as journalists/laypersons have taken it to mean. The ROC (see sample above) is a plot of a model’s False Positive Rate (i.e. predicting mortality where there is none) vs. the True Positive Rate (i.e. correctly predicting mortality). A larger area under the curve (AUROC) means the model produces fewer False Positives, not the certainty of mortality as journalists erroneously suggest.

The researchers themselves made no claim to soothsayer abilities; what they said in the paper was:

… (their) deep learning model would fire half the number of alerts of a traditional predictive model, resulting in many fewer false positives.

It’s an interesting article and a reminder of the importance of terminological precision (something I personally am not particularly good at).
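
One way to internalize what a number like 0.95 actually means: AUROC is the probability that a randomly chosen positive case gets a higher predicted score than a randomly chosen negative case. Here is a minimal T-SQL sketch of that pairwise formulation against a hypothetical #scores table (the labels and scores are made up purely for illustration):

CREATE TABLE #scores
(
    label BIT NOT NULL,    -- 1 = event occurred, 0 = it did not
    score FLOAT NOT NULL   -- model's predicted risk
);

INSERT INTO #scores (label, score) VALUES
    (1, 0.92), (1, 0.81), (1, 0.64),
    (0, 0.70), (0, 0.35), (0, 0.22);

-- AUROC = P(random positive outranks random negative); ties count as 1/2
SELECT AVG(CASE WHEN p.score > n.score THEN 1.0
                WHEN p.score = n.score THEN 0.5
                ELSE 0.0 END) AS auroc
FROM #scores AS p
CROSS JOIN #scores AS n
WHERE p.label = 1
  AND n.label = 0;

Read that way, an AUROC of 0.95 says the model ranks a randomly chosen patient who died above a randomly chosen patient who survived about 95% of the time; it is a statement about discrimination, not about “95% accurate predictions.”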

There Is No Easy Button With Predictive Analytics

Scott Mutchler dispels some myths:

There are a couple of myths that I see more and more these days.  Like many myths, they seem plausible on the surface, but experienced data scientists know that the reality is more nuanced (and sadly requires more work).

Myths:

  • Deep learning (or Cognitive Analytics) is an easy button.  You can throw massive amounts of data at it and the algorithm will deliver a near-optimal model.
  • Big data is always better than small data.  More rows of data always results in a significantly better model than fewer rows of data.

Both of these myths lead some (lately it seems many) people to conclude that data scientists will eventually become superfluous.  With enough data and advanced algorithms, maybe we don’t need these expensive data scientists…

Read on for a dismantling of these myths.  There’s a lot more to it than “collect all of the data and throw it at an algorithm” (and even then, “all” the data rarely really means all, which I think deserves to be a third myth).  H/T R-bloggers

SQL Server Memory Usage Myths

Eric Blinn has a few myths regarding memory usage in SQL Server:

My VM administrator says that I’m not using all the memory I asked for.  In fact, 70% of it is idle at any given time.  We’re going to return that memory to the resource pool to better utilize it on other VMs.

The VM administrators are not lying or misinterpreting what is on their screen.  The metrics displayed on their management tool (Microsoft Hyper-V Manager or VMware vSphere Client) are lying to them.  When the VM management tool is checking on memory activity it asks the OS.  The OS only knows that this memory was allocated to SQL Server a long time ago and that it hasn’t moved since.  It appears to the OS to be stagnant and unused, and this is what it reports to the VM management tool.  However, this memory, once allocated to SQL Server, is in the domain of SQLOS, which is likely very actively using this memory in a way that is largely invisible to the OS, and by extension, the VM management tool and eventually the VM administrators themselves.  The VM tools are not yet smart enough to ask SQLOS what is going on with the memory and report falsely that the memory is not being effectively utilized.

Worth reading, particularly if your sysadmins are trying to free up some of that “unused” memory.
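
If you want to see the picture from SQLOS’s side rather than the hypervisor’s, a quick spot check (a sketch, not a full memory audit) is to ask the DMVs directly:

-- How much physical memory the SQL Server process is actually holding
SELECT physical_memory_in_use_kb / 1024 AS sql_memory_in_use_mb,
       locked_page_allocations_kb / 1024 AS locked_pages_mb,
       memory_utilization_percentage
FROM sys.dm_os_process_memory;

-- Page life expectancy: how long data pages survive in the buffer pool.
-- A buffer pool that is steadily serving reads is anything but idle memory.
SELECT cntr_value AS page_life_expectancy_sec
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Page life expectancy';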

Separating Data And Log Files

Brent Ozar looks at an old chestnut:

So it’s time for a quiz:

  1. If you put all of a SQL Server’s data files & logs on a single volume, how many failures will that server experience per year?
    • Bonus question: what kinds of data loss and downtime will each of those failure(s) have?
  2. If you split a SQL Server’s data files onto one volume, and log files onto another volume, how many failures will that server experience per year?
    • Bonus question: what kinds of data loss and downtime will each of those failures have?

Think carefully about the answers – or read the comments to see someone else’s homework, hahaha – before you move on.

With SANs, this advice is not even that good on the performance side—especially with modern SANs which don’t let you dedicate spindles.  It definitely doesn’t fly on the reliability side.

47 Incorrect Deployment Assumptions

Brent Ozar has a list of 47 assumptions regarding database deployments that don’t always turn out to be true:

30. The deployment person wouldn’t dream of only highlighting some of it and running it.

31. The staff who were supposed to work with you during the deployment will be available.

32. The staff, if available at the start of the call, will be available during the entire call.

33. The staff won’t come down with food poisoning halfway through the deployment call, forget to mute their home office phone, step into the bathroom, and leave the bathroom door open.

I’ve never had item #33 happen to me, but that’s a pretty solid list of stuff that can go wrong.

Benchmarking

Lukas Eder notes that the best way to compare performance is to…compare performance:

To bust a myth, if you have good reasons to think that a differently written, but semantically equivalent query might be faster (on your database), you should measure. Don’t even trust any execution plan, because ultimately, what really counts is the wall clock time in your production system.

If you can measure your queries in production, that’s perfect. But often, you cannot – but you don’t always have to. One way to compare two queries with each other is to benchmark them by executing each query hundreds or even thousands of times in a row.

Lukas goes on to compare a left join to a correlated subquery in three separate database products.  The results are very interesting, although absolutely not comparable across products because the DeWitt Clause is still a thing.  Great read.
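
Lukas’s post uses vendor-specific anonymous blocks; the general shape of the harness in T-SQL looks something like this sketch (dbo.SomeTable, dbo.ChildTable, and the query under test are placeholders for your own candidates):

SET NOCOUNT ON;

DECLARE @iterations INT = 1000,
        @i INT = 0,
        @dummy INT,
        @start DATETIME2(7) = SYSDATETIME();

WHILE @i < @iterations
BEGIN
    -- Candidate query #1; assign to a variable so returning result sets doesn't skew the timing
    SELECT @dummy = COUNT(*)
    FROM dbo.SomeTable AS t
    WHERE EXISTS (SELECT 1 FROM dbo.ChildTable AS c WHERE c.ParentID = t.ID);

    SET @i += 1;
END;

SELECT DATEDIFF(MILLISECOND, @start, SYSDATETIME()) AS elapsed_ms;
-- Repeat the same harness with candidate query #2 and compare elapsed_ms.

Keep in mind that hammering the same query thousands of times keeps everything warm in cache, so this mostly measures CPU and plan cost rather than cold I/O.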

Clustered Index And Physical Storage

Wayne Sheffield busts a myth:

In several of my last few blog posts, I’ve shared several methods of getting internal information from a database by using the DBCC PAGE command and utilizing the “WITH TABLERESULTS” option to be allowed to automate this process for further processing. This post will also do this, but in this case, we’ll be using it to bust a common myth—data in a clustered index is physically stored on disk in the order of the clustered index.

Busting this myth

To bust this myth, we’ll create a database, put a table with a clustered index into this database, and then we’ll add some rows in random order. Next, we will show that the rows are stored on the pages in logical order, and then we’ll take a deeper look at the page internals to see that the rows are not stored in physical order.

Read on for the proof.
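
If you want to follow along at home, the setup looks roughly like this sketch (the database name, table, and page number are placeholders; Wayne’s post walks through interpreting the actual DBCC PAGE output):

CREATE TABLE dbo.ClusteredMyth
(
    ID INT NOT NULL CONSTRAINT PK_ClusteredMyth PRIMARY KEY CLUSTERED,
    Filler CHAR(1000) NOT NULL CONSTRAINT DF_ClusteredMyth_Filler DEFAULT ('x')
);

-- Insert in deliberately random key order
INSERT INTO dbo.ClusteredMyth (ID) VALUES (5), (1), (9), (3), (7);

-- Undocumented: list the pages allocated to the clustered index (index id 1)
DBCC IND ('TestDb', 'dbo.ClusteredMyth', 1);

-- Undocumented: dump one of those pages; the slot array gives the logical order,
-- while the row offsets show where each row physically sits on the page.
-- 312 is a placeholder: use a PagePID value returned by DBCC IND above.
DBCC PAGE ('TestDb', 1, 312, 3) WITH TABLERESULTS;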

Subqueries And Performance

Grant Fritchey busts a myth:

I’ve written before about the concept of cargo cult data professionals. They see one issue, one time, and consequently extrapolate that to all issues, all the time. It’s the best explanation I have for why someone would suggest that a sub-query is flat out wrong and will hurt performance.

Let me put a caveat up front (which I will reiterate in the conclusion, just so we’re clear), there’s nothing magically good about sub-queries just like there is nothing magically evil about sub-queries. You can absolutely write a sub-query that performs horribly, does horrible things, runs badly, and therefore absolutely screws up your system. Just as you can with any kind of query. I am addressing the bad advice that a sub-query is to be avoided because they will inherently lead to poor performance.

There are times not to use subqueries, but this post is absolutely correct:  understand the reasons why things may or may not perform well, and don’t be afraid to try things out.
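
As a trivial illustration of why blanket avoidance doesn’t hold up, these two forms (against a hypothetical Customers/Orders schema) frequently compile to very similar or identical execution plans, so measure rather than assume:

-- Measure rather than assume
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Correlated sub-query form
SELECT c.CustomerID, c.CustomerName
FROM dbo.Customers AS c
WHERE EXISTS (SELECT 1 FROM dbo.Orders AS o WHERE o.CustomerID = c.CustomerID);

-- Join form (DISTINCT to keep the semantics when a customer has many orders)
SELECT DISTINCT c.CustomerID, c.CustomerName
FROM dbo.Customers AS c
INNER JOIN dbo.Orders AS o
    ON o.CustomerID = c.CustomerID;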

Fallacies of Data Science

Adnan Masood and David Lazar have a list of fallacies in the world of data science:

Extrapolating beyond the range of training data, especially in the case of time series data, is fine providing the data-set is large enough.

Strong Evidence is the same as a Proof! Prediction intervals and confidence intervals are the same thing, just like statistical significance and practical significance.

These are some good things to think about if you’re getting into analytics.

NOCOUNT Performance Differences

Aaron Bertrand looks into whether SET NOCOUNT ON provides a performance improvement:

For years, I had been operating under the impression that SET NOCOUNT ON; was a critical part of any performance strategy. This was based on observations I had made in, arguably, a different era, and that are less likely to manifest today.

Check out the comments as well.  This is an interesting conundrum as there’s a lot of ingrained knowledge that SET NOCOUNT ON is faster (and I admit that I thought I remembered it being the case when going through loops), but people have had limited success in coming up with a scenario in which it makes an appreciable difference.
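
If you want to test it against your own workload, the shape of the experiment is simple.  Here’s a sketch (dbo.NumbersTest is a placeholder table, and the loop is deliberately chatty so the per-statement row count messages have a chance to matter):

SET NOCOUNT OFF;   -- baseline: every statement sends an "N rows affected" message

DECLARE @i INT = 0,
        @start DATETIME2(7) = SYSDATETIME();

WHILE @i < 100000
BEGIN
    UPDATE dbo.NumbersTest
    SET Value = Value
    WHERE Number = @i % 1000;

    SET @i += 1;
END;

SELECT DATEDIFF(MILLISECOND, @start, SYSDATETIME()) AS elapsed_ms_nocount_off;

-- Flip the first line to SET NOCOUNT ON; and rerun, ideally from a remote client,
-- since the row count messages are network chatter more than server-side work.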
