Stephen Chen debunks bad journalistic summaries of a Google research paper:

Journalists latched onto Google’s NN 0.95 score vs. the comparison 0.86 (see **EWS Strawman** below), as the accuracy of determining mortality. However the actual metric the researchers used is AUROC (Area Under Receiver Operating Characteristic Curve) and not a measure of predictive accuracy that indexes the difference between the predicted vs. actual like RMSE (Root Mean Squared Error) or MAPE (Mean Absolute Percentage Error). Some articles even erroneously try to explain the 0.95 as the odds ratio.

Just as the concept of significance has different meanings to statisticians and laypersons, AUROC as a measure of model accuracy does not mean the probability of Google’s NN predicting mortality accurately as journalists/laypersons have taken it to mean. The ROC (see sample above) is a plot of a model’s False Positive Rate (i.e. predicting mortality where there is none) vs. the True Positive Rate (i.e. correctly predicting mortality). A larger area under the curve (AUROC) __means the model produces less False Positives__, not the certainty of mortality as journalists erroneously suggest.

The researchers themselves made no claim to soothsayer abilities, what they said in the paper was:

*… (their) deep learning model would fire half the number of alerts of a traditional predictive model, resulting in many fewer false positives.*

It’s an interesting article and a reminder of the importance of terminological precision (something I personally am not particularly good at).