Quickly Computing Area Under The Curve

Jean-Francois Puget has a fast method for computing Area Under the Curve in Python:

When the target only takes two values we have a binary classification problem at hand.  Example of binary classification are very common. For instance fraud detection where examples are credit card transactions, features are time, location, amount, merchant id, etc., and target is fraud or not fraud.  Spam detection is also a binary classification where examples are emails, features are the email content as a string of words, and target is spam or not spam.  Without loss of generality we can assume that the target values are 0 and 1, for instance 0 means no fraud or no spam, whiloe 1 means fraud or spam.

For binary classification, predictions are also binary.  Therefore, a prediction is either equal to the target, or is off the mark.  A simple way to evaluate model performance is accuracy: how many predictions are right? For instance, if our test set has 100 examples in it, how many times is the prediction correct?  Accuracy seems a logical way to evaluate performance: a higher accuracy obviously means a better model.  At least this is what people think when they are exposed to the first time to binary classification problems.  Issue is that accuracy can be extremely misleading.

Read Jean-Francois’ explanation and scroll down for the Python sample.

Related Posts

Calculating AUC in R

Andrew Treadway shows how you can calculate Area Under the Curve in R: AUC is an important metric in machine learning for classification. It is often used as a measure of a model’s performance. In effect, AUC is a measure between 0 and 1 of a model’s performance that rank-orders predictions from a model. For […]

Read More

Python versus R (Again)

Alex Woodie looks at whether Python is dominating R in the data science space: There is some evidence that Python’s popularity is hurting R usage. According to the TIOBE Index, Python is currently the third most popular language in the world, behind perennial heavyweights Java and C. From August 2018 to August 2019, Python usage surged […]

Read More


November 2017
« Oct Dec »