Handling Imbalanced Data

Tom Fawcett shows us how to handle a tricky classification problem:

The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue.

Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. —Here are some examples:

  1. About 2% of credit card accounts are defrauded per year. (Most fraud detection domains are heavily imbalanced.)
  2. Medical screening for a condition is usually performed on a large population of people without the condition, to detect a small minority with it (e.g., HIV prevalence in the USA is ~0.4%).
  3. Disk drive failures are approximately ~1% per year.
  4. The conversion rates of online ads has been estimated to lie between 10-3 to 10-6.
  5. Factory production defect rates typically run about 0.1%.

Many of these domains are imbalanced because they are what I call needle in a haystackproblems, where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases.

Read on for some good advice on how to handle imbalanced data.

Related Posts

Comparing Keras In Python Versus R

Dmitry Kisler performs image classification using Keras in both Python and R: From the plots above, one can see that: the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance and the bias accuracy is […]

Read More

Auto-Encoders And KernelML

Rohan Kotwani gives us an example where KernelML might be better than TensorFlow or PyTorch: So what’s the point of using KernelML? 1. The parameters in each layer can be non-linear 2. Each parameter can be sampled from a different random distribution 3. The parameters can be transformed to meet certain constraints 4. Network combinations […]

Read More


November 2017
« Oct Dec »