Press "Enter" to skip to content

Handling Imbalanced Data

Tom Fawcett shows us how to handle a tricky classification problem:

The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue.

Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. —Here are some examples:

  1. About 2% of credit card accounts are defrauded per year. (Most fraud detection domains are heavily imbalanced.)
  2. Medical screening for a condition is usually performed on a large population of people without the condition, to detect a small minority with it (e.g., HIV prevalence in the USA is ~0.4%).
  3. Disk drive failures are approximately ~1% per year.
  4. The conversion rates of online ads has been estimated to lie between 10-3 to 10-6.
  5. Factory production defect rates typically run about 0.1%.

Many of these domains are imbalanced because they are what I call needle in a haystackproblems, where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases.

Read on for some good advice on how to handle imbalanced data.