Tom Fawcett shows us how to handle a tricky classification problem:
The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue.
Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. Here are some examples:
- About 2% of credit card accounts are defrauded per year. (Most fraud detection domains are heavily imbalanced.)
- Medical screening for a condition is usually performed on a large population of people without the condition, to detect a small minority with it (e.g., HIV prevalence in the USA is ~0.4%).
- Disk drive failures run at about 1% per year.
- The conversion rate of online ads has been estimated to lie between 10⁻³ and 10⁻⁶.
- Factory production defect rates typically run about 0.1%.
Many of these domains are imbalanced because they are what I call "needle in a haystack" problems, where machine learning classifiers are used to sort through huge populations of negative (uninteresting) cases to find the small number of positive (interesting, alarm-worthy) cases.
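To see why these needle-in-a-haystack cases are tricky, here is a minimal sketch (not from the article; the 2% figure echoes the credit-card example above) showing how plain accuracy misleads on imbalanced data: a trivial classifier that always predicts the majority class scores high accuracy while catching zero positives.

```python
# Hypothetical example: 1,000 accounts with a 2% fraud rate,
# mirroring the credit-card figure mentioned above.
labels = [1] * 20 + [0] * 980  # 1 = fraud (positive), 0 = legitimate

# A trivial "classifier" that always predicts the majority class.
predictions = [0] * len(labels)

# Accuracy looks excellent despite the model being useless.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the positive class exposes the failure: no fraud is caught.
recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / sum(labels)

print(f"accuracy: {accuracy:.1%}")  # 98.0%
print(f"recall:   {recall:.1%}")    # 0.0%
```

This is why work on imbalanced classes leans on metrics like precision, recall, and ROC/PR curves rather than raw accuracy.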
Read on for some good advice on how to handle imbalanced data.