Dealing with Imbalanced Class Data for Image Classification

Alexander Billington needs more beta carotene:

Image classification is a standard computer vision task: training a model to assign a label to a given image, such as a model that classifies images of different root vegetables. A big problem with classification is bias, where the model favours a particular image class above the others. A common cause is dataset imbalance, which is often hard to spot because a model trained on an imbalanced dataset can still have good accuracy. For example, if the test dataset has 1000 images, 950 potatoes and 50 carrots, and the model predicts all 1000 to be potatoes, it would still have 95% accuracy. This is also an example of why more metrics than accuracy should be considered… but let’s leave that discussion for another day.
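The potato and carrot arithmetic can be checked with a few lines of plain Python. This is only a sketch of the scenario in the quote: the label strings and the helper functions `accuracy` and `recall` are illustrative names, not code from the linked article. Per-class recall is used here as one example of an additional metric that exposes the problem.

```python
# Test set from the example: 950 potato images and 50 carrot images.
y_true = ["potato"] * 950 + ["carrot"] * 50
# A degenerate model that predicts "potato" for every image.
y_pred = ["potato"] * 1000


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of images whose true class is `label` that were predicted as `label`."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in preds_for_label) / len(preds_for_label)


print(accuracy(y_true, y_pred))          # 0.95
print(recall(y_true, y_pred, "potato"))  # 1.0
print(recall(y_true, y_pred, "carrot"))  # 0.0
```

Accuracy looks healthy at 0.95, while the carrot recall of 0.0 shows the model never identifies a single carrot.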

Click through for several techniques you can use to balance out classes, with a focus on image classification. Undersampling is almost always a no-go for me, though I am much fonder of the other techniques.
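To make the cost of undersampling concrete, here is a minimal sketch of random undersampling applied to the same 950/50 split. The function `random_undersample` is a hypothetical helper written for illustration and is not taken from the linked article; the point about discarded data is one plausible reason to be wary of the technique, not necessarily the only one.

```python
import random
from collections import Counter


def random_undersample(labels, seed=0):
    """Return sorted indices that keep an equal number of samples per class.

    Every class is cut down to the size of the smallest class.
    """
    rng = random.Random(seed)
    indices_by_class = {}
    for i, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in indices_by_class.values())
    kept = []
    for idx in indices_by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


labels = ["potato"] * 950 + ["carrot"] * 50
kept = random_undersample(labels)

print(Counter(labels[i] for i in kept))  # 50 potatoes and 50 carrots remain
print(len(labels) - len(kept))           # 900 images discarded
```

Balancing this way leaves only 100 of the original 1000 images, so 90% of the data never reaches the model.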