Matthew Mayo shares a few tips:
Imperfect data is the norm rather than the exception in machine learning. Comparably common is the binary class imbalance when the classes in a trained data remains majority/minority class, or is moderately skewed. Imbalanced data can undermine a machine learning model by producing model selection biases. Therefore in the interest of model performance and equitable representation, solving the problem of imbalanced data during training and evaluation is paramount.
This article will define imbalanced data, resampling strategies as solution, appropriate evaluation metrics, kinds of algorithmic approaches, and the utility of synthetic data and data augmentation to address this imbalance.
Read on for five recommendations, starting with what you should know and then offering up four options for what you can do.