Imbalanced Data In R

Rathnadevi Manivannan explains how to deal with imbalanced data using R:

Imbalanced data refers to classification problems where one class outnumbers other class by a substantial proportion. Imbalanced classification occurs more frequently in binary classification than in multi-level classification. For example, extreme imbalanced data can be seen in banking or financial data where majority credit card uses are acceptable and very few credit card uses are fraudulent.

With an imbalanced dataset, the information required to make an accurate prediction about the minority class cannot be obtained using an algorithm. So, it is recommended to use balanced classification dataset.

Rathnadevi uses fraudulent transactions for his sample, but medical diagnoses is also a good example:  suppose 1 person in 10,000 has a particular disease.  You’re 99.99% right if you just say nobody has the disease, but that’s a rather unhelpful model.

Related Posts

Inline Operators In R With wrapr

John Mount shows how to use inline operators in R with the wrapr package: The above code is assuming you have the wrapr package attached via already having run library('wrapr'). Notice we picked R-related operator names. We stayed away from overloading the + operator, as the arithmetic operators are somewhat special in how they dispatch in R. The goal wasn’t […]

Read More

Feature And Text Classification Using Naive Bayes In R

I wrap up my series on the Naive Bayes class of algorithms, finally writing some code along the way: Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, […]

Read More


October 2017
« Sep Nov »