I continue my series on launching a data science project:
This next category of data cleansing has to do with specific values. I want to look at three particular sub-categories: mislabeled data, mismatched data, and incorrect data.
Mislabeled data happens when the label is incorrect. In a data science problem, the label is the thing that we are trying to explain or predict. For example, in our data set, we want to predict SalaryUSD based on various inputs. If somebody earns $50,000 per year but accidentally types 500000 instead of 50000, that record can skew our analysis. If we can fix the label, the record becomes useful again; if we cannot, it adds error to the model, which marginally reduces our ability to predict accurately.
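To make the extra-zero case concrete, here is a minimal sketch of one way to flag labels for review. It is an illustration under assumptions, not the approach this series settles on: the salary values are made up, and the cutoff of five times the median is an arbitrary choice that you would want to tune against your own data.

```python
from statistics import median

def flag_suspect_labels(salaries, factor=5.0):
    """Return the positions of salaries far above or below the median.

    `factor` is an arbitrary, hypothetical cutoff: a value is flagged when it
    is more than `factor` times the median or less than the median / `factor`.
    """
    mid = median(salaries)
    return [
        i for i, salary in enumerate(salaries)
        if salary > mid * factor or salary < mid / factor
    ]

# Made-up SalaryUSD values; the fourth looks like 50000 typed with an extra zero.
salaries = [48000, 52000, 50000, 500000, 61000, 45000]
suspects = flag_suspect_labels(salaries)
print(suspects)
```

A flag is only a prompt to investigate. Some people really do earn ten times the median, so whether a flagged label gets corrected, kept, or dropped is still a judgment call.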
Mismatched data happens when we join together data from sources which should not have been joined together. Let’s go back to the product title and UPC/MFC example. As we fuss with the data to try to join together these two data sets, we might accidentally write a rule which joins a product + UPC to the wrong product + MFC. We might be able to notice this with careful observation, but if we let it through, then we will once again thwart reality and introduce some additional error into our analysis. We could also end up with the opposite problem, where we have missed connections and potentially drop useful data out of our sample.
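One way to keep an eye on both failure modes is to audit the join instead of trusting it. The sketch below assumes two hypothetical data sets of (title, UPC) and (title, MFC) pairs, joined on a normalized product title; the normalization rule, the function names, and the sample rows are all invented for illustration.

```python
from collections import defaultdict

def normalize(title):
    """Hypothetical matching rule: lower-case and collapse whitespace."""
    return " ".join(title.lower().split())

def audit_join(upc_rows, mfc_rows):
    """Join on normalized title and report rows that need a human look.

    upc_rows: list of (title, upc) tuples; mfc_rows: list of (title, mfc) tuples.
    Returns (matched, ambiguous, unmatched_upc, unmatched_mfc).
    """
    mfc_by_key = defaultdict(list)
    for title, mfc in mfc_rows:
        mfc_by_key[normalize(title)].append(mfc)

    matched, ambiguous, unmatched_upc = [], [], []
    seen_keys = set()
    for title, upc in upc_rows:
        key = normalize(title)
        seen_keys.add(key)
        candidates = mfc_by_key.get(key, [])
        if len(candidates) == 1:
            matched.append((upc, candidates[0]))   # clean one-to-one match
        elif candidates:
            ambiguous.append((upc, candidates))    # rule may be joining the wrong product
        else:
            unmatched_upc.append(upc)              # missed connection on the UPC side

    # MFC rows whose key never appeared on the UPC side.
    unmatched_mfc = [
        mfc for key, mfcs in mfc_by_key.items() if key not in seen_keys
        for mfc in mfcs
    ]
    return matched, ambiguous, unmatched_upc, unmatched_mfc

# Made-up sample rows.
upc_rows = [("Blue Widget", "0001"), ("Red  widget", "0002"),
            ("Green Widget", "0003"), ("Gizmo", "0004")]
mfc_rows = [("blue widget", "M-10"), ("Red Widget", "M-20"),
            ("Gizmo", "M-30"), ("gizmo", "M-31"), ("Yellow Widget", "M-40")]

matched, ambiguous, unmatched_upc, unmatched_mfc = audit_join(upc_rows, mfc_rows)
```

The ambiguous bucket is where a too-loose rule tends to show up, and the two unmatched buckets are the missed connections. Counting all three after each rule change gives a rough sense of whether the rule is helping or hurting.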
Finally, I’m using the term incorrect data for cases where something other than the label is wrong. For example, in the data professional salary survey, there’s a person who works 200 hours per week. A week only has 168 hours, so while I admire this person’s dedication and ability to create roughly 1.33 extra days per week that the rest of us don’t experience, I think that person should have held out for more than just $95K/year. I mean, if I had the ability to generate spare days, I’d want way more than that.
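Values like this are often catchable with a simple sanity check against what is physically possible. Here is a minimal sketch; the field name HoursWorkedPerWeek and the sample rows are assumptions for illustration, and the only hard fact it relies on is that a week has 7 * 24 = 168 hours.

```python
HOURS_IN_WEEK = 7 * 24  # 168

def implausible_hours(rows, field="HoursWorkedPerWeek"):
    """Return rows whose weekly hours are non-positive or exceed 168.

    `field` is a hypothetical column name; adjust it to match the real data.
    """
    return [row for row in rows if not 0 < row[field] <= HOURS_IN_WEEK]

# Made-up rows; the second mirrors the 200-hour respondent described above.
rows = [
    {"SalaryUSD": 88000, "HoursWorkedPerWeek": 45},
    {"SalaryUSD": 95000, "HoursWorkedPerWeek": 200},
    {"SalaryUSD": 70000, "HoursWorkedPerWeek": 40},
]
bad = implausible_hours(rows)
print(bad)
```

A check like this only catches the impossible. Someone claiming 120 hours per week passes it, even though that figure probably deserves a second look as well.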
In this series, I’ve found myself writing a bit more than expected, so I’m breaking out theory from implementation. This is the theory post, with implementation coming next week.