The Process Of Processing Data

I continue my series on launching a data science project:

This next category of data cleansing has to do with specific values.  I want to look at three particular sub-categories:  mislabeled data, mismatched data, and incorrect data.

Mislabeled data happens when the label is incorrect.  In a data science problem, the label is the thing that we are trying to explain or predict.  For example, in our data set, we want to predict SalaryUSD based on various inputs.  If somebody earns $50,000 per year but accidentally types 500000 instead of 50000, it can potentially affect our analysis.  If you can fix the label, this data becomes useful again, but if you cannot, it increases the error, which means we have a marginally lower capability for accurate prediction.

Mismatched data happens when we join together data from sources which should not have been joined together.  Let’s go back to the product title and UPC/MFC example.  As we fuss with the data to try to join together these two data sets, we might accidentally write a rule which joins a product + UPC to the wrong product + MFC.  We might be able to notice this with careful observation, but if we let it through, then we will once again thwart reality and introduce some additional error into our analysis.  We could also end up with the opposite problem, where we have missed connections and potentially drop useful data out of our sample.

Finally, I’m calling incorrect data where something other than the label is wrong.  For example, in the data professional salary survey, there’s a person who works 200 hours per week.  While I admire this person’s dedication and ability to create 1.25 extra days per week that the rest of us don’t experience, I think that person should have held out for more than just $95K/year.  I mean, if I had the ability to generate spare days, I’d want way more than that.

In this series, I’ve found myself writing a bit more than expected, so I’m breaking out theory from implementation.  This is the theory post, with implementation coming next week.

Related Posts

Classifying Texts With Naive Bayes

I continue my series on Naive Bayes with another hand-calculation post: Step two is, on the surface, pretty tough: how do we figure out if a set of words is a business phrase or a baseball phrase? We could try to think up a set of features. For example, how long is the phrase? How many unique […]

Read More

Where Machine Learning And Econometrics Collide

Dave Giles shares some thoughts on how machine learning and econometrics relate: What is Machine Learning (ML), and how does it differ from Statistics (and hence, implicitly, from Econometrics)? Those are big questions, but I think that they’re ones that econometricians should be thinking about. And if I were starting out in Econometrics today, I’d […]

Read More

Categories

March 2018
MTWTFSS
« Feb Apr »
 1234
567891011
12131415161718
19202122232425
262728293031