The Process Of Processing Data

I continue my series on launching a data science project:

This next category of data cleansing has to do with specific values.  I want to look at three particular sub-categories:  mislabeled data, mismatched data, and incorrect data.

Mislabeled data happens when the label is incorrect.  In a data science problem, the label is the thing that we are trying to explain or predict.  For example, in our data set, we want to predict SalaryUSD based on various inputs.  If somebody earns $50,000 per year but accidentally types 500000 instead of 50000, it can potentially affect our analysis.  If you can fix the label, this data becomes useful again, but if you cannot, it increases the error, which means we have a marginally lower capability for accurate prediction.

Mismatched data happens when we join together data from sources which should not have been joined together.  Let’s go back to the product title and UPC/MFC example.  As we fuss with the data to try to join together these two data sets, we might accidentally write a rule which joins a product + UPC to the wrong product + MFC.  We might be able to notice this with careful observation, but if we let it through, then we will once again thwart reality and introduce some additional error into our analysis.  We could also end up with the opposite problem, where we have missed connections and potentially drop useful data out of our sample.

Finally, I’m calling incorrect data where something other than the label is wrong.  For example, in the data professional salary survey, there’s a person who works 200 hours per week.  While I admire this person’s dedication and ability to create 1.25 extra days per week that the rest of us don’t experience, I think that person should have held out for more than just $95K/year.  I mean, if I had the ability to generate spare days, I’d want way more than that.

In this series, I’ve found myself writing a bit more than expected, so I’m breaking out theory from implementation.  This is the theory post, with implementation coming next week.

Related Posts

Using word2vec To Model User Behavior

Nishan Subedi walks us through an Etsy project to model user journeys via semantic embedding techniques: We initially started training the embeddings as a Skip-gram model with negative sampling (NEG as outlined in the original word2vec paper) method. The Skip-gram model performs better than the Continuous Bag Of Words (CBOW) model for larger vocabularies. It […]

Read More

Testing Spatial Equilibrium Concepts With tidycensus

Ignacio Sarmiento Barbieri walks us through the concept of spatial equilibrium and tests using data from the tidycensus package: Let’s take the model to the data and reproduce figures 2.1. and 2.2 of “Cities, Agglomeration, and Spatial Equilibrium”. The focus are two cities, Chicago and Boston. These cities are chosen because both differ in how easy […]

Read More

Categories

March 2018
MTWTFSS
« Feb Apr »
 1234
567891011
12131415161718
19202122232425
262728293031