To the junior data scientist, it looks like each new dataset comes with a new set of challenges. It seems that you can not automate data cleaning. To decisions makers and stakeholders, this problem is so remote to them that they don’t even know the amount of resources wasted on this. To them, it seems obvious that automation is the way to go, but they may underestimate the challenges. It is usually not a high priority in many organizations, despite how much money it costs.
Yet, there are at most a few dozens of issues that come with data cleaning. Not a few thousands, not a few hundreds. You can catalog them and address all of them at once with a piece of code. One that you can reuse each time when you face a new data set. I describe here the main issues and how to address them. Automating the data cleaning step can save you a lot of time, and eliminate boring, repetitive tasks to make your data scientists happier.
Click through for Vincent’s thoughts and recommendations.