Everyone’s Data Is Dirty

Kevin Feasel

2017-11-16

Data

Chirag Shivalker hits the highlights on dirty data:

It might sound a bit abrupt, but clean data is a myth. If your data is dirty, so is everyone else’s. Enterprises are more than dependent on data these days, and it is going to stay the same in coming years. They need to collect data in order to analyze it, which necessarily will not be 100% clean, pristine, or perfect in nature.

Nearly all companies face the challenge of dirty data in the form of a lot of duplicates, incorrect fields, and missing values. This happens due to omnichannel data influx, followed by hundreds, if not thousands, of employees wrestling and torturing that data to derive professional outcomes and insights. Don’t forget that even the best of the data has that tendency to decay in few weeks.

The saying goes that any analytics project is about 80% data cleansing and feature extraction. I’d say that number’s probably closer to 90-95%, and dirty data is a big part of that.
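
To make the three problems named in the quote concrete (duplicates, incorrect fields, and missing values), here is a minimal profiling sketch in plain Python. It is not from the linked article. The function name `profile_rows`, the sample rows, and the validation rules are all hypothetical, chosen only for illustration, and a real project would more likely do this in SQL or a data frame library.

```python
from collections import Counter


def is_blank(value):
    """Treat None and empty or whitespace-only values as missing."""
    return value is None or str(value).strip() == ""


def profile_rows(rows, required, validators):
    """Count duplicate rows, missing required values, and invalid fields.

    Hypothetical helper: `required` is a list of field names and
    `validators` maps a field name to a function returning True if valid.
    """
    # Duplicates: rows whose full content repeats an earlier row.
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in seen.values())

    # Missing: required fields that are absent or blank.
    missing = sum(1 for r in rows for f in required if is_blank(r.get(f)))

    # Invalid: non-blank values that fail the field's validator.
    invalid = sum(
        1
        for r in rows
        for f, ok in validators.items()
        if not is_blank(r.get(f)) and not ok(r[f])
    )
    return {"duplicates": duplicates, "missing": missing, "invalid": invalid}


# Made-up sample data, not taken from any real system.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 1, "email": "a@example.com", "age": 34},  # exact duplicate
    {"id": 2, "email": "", "age": 29},               # missing email
    {"id": 3, "email": "not-an-email", "age": -5},   # two invalid fields
]

report = profile_rows(
    rows,
    required=["id", "email", "age"],
    validators={
        "email": lambda v: "@" in v,       # deliberately crude check
        "age": lambda v: 0 <= v <= 120,    # assumed plausible range
    },
)
print(report)
```

Counting problems is the easy part. Deciding which duplicate to keep, or what an invalid age should become, is where most of the cleansing time tends to go.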
