Everyone’s Data Is Dirty

Kevin Feasel

2017-11-16

Data

Chirag Shivalker hits the highlights on dirty data:

It might sound a bit abrupt, but clean data is a myth. If your data is dirty, so is everyone else’s. Enterprises are more than dependent on data these days, and it is going to stay the same in coming years. They need to collect data in order to analyze it, which necessarily will not be 100% clean, pristine, or perfect in nature.

Nearly all companies face the challenge of dirty data in the form of a lot of duplicates, incorrect fields, and missing values. This happens due to omnichannel data influx, followed by hundreds, if not thousands, of employees wrestling and torturing that data to derive professional outcomes and insights. Don’t forget that even the best of the data has that tendency to decay in few weeks.

The saying goes that any analytics project is about 80% data cleansing and feature extraction.  I’d say that number’s probably closer to 90-95%, and dirty data is a big part of that.
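To make the three problem types named above concrete (duplicates, incorrect fields, missing values), here is a minimal profiling sketch in Python. It is only an illustration, not anything from the linked article: the function name `profile_rows`, the sample rows, and the validation rules are all hypothetical, and real cleansing work usually involves fuzzy matching and domain rules well beyond this.

```python
from collections import Counter


def profile_rows(rows, required, validators):
    """Count exact duplicate rows, missing required values, and invalid fields.

    All names here are hypothetical, chosen for illustration only.
    """
    # Exact duplicates: rows whose full set of (key, value) pairs repeats.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # Missing values: required fields that are absent, None, or blank.
    missing = sum(
        1
        for row in rows
        for field in required
        if row.get(field) is None or str(row.get(field)).strip() == ""
    )

    # Incorrect fields: values that are present but fail their validator.
    invalid = sum(
        1
        for row in rows
        for field, is_valid in validators.items()
        if row.get(field) not in (None, "") and not is_valid(row[field])
    )
    return {"duplicates": duplicates, "missing": missing, "invalid": invalid}


# Made-up sample data containing one of each problem type.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 1, "email": "a@example.com", "age": 34},  # exact duplicate
    {"id": 2, "email": "", "age": 29},               # missing email
    {"id": 3, "email": "not-an-email", "age": -5},   # two invalid fields
]

report = profile_rows(
    rows,
    required=["id", "email", "age"],
    validators={
        "email": lambda v: "@" in v,       # assumed, deliberately crude rule
        "age": lambda v: 0 <= v <= 120,    # assumed plausible range
    },
)
print(report)  # {'duplicates': 1, 'missing': 1, 'invalid': 2}
```

Even a crude count like this gives a rough sense of how much of a project will go to cleanup before any analysis starts.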

