Data Engineering Remains As Important As Ever

Prashanth Southekal has good news for ETL developers:

While many companies have embarked on data analytics initiatives, only a few have been successful. Studies have shown that over 70% of data analytics programs fail to realize their full potential and over 80% of the digital transformation initiatives fail. While there are many reasons that affect successful deployment of data analytics, one fundamental reason is lack of good quality data. However, many business enterprises realize this and invest considerable time and effort in data cleansing and remediation; technically known as data engineering. It is estimated that about 60 to 70% of the effort in data analytics is on data engineering. Given that data quality is an essential requirement for analytics, there are 5 key reasons on why data analytics is heavy on data engineering.

1.Different systems and technology mechanisms to integrate data.

Business systems are designed and implemented for a purpose; mainly for recording business transactions. The mechanisms for data capture in Business systems such as ERP is batch/discrete data while in the SCADA/IoT Field Systems it is for continuous/time-series data. This means that these business systems store diverse data types caused by the velocity, volume, and variety dimensions in the data. Hence the technology (including the database itself) to capture data is varied and complex.  And when you are trying to integrate data from these diverse systems from different vendors, the metadata model varies resulting in data integration challenges.

That 60-70% on data engineering is probably a moderate underestimate.

Related Posts

Monte Carlo Simulation in Python

Kristian Larsen has a couple of posts on Monte Carlo style simulation in Python. First up is a post which covers how to generate data from different distributions: One method that is very useful for data scientist/data analysts in order to validate methods or data is Monte Carlo simulation. In this article, you learn how […]

Read More

An Example of p-Hacking

Vincent Granville explains why using p-values for model-worthiness can lead you to a bad outcome: Recently, p-values have been criticized and even banned by some journals, because they are used by researchers, who cherry-pick observations and repeat experiments until they obtain a p-value worth publishing to obtain grant money, get tenure, or for political reasons.  Even the […]

Read More


July 2018
« Jun Aug »