It’s All ETL (Or ELT) In The End

Robin Moffatt notes that ETL (and ELT) doesn’t go away in a streaming world:

In the past we used ETL techniques purely within the data-warehousing and analytic space. But, if one considers why and what ETL is doing, it is actually a lot more applicable as a broader concept.

  • Extract: Data is available from a source system
  • Transform: We want to filter, cleanse or otherwise enrich this source data
  • Load: Make the data available to another application

There are two key concepts here:

  • Data is created by an application, and we want it to be available to other applications
  • We often want to process the data (for example, cleanse and apply business logic to it) before it is used

Thinking about many applications being built nowadays, particularly in the microservices and event-driven space, we recognize that what they do is take data from one or more systems, manipulate it and then pass it on to another application or system. For example, a fraud detection service will take data from merchant transactions, apply a fraud detection model and write the results to a store such as Elasticsearch for review by an expert. Can you spot the similarity to the above outline? Is this a microservice or ETL process?

Things like this are reason #1 why I expect data platform jobs (administrator and developer) to be around decades from now.  The set of tools expand, but the nature of the job remains similar.

Related Posts

Generating Load For Kafka With JMeter

Anup Shirolkar shows us a way to use JMeter to generate load for Apache Kafka clusters: The Anomalia Machina is going to require (at least!) one more thing as stated in the intro, loading with lots of data! Kafka is a log aggregation system and operates on a publish-subscribe mechanism. The Kafka cluster in Anomalia Machina […]

Read More

Data Science And Data Engineering In HDP 3.0

Saumitra Buragohain, et al, show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers: We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different […]

Read More

Categories

September 2018
MTWTFSS
« Aug Oct »
 12
3456789
10111213141516
17181920212223
24252627282930