Change Data Capture With Databricks Delta

Ameet Kini and Denny Lee show how to use Databricks Delta to handle change data capture from different processes:

With Databricks Delta, the CDC pipeline is now streamlined and can be refreshed more frequently: Informatica => S3 => Spark Hourly Batch Job => Delta. In this scenario, Informatica writes change sets directly to S3 using Informatica’s Parquet writer. Databricks jobs run at the desired sub-nightly refresh rate (e.g., every 15 min, hourly, every 3 hours, etc.) to read these change sets and update the target Databricks Delta table.

With minor changes, this pipeline has also been adapted to read CDC records from Kafka, so the pipeline there would look like Kafka => Spark => Delta. In the rest of this section, we elaborate on this process, and how we use Databricks Delta as a sink for their CDC workflows.

With one of our customers, we implemented these CDC techniques on their largest and most frequently refreshed ETL pipeline. In this customer scenario, Informatica writes a change set to S3 for each of its 65 tables that have any changes every 15 minutes.   While the change sets themselves are fairly small (< 1000 records), their target tables can become quite large. Out of the 65 tables, roughly half a dozen are in the 50m-100m row range, and the rest are smaller than 50m. In Oracle, this pipeline would have run every 15 minutes, keeping in sync with Informatica. In Databricks Delta, we thought this would take close to an hour due to S3 latencies but ended up being pleasantly surprised with a 30 and even 15-minute refresh period depending on cluster size.

Click through for the rest of the story.

Related Posts

Machine Learning and Delta Lake

Brenner Heintz and Denny Lee walk us through solving data engineering problems with Delta Lake: As a result, companies tend to have a lot of raw, unstructured data that they’ve collected from various sources sitting stagnant in data lakes. Without a way to reliably combine historical data with real-time streaming data, and add structure to […]

Read More

Cloudera Stream Processing

Dinesh Chandrasekhar announces the new iteration of Cloudera’s streaming data processor: Cloudera Stream Processing (CSP) is a product within the Cloudera DataFlow platform that packs Kafka along with some key streaming components that empower enterprises to handle some of the most complex and sophisticated streaming use cases. CSP provides advanced messaging, real-time processing and analytics on […]

Read More

Categories

October 2018
MTWTFSS
« Sep Nov »
1234567
891011121314
15161718192021
22232425262728
293031