Press "Enter" to skip to content

Category: ETL / ELT

Streaming Data to Event Hubs via Kafka Connect and Debezium

Niels Berglund starts off a two-part sub-series within a series:

This post is the first of two looking at whether and how we can stream data to Event Hubs from Debezium. Initially I had planned only one post covering this, but it turned out that the post would be too long, so I split it in two.

It started with the post How to Use Kafka Client with Azure Event Hubs. In that post, I looked at how the Kafka client can publish messages not only to Apache Kafka but also to Azure Event Hubs. There, I said something like:

An interesting point here is that it is not only your Kafka applications that can publish to Event Hubs but any application that uses Kafka Client 1.0+, like Kafka Connect connectors!

Click through for the first part of this pairing.
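
To make that point concrete, here is a minimal sketch (not from Niels's post) of a stock Kafka client publishing to an Event Hubs namespace over its Kafka endpoint; the namespace, hub name, and connection string are placeholders.

```python
# Any Kafka client can publish to Event Hubs by pointing at the namespace's
# Kafka endpoint on port 9093. Namespace and connection string are placeholders.
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="my-namespace.servicebus.windows.net:9093",  # hypothetical namespace
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="$ConnectionString",  # literal value expected by Event Hubs
    sasl_plain_password="Endpoint=sb://my-namespace.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
)

# The event hub name takes the place of the Kafka topic.
producer.send("my-event-hub", b'{"id": 1, "op": "insert"}')
producer.flush()
```

The only Event Hubs-specific pieces are the SASL settings; everything else is plain Kafka client configuration, which is exactly why Kafka Connect connectors such as Debezium can ride along.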


Azure Data Factory Activity Queue Times

Meagan Longoria waits in line:

I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.

Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.

Click through to see how you can calculate queue times across activities, pipelines, and data factories.
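
If you want to pull the raw numbers yourself, here is a rough sketch (not Meagan's approach) that queries activity runs for a pipeline run via the azure-mgmt-datafactory SDK. It assumes the Copy activity's output exposes a queuing duration under executionDetails/detailedDurations (other activity types may differ), and every resource name is a placeholder.

```python
# Sketch: pull activity runs for one pipeline run and report queue time.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

runs = client.activity_runs.query_by_pipeline_run(
    resource_group_name="rg-etl",       # placeholder
    factory_name="adf-ods",             # placeholder
    run_id="<pipeline-run-id>",
    filter_parameters=RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(hours=1),
        last_updated_before=datetime.utcnow(),
    ),
)

for a in runs.value:
    # Assumption: Copy activity output carries queuingDuration (in seconds)
    # under executionDetails/detailedDurations.
    details = (a.output or {}).get("executionDetails", [])
    queue_sec = sum(d.get("detailedDurations", {}).get("queuingDuration", 0) for d in details)
    print(a.activity_name, a.status, "queued:", queue_sec, "s")
```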


Preventing Concurrent Pipeline Execution in Azure Data Factory

Dave Ruijter and Laura de Bruin want to prevent concurrent runs of a pipeline:

For scheduled triggers, there is nothing out of the box that can help you prevent concurrent pipeline runs. For tumbling window triggers, there is a maxConcurrency property, but keep in mind that this will create a queue/backlog of pipeline runs; it will not cancel any of them. Whether you really want that behavior depends on your use case.

Instead, the two look at a pair of designs, and this post is all about the first one.
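
As a flavor of what such a guard can look like (not necessarily the design the authors chose), here is a sketch that queries the factory for in-progress runs of the same pipeline and only triggers a new run when none are found. The resource names are placeholders and it assumes the azure-mgmt-datafactory SDK.

```python
# Sketch: skip triggering a pipeline if a run of it is already in progress.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters, RunQueryFilter

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory, pipeline = "rg-etl", "adf-ods", "PL_Load_ODS"   # placeholders

in_progress = client.pipeline_runs.query_by_factory(
    rg, factory,
    RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=1),
        last_updated_before=datetime.utcnow(),
        filters=[
            RunQueryFilter(operand="PipelineName", operator="Equals", values=[pipeline]),
            RunQueryFilter(operand="Status", operator="Equals", values=["InProgress"]),
        ],
    ),
).value

if in_progress:
    print(f"{pipeline} already has {len(in_progress)} run(s) in progress; skipping.")
else:
    run = client.pipelines.create_run(rg, factory, pipeline)
    print("Started run", run.run_id)
```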


Deploying dbt on Databricks

Dave Eyler, et al., have a great announcement:

At Databricks, nothing makes us happier than making our users more productive, which is why we are delighted to announce a native adapter for dbt. It’s now easier than ever to develop robust data pipelines on Databricks using SQL.

dbt is a popular open source tool that lets a new breed of ‘analytics engineer’ build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple.

Click through for more information on how this works and how you can get the native adapter.
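
The workflow itself is just the usual dbt commands pointed at a Databricks target. A minimal sketch, assuming dbt-core plus the dbt-databricks adapter are installed and profiles.yml already defines a Databricks target; the project path and target name here are made up.

```python
# Sketch: install packages, build, and test a dbt project against a
# Databricks target (assumes `pip install dbt-databricks` and a configured
# profiles.yml; project path and target name are placeholders).
import subprocess

project_dir = "./my_dbt_project"   # hypothetical dbt project

for args in (["deps"], ["run", "--target", "databricks"], ["test", "--target", "databricks"]):
    subprocess.run(["dbt", *args, "--project-dir", project_dir], check=True)
```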


Building a Pipeline for External Data Sharing

Hope Foley has data to share:

I worked with a customer recently who needed to share CSVs for an auditing situation. They had a lot of external customers they needed to collect CSVs from for the audit process. There were a lot of discussions happening on how best to do it: whether we'd pull data from the external customers' environments or have them push the files into ours. Folks weren't sure on that, so I tried to come up with something that would work for both.

Read on for Hope’s solution to the problem.


Using the Fail Activity in Azure Data Factory

Rayis Imayev thinks about failure:

Recently, Microsoft introduced a new Fail activity (https://docs.microsoft.com/en-us/azure/data-factory/control-flow-fail-activity) in Azure Data Factory (ADF), and I wondered about a reason to fail a pipeline in ADF when my internal being tries very hard to make the pipelines successful once and for all. Yes, I understand the documented explanation that this activity can help to "customize both its error message and error code," but why?

Click through for Rayis’s take. I’ll just be here cracking jokes about how Fail activities are banned in my code because I expect it to have a positive outlook on life.
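
For reference, the activity itself is tiny: per the linked documentation, it is just a "Fail" type with a message and an errorCode in typeProperties. Here it is expressed as a Python dict you could drop into a pipeline definition; the activity names and the expression are invented for illustration.

```python
# Example shape of a Fail activity in a pipeline definition. The upstream
# activity name, error code, and expression are hypothetical.
fail_activity = {
    "name": "Fail_MissingSourceFiles",
    "type": "Fail",
    "dependsOn": [
        {"activity": "Get_Source_File_List", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "message": "@concat('No source files found for slice ', pipeline().parameters.WindowStart)",
        "errorCode": "ERR_NO_SOURCE_FILES",
    },
}
```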


Building an ETL Pipeline with Airflow and Containers

Nikita Vasilev needs to move some data:

Obviously, we can use one of the many ready-made ETL systems that handle loading information into the corporate data warehouse. Informatica PowerCenter, Oracle Data Integrator, SAP Data Services, Oracle Warehouse Builder, Talend Open Studio, and Pentaho are just a sliver of the off-the-shelf solutions. However, when it comes to large volumes of data at high speed, with Big Data infrastructure already in place, boxed solutions fall short of satisfying your needs.

Therefore, Big Data pipelines require something like Apache Airflow. It's an open-source set of libraries for developing, scheduling, and monitoring workflows. Airflow is written in Python and allows you to create and configure task chains both visually, through a clear web GUI, and programmatically, in Python code.

Click through for an example using Airflow with AWS’s Elastic Container Service.
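
The shape of the solution is roughly this (a pared-down sketch, not the article's full pipeline): a DAG whose task hands an ETL container off to ECS Fargate. The cluster, task definition, and subnet are placeholders, and the operator's class name has shifted across Amazon provider releases (ECSOperator, EcsOperator, and later EcsRunTaskOperator), so match it to your installed version.

```python
# Sketch: a daily Airflow DAG that runs an ETL step as an ECS Fargate task.
# All AWS resource names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsOperator

with DAG(
    dag_id="etl_containerized",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_and_load = EcsOperator(
        task_id="extract_and_load",
        cluster="etl-cluster",              # placeholder ECS cluster
        task_definition="etl-task:1",       # placeholder task definition
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "etl", "command": ["python", "etl.py", "--date", "{{ ds }}"]}
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {"subnets": ["subnet-0123456789abcdef0"]}
        },
    )
```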


ETL from an API

Bill Fellows unravels a bad practice:

The direction for software-as-a-service providers is to provide APIs to access their data instead of structured file exports. Which is a pity, as every SaaS system requires a bespoke data extraction solution. I inherited a solution that had an adverse pattern I'd like to talk about.

The next level of complexity, at least in the space Bill covers in the example, is what to do when the upstream provider changes its data, with some changes arriving as much as a week later.
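
One common mitigation for that kind of churn (a sketch, not Bill's solution) is to re-pull a trailing window on every run and upsert on the natural key, so rows revised days later still land. The endpoint, parameters, and key column below are hypothetical.

```python
# Sketch: re-extract a 7-day trailing window each run and upsert on the key,
# so late corrections from the API overwrite earlier versions of the same row.
from datetime import date, timedelta

import requests

LOOKBACK_DAYS = 7                                       # wide enough to cover late changes
API_URL = "https://api.example.com/v1/transactions"     # hypothetical endpoint

def extract(run_date: date) -> list[dict]:
    window_start = run_date - timedelta(days=LOOKBACK_DAYS)
    resp = requests.get(
        API_URL,
        params={"modified_since": window_start.isoformat()},  # hypothetical parameter
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

def load(rows: list[dict], table: dict[str, dict]) -> None:
    # Keyed on the natural key, so a re-pulled row replaces the prior version
    # instead of duplicating it; in a real pipeline this would be a MERGE
    # into the target table.
    for row in rows:
        table[row["transaction_id"]] = row

staging: dict[str, dict] = {}
load(extract(date.today()), staging)
```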


Updates to Azure Synapse Link

Aria Jelinek outlines the value of Azure Synapse Link:

New as of Ignite 2021, customers can optimize queries by setting custom partitions for their Azure Cosmos DB analytical store using keys that are commonly used as query filters. This compacts and optimizes the analytical data written to the partitioned store, resulting in better query performance even when working with a high volume of update or delete operations.

Azure Synapse Link is also now available for Azure Cosmos DB serverless accounts, expanding the integration to cover data from workloads with bursts of traffic or uncertain traffic patterns.

This post mostly covers the Dataverse and Cosmos DB integrations rather than the integration with SQL Server 2022.

On the whole, I like Azure Synapse Link for Cosmos DB and will probably like it for SQL Server 2022, maybe even a bit more. It does simplify the ELT process by taking care of the E and handling the first half of the L (landing into a staging table). Though if data's going into a dedicated SQL pool, I do hope the people doing this understand that dedicated SQL pools are intended for Kimball-style data warehousing scenarios and there can be a considerable performance (and therefore price) hit if you simply replicate a bunch of stuff without subsequent transformation.
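
For anyone who hasn't touched it, this is what the consumption side looks like from a Synapse Spark notebook (where spark is pre-defined): a minimal sketch reading the Cosmos DB analytical store that Synapse Link maintains, with the linked service, container, and table names as placeholders. The custom partition keys mentioned above are configured through additional options on a separate partitioning job, which this sketch doesn't attempt to reproduce.

```python
# Sketch: read the Cosmos DB analytical store via Synapse Link from a
# Synapse Spark notebook. Linked service and container names are placeholders.
df = (
    spark.read.format("cosmos.olap")
    .option("spark.synapse.linkedService", "CosmosDbLinkedService")  # placeholder
    .option("spark.cosmos.container", "Orders")                      # placeholder
    .load()
)

# Typical use: land the result into a staging table, i.e. the first half of the L.
df.filter("_ts >= 1636000000").write.mode("overwrite").saveAsTable("staging.orders_raw")
```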
