Press "Enter" to skip to content

Category: ETL / ELT

Executing SQL Statements in Azure Data Factory

Abhishek Narain announces a pretty nice improvement to Azure Data Factory and Synapse Pipelines:

We are introducing a Script activity in pipelines that provides the ability to execute single or multiple SQL statements.

Using the Script activity, you can execute common operations with Data Manipulation Language (DML) and Data Definition Language (DDL). DML statements like SELECT, INSERT, UPDATE, and DELETE let users retrieve, insert, modify, and delete data in the database. DDL statements like CREATE, ALTER, and DROP allow a database manager to create, modify, and remove database objects such as tables, indexes, and users.

Be sure to read the limitations at the bottom, however.
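To make the DML/DDL distinction concrete, here's a minimal sketch of the kinds of statements such an activity might run. I'm executing them via pyodbc purely for illustration (a Script activity would go through a linked service instead), and the server, database, and table names are all made up:

```python
import pyodbc

# Hypothetical connection; a Script activity would instead reach the
# database through a linked service.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
cursor = conn.cursor()

# DDL: create a staging table.
cursor.execute("""
    CREATE TABLE dbo.StagingOrders (
        OrderId INT NOT NULL,
        Amount  DECIMAL(18, 2) NOT NULL
    );
""")

# DML: insert a row, then read it back.
cursor.execute("INSERT INTO dbo.StagingOrders VALUES (1, 42.50);")
cursor.execute("SELECT OrderId, Amount FROM dbo.StagingOrders;")
for row in cursor.fetchall():
    print(row.OrderId, row.Amount)

conn.close()
```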


Creating an Azure Integration Runtime

Andy Leonard builds out an Azure Integration Runtime:

Many Azure Data Factory developers recommend creating an Azure Integration Runtime for use with Mapping Data Flows. Why? One reason is you cannot configure all the options in the default AutoResolveIntegrationRuntime supplied when an Azure Data Factory instance is provisioned.

At the time of this writing, it’s not obvious how one creates an Azure Integration Runtime. You would think creating an integration runtime would begin with…

It turns out to be a little trickier than you might first expect.
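If you'd rather script it than hunt through the portal, here's a sketch using the azure-mgmt-datafactory Python SDK. The resource names are placeholders and the compute settings are illustrative defaults, so treat this as a starting point rather than Andy's method:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

# Placeholder subscription ID.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# An Azure (managed) IR with data flow compute properties, which is what
# the default AutoResolveIntegrationRuntime won't let you fully configure.
ir = IntegrationRuntimeResource(
    properties=ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            location="AutoResolve",
            data_flow_properties=IntegrationRuntimeDataFlowProperties(
                compute_type="General",
                core_count=8,
                time_to_live=10,  # minutes the Spark cluster stays warm
            ),
        )
    )
)

# Positional args: resource group, factory name, IR name, IR definition.
client.integration_runtimes.create_or_update(
    "rg-adf", "my-data-factory", "AzureIR-DataFlows", ir
)
```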


From Cosmos DB to Dedicated SQL Pools via Synapse Link

Jovan Popovic shows off Azure Synapse Link:

At the time of writing this article, the dedicated SQL pool doesn’t have the ability to read data from CosmosDB/Dataverse using the Synapse link. There are scenarios where you would need to use CosmosDB data in your dedicated SQL pool, so you need to find another way to load the data. In theory, you could create an ADF pipeline that reads data from CosmosDB or Dataverse and stores it in the dedicated SQL pool as a target. This might be a problem if your pipeline reads directly from the CosmosDB account, because it might impact both operational workload performance and cost. The analytical store is the recommended location from which to fetch all data from CosmosDB/Dataverse.

In this post, I will describe how to use a two-step approach where you export your data using the serverless SQL pool via Synapse link into Azure Data Lake storage, and then load the data into the dedicated SQL pool table. This process is shown in a figure in the original post.

A couple of weeks back, I wrote about another method of doing this through the Spark pool. Now you have two options.
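To make the two steps concrete, here's a rough sketch: a CETAS statement on the serverless SQL pool exports the Cosmos DB analytical store to the lake, and a COPY INTO statement on the dedicated SQL pool loads the resulting Parquet files. This is not Jovan's exact code; every name (workspace, account, external data source, file format, tables) is a placeholder, and the external objects are assumed to already exist:

```python
import pyodbc

# Step 1: on the serverless SQL pool, export the Cosmos DB analytical
# store to the data lake with CETAS. The external data source (MyDataLake)
# and file format (ParquetFormat) are assumed to exist already.
serverless = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=mydb;Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
serverless.execute("""
    CREATE EXTERNAL TABLE exported.Orders
    WITH (LOCATION = 'orders/', DATA_SOURCE = MyDataLake, FILE_FORMAT = ParquetFormat)
    AS SELECT *
    FROM OPENROWSET(
        'CosmosDB',
        'Account=myaccount;Database=mydb;Key=<account-key>',
        Orders
    ) AS docs;
""")

# Step 2: on the dedicated SQL pool, load the exported Parquet files with
# COPY INTO (dbo.Orders must already exist with a matching schema).
dedicated = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;"
    "Database=mypool;Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
dedicated.execute("""
    COPY INTO dbo.Orders
    FROM 'https://mystorage.dfs.core.windows.net/export/orders/*.parquet'
    WITH (FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity'));
""")
```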


Simple Mapping Data Flows in Synapse

Joshuha Owen announces a new feature:

This week, we are excited to announce the public preview for Map Data, a new feature for Azure Synapse Analytics and Database Templates! The Map Data tool is a guided process to help users create ETL mappings and mapping data flows from their source data to Synapse lake database tables without writing code. This experience will help you get started with transformations into your Synapse lake database quickly but still give you the power of Mapping Data Flows.

This process starts with the user choosing the destination tables in Synapse lake databases and then mapping their source data into these tables. We will be following up with a demo video shortly.

Click through for more details on how it works.


Streaming Data to Event Hubs via Kafka Connect and Debezium

Niels Berglund starts off a two-part sub-series within a series:

This post is the first of two looking at whether and how we can stream data to Event Hubs from Debezium. Initially I had planned only one post covering this, but it turned out that the post would be too long, so I split it in two.

It started with the post, How to Use Kafka Client with Azure Event Hubs. In that post, I looked at how the Kafka client can publish messages not only to Apache Kafka but also to Azure Event Hubs. In the post, I said something like:

An interesting point here is that it is not only your Kafka applications that can publish to Event Hubs but any application that uses Kafka Client 1.0+, like Kafka Connect connectors!

Click through for the first part of this pairing.
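To give a flavor of that point, here's a minimal sketch of a Kafka client publishing to Event Hubs rather than to a Kafka broker. Event Hubs exposes a Kafka-compatible endpoint on port 9093 and accepts SASL PLAIN with the namespace connection string; the namespace, event hub (topic) name, and connection string below are placeholders:

```python
from confluent_kafka import Producer

conf = {
    # Event Hubs' Kafka-compatible endpoint.
    "bootstrap.servers": "my-namespace.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    # The username is literally "$ConnectionString"; the password is the
    # Event Hubs namespace connection string.
    "sasl.username": "$ConnectionString",
    "sasl.password": "Endpoint=sb://my-namespace.servicebus.windows.net/;<rest-of-connection-string>",
}

producer = Producer(conf)
# The event hub plays the role of the Kafka topic.
producer.produce("my-event-hub", value=b'{"id": 1, "op": "insert"}')
producer.flush()
```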


Azure Data Factory Activity Queue Times

Meagan Longoria waits in line:

I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.

Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.

Click through to see how you can calculate queue times across activities, pipelines, and data factories.
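If you'd rather pull those numbers programmatically than read them off the monitoring UI, one option (not necessarily Meagan's method) is to query activity runs with the azure-mgmt-datafactory SDK; for Copy activities, a queuing duration appears in the activity output. The subscription, resource names, and run ID below are placeholders:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

now = datetime.now(timezone.utc)
filter_params = RunFilterParameters(
    last_updated_after=now - timedelta(hours=24),
    last_updated_before=now,
)

# Positional args: resource group, factory name, pipeline run ID, filter.
runs = client.activity_runs.query_by_pipeline_run(
    "rg-adf", "my-data-factory", "<pipeline-run-id>", filter_params
)

for activity in runs.value:
    output = activity.output or {}
    # Copy activity output carries detailed durations, including the
    # time spent queuing; other activity types may not populate this.
    details = (output.get("executionDetails") or [{}])[0]
    durations = details.get("detailedDurations", {})
    print(activity.activity_name, durations.get("queuingDuration"))
```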


Preventing Concurrent Pipeline Execution in Azure Data Factory

Dave Ruijter and Laura de Bruin want to prevent concurrent runs of a pipeline:

For scheduled triggers, there is nothing out of the box that can help you prevent concurrent pipeline runs. For tumbling window triggers, there is a maxConcurrency property, but keep in mind that this will create a queue/backlog of pipeline runs; it will not cancel any pipeline runs. Whether you really want that behavior depends on your use case.

Instead, the two look at a pair of designs; this post is all about the first one.
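As a hedged sketch of how such a guard might look from the outside (this may well differ from the authors' design), you can query the factory for in-progress runs of the same pipeline via the Python SDK and stop if one already exists. All names below are placeholders:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters, RunQueryFilter

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

now = datetime.now(timezone.utc)
# Positional args: resource group, factory name, filter parameters.
response = client.pipeline_runs.query_by_factory(
    "rg-adf",
    "my-data-factory",
    RunFilterParameters(
        last_updated_after=now - timedelta(days=1),
        last_updated_before=now,
        filters=[
            RunQueryFilter(operand="PipelineName", operator="Equals",
                           values=["MyPipeline"]),
            RunQueryFilter(operand="Status", operator="Equals",
                           values=["InProgress"]),
        ],
    ),
)

# Ignore the run doing the checking; if anything else is in flight, bail.
in_progress = [r for r in response.value if r.run_id != "<current-run-id>"]
if in_progress:
    raise SystemExit("Another run of MyPipeline is already in progress.")
```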


Deploying dbt on Databricks

Dave Eyler, et al., have a great announcement:

At Databricks, nothing makes us happier than making our users more productive, which is why we are delighted to announce a native adapter for dbt. It’s now easier than ever to develop robust data pipelines on Databricks using SQL.

dbt is a popular open source tool that lets a new breed of ‘analytics engineer’ build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple.

Click through for more information on how this works and how you can get the native adapter.
