ETL / ELT – Page 20 – Curated SQL

Creating an Azure Integration Runtime

Published 2022-02-17 by Kevin Feasel

Andy Leonard builds out an Azure Integration Runtime:

Many Azure Data Factory developers recommend creating an Azure Integration Runtime for use with Mapping Data Flows. Why? One reason is you cannot configure all the options in the default AutoResolveIntegrationRuntime supplied when an Azure Data Factory instance is provisioned.
At the time of this writing, it’s not obvious how one creates an Azure Integration Runtime. You would think creating an integration runtime would begin with:

It turns out to be a little trickier than you might first expect.


From Cosmos DB to Dedicated SQL Pools via Synapse Link

Published 2022-02-04 by Kevin Feasel

Jovan Popovic shows off Azure Synapse Link:

At the time of writing this article, the dedicated SQL pool doesn’t have the ability to read data from CosmosDB/Dataverse using Synapse Link. There are scenarios where you would need to use CosmosDB data in your dedicated SQL pool, so you would need to find a way to load that data. In theory, you could create an ADF pipeline that reads data from CosmosDB or Dataverse and stores it in the dedicated SQL pool as a target. This might be a problem if your pipeline reads data directly from the CosmosDB account, because it might impact both operational workload performance and cost. The analytical storage is the recommended location from which to fetch all data from CosmosDB/Dataverse.
In this post, I will describe how to use a two-step approach where you export your data using the serverless SQL pool via Synapse Link into Azure Data Lake Storage, and then load the data into the dedicated SQL pool table. This process is shown in the following figure:

A couple of weeks back, I wrote about another method of doing this through the Spark pool. Now you have two options.


Simple Mapping Data Flows in Synapse

Published 2022-01-18 by Kevin Feasel

Joshuha Owen announces a new feature:

This week, we are excited to announce the public preview for Map Data, a new feature for Azure Synapse Analytics and Database Templates! The Map Data tool is a guided process to help users create ETL mappings and mapping data flows from their source data to Synapse lake database tables without writing code. This experience will help you get started quickly with transformations into your Synapse lake database but still give you the power of Mapping Data Flows.
This process starts with the user choosing the destination tables in Synapse lake databases and then mapping their source data into these tables. We will be following up with a demo video shortly.

Click through for more details on how it works.


Streaming Data to Event Hubs via Kafka Connect and Debezium

Published 2022-01-11 by Kevin Feasel

Niels Berglund starts off a two-part sub-series within a series:

This post is the first of two looking at if and how we can stream data to Event Hubs from Debezium. Initially I had planned only one post covering this, but it turned out that the post would be too long, so I split it in two.
It started with the post, How to Use Kafka Client with Azure Event Hubs. In that post, I looked at how the Kafka client can publish messages not only to Apache Kafka but also to Azure Event Hubs. In the post, I said something like:
An interesting point here is that it is not only your Kafka applications that can publish to Event Hubs but any application that uses Kafka Client 1.0+, like Kafka Connect connectors!

Click through for the first part of this pairing.
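
As a rough illustration of the point about Kafka clients talking to Event Hubs, here is a minimal sketch of the client settings involved. It is not taken from Niels's post. The function name, the namespace, and the placeholder connection string are hypothetical; the keys are written in the librdkafka style used by clients such as confluent-kafka, and a Kafka Connect worker would express the same ideas in its own properties format.

```python
def event_hubs_kafka_config(namespace: str, connection_string: str) -> dict:
    """Build librdkafka-style client settings for the Event Hubs Kafka endpoint.

    Hypothetical helper for illustration only.
    """
    return {
        # Event Hubs exposes its Kafka-compatible endpoint on port 9093.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        # The endpoint requires TLS plus SASL PLAIN authentication.
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # The user name is the literal string "$ConnectionString";
        # the password is the Event Hubs connection string itself.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


# Hypothetical namespace and placeholder secret, not real values.
config = event_hubs_kafka_config(
    "my-namespace", "<event-hubs-connection-string>"
)
print(config["bootstrap.servers"])
```

The resulting dictionary could be handed to a Kafka producer in place of the usual broker settings, which is the sense in which an existing Kafka application can publish to Event Hubs without code changes beyond configuration.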


Azure Data Factory Activity Queue Times

Published 2022-01-06 by Kevin Feasel

Meagan Longoria waits in line:

I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.
Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.

Click through to see how you can calculate queue times across activities, pipelines, and data factories.
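
To make the idea of queue time concrete, here is a small sketch that treats it as the gap between an activity run being queued and the moment it starts executing. This is not Meagan's method, which lives in her post. The row layout of (run id, status, timestamp) and the sample values are assumptions made up for the example, not the schema of any particular ADF log table.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical status-change rows: (activity_run_id, status, timestamp).
events = [
    ("run-1", "Queued", datetime(2022, 1, 6, 9, 0, 0)),
    ("run-1", "InProgress", datetime(2022, 1, 6, 9, 0, 12)),
    ("run-1", "Succeeded", datetime(2022, 1, 6, 9, 1, 30)),
    ("run-2", "Queued", datetime(2022, 1, 6, 9, 0, 5)),
    ("run-2", "InProgress", datetime(2022, 1, 6, 9, 0, 50)),
    ("run-2", "Succeeded", datetime(2022, 1, 6, 9, 2, 0)),
]


def queue_seconds(rows):
    """Return seconds spent queued per run: first InProgress minus first Queued."""
    first_seen = defaultdict(dict)
    for run_id, status, ts in rows:
        seen = first_seen[run_id]
        # Keep only the earliest timestamp for each status of a run.
        if status not in seen or ts < seen[status]:
            seen[status] = ts

    result = {}
    for run_id, seen in first_seen.items():
        # Skip runs that have not left the queue yet.
        if "Queued" in seen and "InProgress" in seen:
            delta = seen["InProgress"] - seen["Queued"]
            result[run_id] = delta.total_seconds()
    return result


queue_times = queue_seconds(events)
print(queue_times)  # {'run-1': 12.0, 'run-2': 45.0}
```

Aggregating these per-run values by pipeline or by data factory would then be a matter of grouping on whatever identifiers the log source provides.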


Changing the Connected Git Repo in ADF

Published 2021-12-30 by Kevin Feasel

Meagan Longoria finds an oddity with Azure Data Factory:

When I arrived at the Git configuration page, I found the Disconnect button to be disabled. This was confusing as I am an Owner and Data Factory Contributor on this resource.

Read on to see how Meagan was able to fix this issue, and also the underlying cause of the problem.


Preventing Concurrent Pipeline Execution in Azure Data Factory

Published 2021-12-15 by Kevin Feasel

Dave Ruijter and Laura de Bruin want to prevent concurrent runs of a pipeline:

For scheduled triggers, there is nothing out-of-the-box that can help you to prevent concurrent pipeline runs. For tumbling window triggers there is a maxConcurrency property, but keep in mind that this will create a queue/backlog of pipeline runs. It will not cancel any pipeline runs. Whether you really want that behavior depends on your use case.

Instead, the two look at a pair of designs, and this post is all about the first one.
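
The queue/backlog behavior described in the quote can be pictured with a toy simulation. This is not how Data Factory schedules work internally and it is not one of the designs from the post; it is a sketch of a generic concurrency-limited queue, assuming runs start in first-in, first-out order and every run takes the same time. The function name, parameter names, and sample numbers are all hypothetical.

```python
import heapq


def simulate_backlog(ready_times, duration, max_concurrency):
    """Return (ready, start) pairs when at most max_concurrency runs overlap.

    Times are plain numbers (think minutes). No run is ever cancelled;
    a run that cannot start immediately waits for the next free slot.
    """
    running = []  # min-heap of finish times for runs currently executing
    schedule = []
    for ready in sorted(ready_times):
        # Release slots held by runs that have finished by now.
        while running and running[0] <= ready:
            heapq.heappop(running)
        if len(running) < max_concurrency:
            start = ready  # a slot is free, start right away
        else:
            # All slots are busy: wait for the earliest run to finish.
            start = heapq.heappop(running)
        heapq.heappush(running, start + duration)
        schedule.append((ready, start))
    return schedule


# Windows every 15 minutes, each run takes 40 minutes, one run at a time.
schedule = simulate_backlog([0, 15, 30, 45], duration=40, max_concurrency=1)
for ready, start in schedule:
    print(f"ready at {ready:>3}, starts at {start:>3}, waited {start - ready}")
```

With these sample numbers all four runs execute, but the waits grow from 0 to 25, 50, and 75 minutes, which is the growing backlog the quote warns about when runs take longer than the trigger interval.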


Deploying dbt on Databricks

Published 2021-12-09 by Kevin Feasel

Dave Eyler, et al, have a great announcement:

At Databricks, nothing makes us happier than making our users more productive, which is why we are delighted to announce a native adapter for dbt. It’s now easier than ever to develop robust data pipelines on Databricks using SQL.
dbt is a popular open source tool that lets a new breed of ‘analytics engineer’ build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple.

Click through for more information on how this works and how you can get the native adapter.


Building a Pipeline for External Data Sharing

Published 2021-12-09 by Kevin Feasel

Hope Foley has data to share:

I worked with a customer recently who had a need to share CSVs for an auditing situation. They had a lot of external customers that they needed to collect CSVs from for the audit process. There were a lot of discussions happening on how to best do it, whether we’d pull data from their environment or have them push them into theirs. Folks weren’t sure on that so I tried to come up with something that would work for both.

Read on for Hope’s solution to the problem.


Using the Fail Activity in Azure Data Factory

Published 2021-12-01 by Kevin Feasel

Rayis Imayev thinks about failure:

Recently, Microsoft introduced a new Fail activity (https://docs.microsoft.com/en-us/azure/data-factory/control-flow-fail-activity) in the Azure Data Factory (ADF) and I wondered about a reason to fail a pipeline in ADF when my internal being tries very hard to make the pipelines successful once and for all. Yes, I understand a documented explanation that this activity can help to “customize both its error message and error code”, but why?

Click through for Rayis’s take. I’ll just be here cracking jokes about how Fail activities are banned in my code because I expect it to have a positive outlook on life.

