ETL / ELT – Page 17 – Curated SQL

Azure Data Factory Activity Queue Times

Published 2022-01-06 by Kevin Feasel

I’ve been working on a project to populate an Operational Data Store using Azure Data Factory (ADF). We have been seeking to tune our pipelines so we can import data every 15 minutes. After tuning the queries and adding useful indexes to target databases, we turned our attention to the ADF activity durations and queue times.
Data Factory places the pipeline activities into a queue, where they wait until they can be executed. If your queue time is long, it can mean that the Integration Runtime on which the activity is executing is waiting on resources (CPU, memory, networking, or otherwise), or that you need to increase the concurrent job limit.

Click through to see how you can calculate queue times across activities, pipelines, and data factories.

Comments closed

Changing the Connected Git Repo in ADF

Published 2021-12-30 by Kevin Feasel

Meagan Longoria finds an oddity with Azure Data Factory:

When I arrived at the Git configuration page, I found the Disconnect button to be disabled. This was confusing as I am an Owner and Data Factory Contributor on this resource.

Read on to see how Meagan was able to fix this issue, and also the underlying cause of the problem.

Comments closed

Preventing Concurrent Pipeline Execution in Azure Data Factory

Published 2021-12-15 by Kevin Feasel

Dave Ruijter and Laura de Bruin want to prevent concurrent runs of a pipeline:

For scheduled triggers, there is nothing out-of-the-box that can help you to prevent concurrent pipeline runs. For tumbling window triggers there is a maxConcurrency property, but keep in mind that this will create a queue/backlog of pipeline runs. It will not cancel any pipeline runs. It depends on your use case if you really want that behavior.

Instead, the two look at a pair of designs and this post is all about the first one.

Comments closed

Deploying dbt on Databricks

Published 2021-12-09 by Kevin Feasel

Dave Eyler, et al, have a great announcement:

At Databricks, nothing makes us happier than making our users more productive, which is why we are delighted to announce a native adapter for dbt. It’s now easier than ever to develop robust data pipelines on Databricks using SQL.
dbt is a popular open source tool that lets a new breed of ‘analytics engineer’ build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple.

Click through for more information on how this works and how you can get the native adapter.

Comments closed

Building a Pipeline for External Data Sharing

Published 2021-12-09 by Kevin Feasel

Hope Foley has data to share:

I worked with a customer recently who had a need to share CSVs for an auditing situation. They had a lot of external customers that they needed to collect CSVs from for the audit process. There were a lot of discussions happening on how to best do it, whether we’d pull data from their environment or have them push them into theirs. Folks weren’t sure on that so I tried to come up with something that would work for both.

Read on for Hope’s solution to the problem.

Comments closed

Using the Fail Activity in Azure Data Factory

Published 2021-12-01 by Kevin Feasel

Rayis Imayev thinks about failure:

Recently, Microsoft introduced a new Fail activity (https://docs.microsoft.com/en-us/azure/data-factory/control-flow-fail-activity) in the Azure Data Factory (ADF) and I wondered about a reason to fail a pipeline in ADF when my internal being tries very hard to make the pipelines successful once and for all. Yes, I understand a documented explanation that this activity can help to “customize both its error message and error code”, but why?

Click through for Rayis’s take. I’ll just be here cracking jokes about how Fail activities are banned in my code because I expect it to have a positive outlook on life.

Comments closed

Building an ETL Pipeline with Airflow and Containers

Published 2021-11-30 by Kevin Feasel

Nikita Vasilev needs to move some data:

Obviously, we can use one of the many ready-made ETL systems that implement the functions of loading information into the corporate data warehouse. Informatica PowerCenter, Oracle Data Integrator, SAP Data Services, Oracle Warehouse Builder, Talend Open Studio, Pentaho are just a sliver of off-the-shelf solutions. However, when it comes to large volumes of data at high speeds and Big Data infrastructure already in place, boxed solutions fall flat to satisfy your needs.
Therefore, Big Data pipelines require something like Apache Airflow. It’s an open-source set of libraries for developing, planning, and monitoring workflows. Airflow is written in Python and allows you to create and configure task chains both visually with a clear web-GUI and to write Python program code.

Click through for an example using Airflow with AWS’s Elastic Container Service.

Comments closed

ETL from an API

Published 2021-11-19 by Kevin Feasel

Bill Fellows unravels a bad practice:

The direction for software as a service providers is to provide APIs to access their data instead of structured file exports. Which is a pity, as every SaaS system requires a bespoke data extract solution. I inheireted a solution that had an adverse pattern I’d like to talk about.

The next level of complexity, at least in the space Bill covers in the example, is what to do when the upstream provider changes their data, some with changes even a week later.

Comments closed

Updates to Azure Synapse Link

Published 2021-11-12 by Kevin Feasel

Aria Jelinek outlines the value of Azure Synapse Link:

New as of Ignite 2021, customers can optimize queries by setting custom partitions for their Azure Cosmos DB analytical store using keys that are commonly used as query filters. This compacts and optimizes the analytical data written to the partitioned store, resulting in better query performance even when working with a high volume of update or delete operations.
Azure Synapse Link is also now available for Azure Cosmos DB serverless accounts, expanding the integration to cover data from workloads with bursts of traffic or uncertain traffic patterns.

This post mostly covers the Dataverse and Cosmos DB integrations rather than the integration with SQL Server 2022.

One the whole, I like Azure Synapse Link for Cosmos DB and will probably like it for SQL Server 2022—maybe even a bit more. It does simplify the ELT process by taking care of the E and handling the first half of the L (landing into a staging table). Though if data’s going into a dedicated SQL pool, I do hope the people doing this understand that dedicated SQL pools are intended for Kimball-style data warehousing scenarios and there can be a considerable performance (and therefore price) hit if you simply replicate a bunch of stuff without subsequent transformation.

Comments closed

Automating Single Table Refresh with Azure Data Factory and Azure Automation

Published 2021-10-29 by Kevin Feasel

Marc Lelijveld wants to refresh a single table:

Back in February, I wrote a blog on how you can trigger a single table to refresh in your Power BI data model. This blog described how you can achieve this goal using a PowerShell script and the ASCmd cmdlets for Analysis Services, which also works for Power BI Premium. In the wrap-up of that blog, I promised to follow-up with a blog on how to achieve the same goal with Azure Data Factory. It took a little bit longer than expected to finalize this post, but here it is!
In this blog, co-authored by my colleague Paulien van Eijk, we will describe how you can automate your single table refresh in the Power BI Service, including all dependencies with downstream dataflows using Azure Data Factory and Azure Automation. All this is based on real life scenarios and a solution build in collaboration between Dave Ruijter, Paulien and me.

Read on for Marc and Paulien’s solution.

Comments closed

Category: ETL / ELT