Category: ETL / ELT

We have some data we can query using the serverless SQL pools in Azure Synapse Analytics. For this blog post, I’m querying data that is stored in Azure Cosmos DB. Read the blog post How to Store Normalized SQL Server Data into Azure Cosmos DB to learn more about how that data got there.

Suppose I now want to read the data using Azure Data Factory. You can read data from Cosmos DB directly, but let’s pretend I want to do some transformations first using my favorite language: SQL. How can we do this?

Read on to learn how.

Comments closed

Granular Billing for Azure Data Factory

Published 2022-10-19 by Kevin Feasel

Chenye Charlie Zhu announces a new feature:

By default, Azure Data Factory reports lump sum charges for billing, meaning that at the factory level, we add up charges across all pipelines within a factory, and tell you how much you have spent on these pipelines. In many cases, these aggregate numbers should suffice. But in others, these numbers lack the clarity and transparency that we strive to provide customers. For instance, if you are running data pipelines for multiple teams, you may want to determine the cost for each pipeline, for proper book-keeping and/or chargebacks.
Now, Azure Data Factory will help you with this endeavor, with a built-in, detailed per-pipeline billing view. Moreover, we built the feature on top of the Azure Billing and Cost Analysis platform, allowing you to stay with the cost and budget management tool that you are familiar with to identify spending trends and spot where overspending might have occurred.

Great if you have half a dozen pipelines. Probably less great if you have 500.


Interacting with Microsoft Graph API via Synapse

Published 2022-10-03 by Kevin Feasel

Paul Hernandez starts a new series:

In this and the next post I want to show you how to connect to the Microsoft Graph API, request some data, process it and store it in a database using Synapse Analytics.
This first post presents a sample use case, briefly introduces the Graph API, how to create a linked service to it, and how to start querying data. In the next post a sample Synapse pipeline will be described. The pipeline grabs some data and copies it into some target tables. Finally, I will create a sample query to showcase the newly imported data.
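For readers who want to see what "querying the Graph API" amounts to at the HTTP level, here is a minimal Python sketch. It is not taken from Paul's post, which uses a Synapse linked service. It only builds (and does not send) the two requests involved in an app-only call: a client-credentials token request and a bearer-authenticated GET. The tenant ID, client ID, secret, and the helper names `build_token_request` and `build_graph_request` are placeholders for illustration.

```python
import urllib.parse
import urllib.request

# Placeholder values; a real app registration supplies these.
TENANT_ID = "00000000-0000-0000-0000-000000000000"
CLIENT_ID = "my-app-id"
CLIENT_SECRET = "my-secret"


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the OAuth 2.0 client-credentials request for a Graph token."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    }).encode("ascii")
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    return urllib.request.Request(url, data=body, method="POST")


def build_graph_request(access_token: str, resource: str = "users") -> urllib.request.Request:
    """Build a GET request against the Graph v1.0 endpoint for a resource."""
    url = f"https://graph.microsoft.com/v1.0/{resource}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {access_token}"})


token_request = build_token_request(TENANT_ID, CLIENT_ID, CLIENT_SECRET)
graph_request = build_graph_request("example-token")
```

Passing either request to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it runs without credentials.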

To head off some potential confusion: the Graph API is completely different from graph databases.


Real-Time Streaming ETL with Kafka and Debezium

Published 2022-09-27 by Kevin Feasel

Dursun Koc doesn’t have time for batched ETL:

Debezium is not extracting data using SQL. It uses database log files to track the changes in the database, so it has minimum effect on the source system. For more information about Debezium, please visit their website.
After the data is extracted, we need Kafka Connect to stream it into Apache Kafka so that we can work with it and reshape it as required. We will be using ksqlDB to reshape the raw data into the form required by the target system. Let’s consider a simple ordering system database in which we have a customer table, a product table, and an orders table, as shown below.
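To make the log-based change capture idea a bit more concrete, here is a small Python sketch (not from Dursun's post, which uses Kafka Connect and ksqlDB). It assumes the usual Debezium change-event shape, with `op`, `before`, and `after` fields, and replays a few events against an in-memory dictionary. The `apply_change` helper and the sample customer rows are made up for illustration.

```python
from typing import Any, Dict, Optional

Row = Dict[str, Any]


def apply_change(table: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory copy of a table."""
    op = event["op"]
    before: Optional[Row] = event.get("before")
    after: Optional[Row] = event.get("after")
    if op in ("c", "r", "u"):  # create, snapshot read, update
        table[after[key]] = after
    elif op == "d":  # delete: only the "before" image is populated
        table.pop(before[key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical events for a customer table.
customers: Dict[Any, Row] = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}},
    {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": None},
]
for event in events:
    apply_change(customers, event)
```

After the loop, `customers` holds only the updated row for customer 1, which is the state a downstream target would need to reflect.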

Read on for an overview as well as a link to the GitHub repo where you can try this all out.


Using the ShortCircuitOperator in Airflow

Published 2022-09-09 by Kevin Feasel

Lior Gavish shows off a useful operator in Apache Airflow:

But what happens when Airflow testing doesn’t catch all of your bad data? What if “unknown unknown” data quality issues fall through the cracks and affect your Airflow jobs?
One helpful but underutilized solution is to leverage the Airflow ShortCircuitOperator to create data circuit breakers to prevent bad data from flowing across your data pipelines.
Data circuit breakers are powerful, but as with most data quality tactics, the nuances of how they are implemented are critical. Otherwise, you can make a bad problem worse.
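As a rough illustration of the circuit-breaker idea (a sketch under stated assumptions, not Lior's code): Airflow's `ShortCircuitOperator` takes a `python_callable` and skips downstream tasks when that callable returns a falsy value. The check below is plain Python so it runs without Airflow installed; the function name `rows_look_healthy`, the column name, and the thresholds are all hypothetical.

```python
from typing import Any, Iterable, Mapping


def rows_look_healthy(
    rows: Iterable[Mapping[str, Any]],
    required_column: str,
    min_rows: int = 1,
    max_null_fraction: float = 0.05,
) -> bool:
    """Return True when the batch passes basic volume and null-rate checks."""
    batch = list(rows)
    if not batch or len(batch) < min_rows:
        return False  # empty or undersized batch: trip the breaker
    nulls = sum(1 for row in batch if row.get(required_column) is None)
    return nulls / len(batch) <= max_null_fraction


# In a DAG, a callable built on this check would be handed to the operator, e.g.
#   ShortCircuitOperator(task_id="breaker", python_callable=check_batch)
# where check_batch (hypothetical) loads the batch and calls rows_look_healthy.
good_batch = [{"order_id": i} for i in range(100)]
bad_batch = [{"order_id": None if i % 2 else i} for i in range(100)]

good_result = rows_look_healthy(good_batch, "order_id")
bad_result = rows_look_healthy(bad_batch, "order_id")
```

The thresholds are where the nuance in the quote shows up: set them too tight and the breaker halts healthy pipelines, too loose and bad data still gets through.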

Read on to learn more about the operator and how you can use it. The code block images are a bit fuzzy but still readable, and they may be a little clearer in the original post.


Data Modification with Synapse Link for SQL Server 2022

Published 2022-08-31 by Kevin Feasel

Kevin Chant changes some data:

In this post I want to cover some things that happen internally when you do updates and deletes with Azure Synapse Link for SQL Server 2022 whilst it is running.
I am writing this because somebody recently asked whether Azure Synapse Link for SQL Server 2022 captures updates and deletes after they had read a previous post, where I covered my initial tests for Azure Synapse Link for SQL Server 2022.
Anyway, the short answer is that Azure Synapse Link for SQL Server 2022 captures updates and deletes. In this post I will go into more detail about some of the things that appear to happen along the way.

Click through for Kevin’s tests and what the results look like.


Power Automate and Dataset-Driven Power BI Subscriptions

Published 2022-08-31 by Kevin Feasel

Dan English follows up on a prior topic:

In the last post I went over using Power Automate to perform a data driven report subscription using a Paginated report referencing an AAS database. The flow referenced an Excel file with the information to make the process data driven and generate 2000 PDF files that could then be emailed to users. In the flow the PDF files were simply placed in a OneDrive folder for testing purposes to validate the flow would run as expected and to review the metrics after the fact to evaluate the impact of running the process.
For the follow-up, there were two items that I wanted to compare against the original flow:
1. Moving the AAS database being referenced to a Power BI dataset hosted in the same capacity as the Paginated report
2. Using a Power BI report instead of a Paginated report
In this post I will cover the first comparison.

Check out what changes and what stays the same between using Azure Analysis Services and Power BI-hosted datasets.


Azure Synapse Link for SQL Server 2022 and File Analysis

Published 2022-08-23 by Kevin Feasel

Kevin Chant digs into Azure Synapse Link for SQL Server 2022:

In this post I want to cover some file tests for Azure Synapse Link for SQL Server 2022 that I performed.
I am doing so because a while back I spotted something interesting whilst I was doing some initial tests for Azure Synapse Link for SQL Server 2022.
When you add new data after the initial load, a new folder called ‘ChangeData’ appears in the storage account container. I noticed that the new file containing the insert was a comma-separated values (CSV) file, whereas the table used for the initial load was a Parquet file.

Is there a method to this madness? Click through to see Kevin’s tell-all story.


Adding an Existing Data Factory to GitHub

Published 2022-08-17 by Kevin Feasel

Andy Leonard has a three-parter for us. Part 1 shows you how to create a GitHub account and repo:

The unabridged topic of source control with GitHub is beyond the scope of this post. There are a number of ways to accomplish the tasks described in this post and series. I welcome your suggestions in the comments.
This post is written to help Azure Data Factory developers get started using GitHub.

Part 2 connects a Data Factory to the repository:

For the purposes of this demo, accept the defaults for “Publish branch” and “Root folder.” Check the “Import existing resources to repository” checkbox under the “Import existing resource” property, select the main branch in the “Import resource into this branch” property, and then click the “Apply” button:

Part 3 handles changes:

Applying what we’ve configured and learned thus far, let’s put this to work in a code-management workflow.
When it’s time to make a change, first create a new branch. I can hear some of you thinking, “Why, Andy? Why create a new branch?” That’s an excellent question. I am so glad you asked! Think of the new branch as a temporary copy of the current state of my Azure Data Factory.

This series works from the assumption that you don’t have any real experience with Git (or GitHub) for source control, and maybe not much source control experience at all.


Database-Driven Parameterization for Synapse Pipelines

Published 2022-08-11 by Kevin Feasel

Paul Hernandez does some configuring:

In Synapse in particular, there are not even global parameters like those in Azure Data Factory.
When you want to move your development to another environment, CI/CD pipelines are typically used. These pipelines consume an ARM template together with its parameter file to create a workspace in a target environment. The parameters can be overridden in the CD pipeline, as explained here: https://techcommunity.microsoft.com/t5/data-architecture-blog/ci-cd-in-azure-synapse-analytics-part-4-the-release-pipeline/ba-p/2034434
Even so, I have not found a proper way to change the values of a pipeline parameter (the same goes for data flow and dataset parameters). I saw some custom parameter manipulation to set the default value of a parameter and then deploy it without any value, or even JSON manipulation with PowerShell (the dark side for me).
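One way to picture a database-driven approach is a lookup table keyed by environment and pipeline, read at run time instead of baking values into the deployed pipeline. The Python sketch below uses an in-memory SQLite database purely for illustration; the table name `pipeline_parameter`, its columns, the sample values, and the `load_parameters` helper are assumptions and not necessarily what the linked post does.

```python
import sqlite3
from typing import Dict


def load_parameters(conn: sqlite3.Connection, environment: str, pipeline: str) -> Dict[str, str]:
    """Fetch the name/value pairs configured for one pipeline in one environment."""
    rows = conn.execute(
        "SELECT name, value FROM pipeline_parameter "
        "WHERE environment = ? AND pipeline = ?",
        (environment, pipeline),
    ).fetchall()
    return dict(rows)


# Hypothetical configuration table and sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipeline_parameter ("
    "environment TEXT, pipeline TEXT, name TEXT, value TEXT, "
    "PRIMARY KEY (environment, pipeline, name))"
)
conn.executemany(
    "INSERT INTO pipeline_parameter VALUES (?, ?, ?, ?)",
    [
        ("dev", "load_orders", "source_container", "orders-dev"),
        ("dev", "load_orders", "batch_size", "100"),
        ("prod", "load_orders", "source_container", "orders"),
        ("prod", "load_orders", "batch_size", "5000"),
    ],
)

dev_params = load_parameters(conn, "dev", "load_orders")
prod_params = load_parameters(conn, "prod", "load_orders")
```

With this layout, promoting a pipeline to another environment changes only which rows it reads, not the pipeline definition itself.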

Read on for an alternative solution which does the job well.

