Category: ETL / ELT

Cosmos DB to Data Explorer Synapse Link

Vincent-Philippe Lauzon makes an announcement:

We recently made our new Kusto data connection available in public preview: Cosmos DB to Azure Data Explorer Synapse Link.

This does look like a marketing-heavy announcement, but the short version is that you can ingest data from Cosmos DB into Data Explorer pools via Synapse Link rather than creating your own ETL process. The previous Cosmos DB connector for Synapse Link was tied to a dedicated SQL pool.

Row-Level Security and Data Migration

Forrest McDaniel shares an interesting case of using row-level security:

This was the situation I found myself in earlier this year – our company had absorbed another, and it was time to slurp up their tables. There were a lot of decisions to make and tradeoffs to weigh, and we ended up choosing to trickle-insert their data, but make it invisible to normal use until the moment of cutover.

The way we implemented this was with Row Level Security. Using an appropriate predicate, we could make sure ETL processes only saw migrated data, apps saw unmigrated data, and admins saw everything. To give a spoiler: it worked, but there were issues.

I would not have thought of this scenario. And given the difficulties Forrest & crew ran into, it might be for the best…
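
Still, to make the pattern concrete, here is a minimal hypothetical sketch of that kind of predicate; the table name, flag column, and user names are all invented, and this illustrates the general technique rather than Forrest's actual implementation:

    -- Inline predicate function: decides which rows the current caller may see.
    CREATE FUNCTION dbo.fn_MigrationPredicate (@IsMigrated BIT)
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN
        SELECT 1 AS fn_result
        WHERE (@IsMigrated = 1 AND USER_NAME() = N'etl_user')  -- ETL sees only migrated rows
           OR (@IsMigrated = 0 AND USER_NAME() = N'app_user')  -- apps see only unmigrated rows
           OR IS_ROLEMEMBER(N'db_owner') = 1;                  -- admins see everything
    GO

    -- Bind the predicate to the table receiving the trickle-inserted data.
    CREATE SECURITY POLICY dbo.MigrationFilter
        ADD FILTER PREDICATE dbo.fn_MigrationPredicate(IsMigrated) ON dbo.Customers
        WITH (STATE = ON);

Because the filter predicate runs on every read against the table, the issues Forrest describes are worth reading about before trying this at scale.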

Script Activity Outputs to ForEach Inputs with ADF

Meagan Longoria links in a script:

In early 2022, Microsoft released a new activity in Azure Data Factory (ADF) called the Script activity. The Script activity allows you to execute one or more SQL statements and receive zero, one, or multiple result sets as the output. This is an advantage over the stored procedure activity that was already available in ADF, as the stored procedure activity doesn’t support using the result set returned from a query in a downstream activity.

However, when I went to find examples of how to reference those result sets, the documentation was lacking. 

Click through as Meagan corrects a gap in documentation.
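
As a flavor of the pattern, the Script activity might run a metadata query like the hypothetical one below. Going by the output shape Meagan walks through, a downstream ForEach activity's Items property would then reference something like @activity('Get Table List').output.resultSets[0].rows (the activity name here is invented), with each column available inside the loop via item().SchemaName and item().TableName:

    -- Hypothetical query for the Script activity: one row per user table
    -- for the downstream ForEach to iterate over.
    SELECT s.name AS SchemaName,
           t.name AS TableName
    FROM sys.tables AS t
    INNER JOIN sys.schemas AS s
        ON t.schema_id = s.schema_id
    WHERE t.is_ms_shipped = 0;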

Reading Serverless SQL Pool Data with Data Factory

Koen Verbeeck wants to read from the serverless SQL pool in Azure Synapse Analytics:

We have some data we can query using the serverless SQL pools in Azure Synapse Analytics. For this blog post, I’m querying data that is stored in Azure Cosmos DB. Read the blog post How to Store Normalized SQL Server Data into Azure Cosmos DB to learn more about how that data got there.

Suppose I now want to read the data using Azure Data Factory. You can read data from Cosmos DB directly, but let’s pretend I want to do some transformations first using my favorite language: SQL. How can we do this?

Read on to learn how.
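
For reference, serverless SQL pools expose Cosmos DB data through OPENROWSET, reading from the analytical store (so Synapse Link must be enabled on the container). A minimal hypothetical sketch, with account, database, and container names invented, looks something like this:

    -- Query a Cosmos DB container from a Synapse serverless SQL pool.
    SELECT TOP 10 *
    FROM OPENROWSET(
        'CosmosDB',
        'Account=myCosmosAccount;Database=SalesDb;Key=<account key>',
        Customers
    ) AS c;

Wrap a query like that in a view and Data Factory can then read it like any other SQL source.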

Granular Billing for Azure Data Factory

Chenye Charlie Zhu announces a new feature:

By default, Azure Data Factory reports lump sum charges for billing, meaning that at the factory level, we add up charges across all pipelines within a factory, and tell you how much you have spent on these pipelines. In many cases, these aggregate numbers should suffice. But in others, these numbers lack the clarity and transparency that we strive to provide customers. For instance, if you are running data pipelines for multiple teams, you may want to determine the cost for each pipeline, for proper bookkeeping and/or chargebacks.

Now, Azure Data Factory will help you with this endeavor, with a built-in, per-pipeline detailed billing view. Moreover, we built the feature on top of the Azure Billing and Cost Analysis platform, allowing you to stay with the cost and budget management tool that you are familiar with to identify spending trends and spot where overspending might have occurred.

Great if you have half a dozen pipelines. Probably less great if you have 500.

Interacting with Microsoft Graph API via Synapse

Paul Hernandez starts a new series:

In this and the next post I want to show you how to connect to the Microsoft Graph API, request some data, process it and store it in a database using Synapse Analytics. 

This first post presents a sample use case, briefly introduces the Graph API, how to create a linked service to it, and how to start querying data. In the next post, a sample Synapse pipeline will be described. The pipeline grabs some data and copies it into some target tables. Finally, I will create a sample query to showcase the newly imported data.

To clear up some potential confusion: the Graph API is completely different from the idea of graph databases.

Real-Time Streaming ETL with Kafka and Debezium

Dursun Koc doesn’t have time for batched ETL:

Debezium is not extracting data using SQL. It uses database log files to track the changes in the database, so it has minimal effect on the source system. For more information about Debezium, please visit their website.

After the data is extracted, we need Kafka Connect to stream it into Apache Kafka in order to play with it and reshape it as required. And we will be using ksqlDB to reshape the raw data in the way the target system requires. Let’s consider a simple ordering system database in which we have a customer table, a product table, and an orders table, as shown below.

Read on for an overview as well as a link to the GitHub repo where you can try this all out.
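
For a sense of what the ksqlDB reshaping step can look like, here is a hypothetical sketch (topic, stream, and column names invented): register the Debezium change topic as a stream, then flatten the change envelope's "after" image into a new stream shaped for the target system.

    -- Register the raw Debezium change topic as a ksqlDB stream
    -- (column definitions come from the Avro schema in Schema Registry).
    CREATE STREAM orders_raw WITH (
        KAFKA_TOPIC = 'dbserver1.inventory.orders',
        VALUE_FORMAT = 'AVRO'
    );

    -- Flatten the envelope's "after" image into a clean, reshaped stream.
    CREATE STREAM orders_flat AS
        SELECT after->order_id    AS order_id,
               after->customer_id AS customer_id,
               after->product_id  AS product_id,
               after->quantity    AS quantity
        FROM orders_raw;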

Using the ShortCircuitOperator in Airflow

Lior Gavish shows off a useful operator in Apache Airflow:

But what happens when Airflow testing doesn’t catch all of your bad data? What if “unknown unknown” data quality issues fall through the cracks and affect your Airflow jobs? 

One helpful but underutilized solution is to leverage the Airflow ShortCircuitOperator to create data circuit breakers to prevent bad data from flowing across your data pipelines.

Data circuit breakers are powerful, but as with most data quality tactics, the nuances of how they are implemented are critical. Otherwise, you can make a bad problem worse.

Read on to learn more about the operator and how you can use it. The code block images are a bit fuzzy but still readable enough; they might be a little clearer in the original post.

Data Modification with Synapse Link for SQL Server 2022

Kevin Chant changes some data:

In this post I want to cover some things that happen internally when you do updates and deletes with Azure Synapse Link for SQL Server 2022 whilst it is running.

Recently, somebody asked whether Azure Synapse Link for SQL Server 2022 captures updates and deletes after they had read a previous post, where I covered my initial tests for Azure Synapse Link for SQL Server 2022.

Anyway, the short answer is that Azure Synapse Link for SQL Server 2022 does capture updates and deletes. In this post I will go into more detail about some of the things that appear to happen along the way.

Click through for Kevin’s tests and what the results look like.
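
If you want to reproduce the basic shape of the test, it amounts to running ordinary DML against a table covered by the link while it is running, for example (table and values invented):

    -- Ordinary DML against a linked table; Synapse Link should
    -- propagate both changes to the dedicated SQL pool target.
    UPDATE dbo.SalesOrders SET Quantity = 5 WHERE OrderID = 42;
    DELETE FROM dbo.SalesOrders WHERE OrderID = 43;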

Power Automate and Dataset-Driven Power BI Subscriptions

Dan English follows up on a prior topic:

In the last post I went over using Power Automate to perform a data-driven report subscription using a Paginated report referencing an AAS database. The flow referenced an Excel file with the information to make the process data-driven and generate 2,000 PDF files that could then be emailed to users. In the flow, the PDF files were simply placed in a OneDrive folder for testing purposes to validate that the flow would run as expected and to review the metrics after the fact to evaluate the impact of running the process.

For the follow-up, there were two items that I wanted to compare against the original flow:

1. Moving the AAS database being referenced to a Power BI dataset hosted in the same capacity as the Paginated report

2. Using a Power BI report instead of a Paginated report

In this post I will cover the first comparison. 

Check out what changes and what stays the same between using Azure Analysis Services and Power BI-hosted datasets.
