Category: ETL / ELT

ETL Orchestration and Air Traffic Control

Jens Vestergaard extends a metaphor:

We have been working on getting an enterprise-grade, event-driven orchestration of our ETL system to operate like an airport control tower, managing a fleet of flights (data processes) as they progress through the various stages of take-off, transit, and landing. All of this because Microsoft Fabric has a core-based limit on the number of Notebook executions that a capacity can run or hold queued when you invoke them via the REST API. Read the details here: limits (you know, it’s funny that there is no stated limit on the number of messages in an Azure Service Bus queue, but there is one for Microsoft Fabric, which uses a Service Bus queue underneath…)

That limitation is a bit annoying, but read on to see how Jens uses this metaphor to explain the various parts of an ETL orchestration engine.
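
If you are invoking notebooks through the REST API yourself, the practical upshot of that limit is that your orchestrator has to be ready for rejected requests. Here is a minimal Python sketch, assuming the Fabric job scheduler endpoint and a simple retry-on-throttle policy; the IDs, token handling, and backoff are illustrative, not Jens's implementation:

```python
import time
import requests

# Placeholder values -- substitute your own workspace/notebook IDs and token.
WORKSPACE_ID = "<workspace-guid>"
NOTEBOOK_ID = "<notebook-guid>"
TOKEN = "<aad-access-token>"

def run_notebook(max_retries: int = 5) -> requests.Response:
    """Submit a notebook job; back off when the capacity queue is full."""
    url = (f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}"
           f"/items/{NOTEBOOK_ID}/jobs/instances?jobType=RunNotebook")
    headers = {"Authorization": f"Bearer {TOKEN}"}
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json={})
        if resp.status_code != 429:  # not throttled: done (or a real error)
            resp.raise_for_status()
            return resp
        # HTTP 429 is what hitting the core-based execution/queue limit
        # looks like; honor Retry-After if present, else back off exponentially.
        time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
    raise RuntimeError("Notebook run still throttled after retries")
```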

Comments closed

Self-Hosted Integration Runtime Reconnecting to Cloud Service

Nivritti Suste handles an error:

In our organization, most data is stored on-premises, with a limited set of less critical data in the cloud. We use Azure to benefit from the cloud environment and Azure Data Factory (ADF) to move data.

With ADF, there are many components that need to integrate within the environment. The data on our on-premises servers needs to be shifted to the cloud periodically, and for that we use the Self-hosted Integration Runtime.

Our developers complain that an ADF pipeline is failing with the error: ‘The Self-hosted Integration Runtime is offline…’ What does this mean?

Click through for the answer.

Comments closed

Looping through bcp Calls in PowerShell

Peter Schott needs to perform a series of bulk inserts:

Sometimes you need to extract a large number of tables into some other format. I’ve written about BCP earlier as a quick option to move data around, but what if you need a delimited text file or a way to repeat the calls without too much trouble across a set of servers? I had this come up recently and wrote up a combination of PowerShell, the sqlserver module, and the bcp.exe tool to allow for a variety of extract types and, importantly, to ensure that we get header data with those extracts in case we need to pull them into another process. This is doable without the header piece, but many processes (and humans) expect some sort of header data. Thus – a slightly expanded process.

Click through for a script and an explanation of the process.
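
Peter's implementation is PowerShell; as a language-agnostic illustration of the same pattern (fetch the column list separately, because bcp does not emit headers, then stitch it onto the extract), here is a rough Python sketch. The server, database, table names, and paths are hypothetical, and it assumes SQL Server 2017+ for STRING_AGG:

```python
import subprocess
from pathlib import Path

# Hypothetical values -- point these at your own server, database, and tables.
SERVER = "MyServer"
DATABASE = "MyDb"
TABLES = ["dbo.Customers", "dbo.Orders"]
OUT_DIR = Path(r"C:\Exports")

def export_table(table: str) -> None:
    """Extract one table with bcp, then prepend a header row."""
    data_file = OUT_DIR / f"{table.replace('.', '_')}.dat"
    final_file = data_file.with_suffix(".csv")
    # bcp emits data only, so grab the column names via sqlcmd first.
    header = subprocess.run(
        ["sqlcmd", "-S", SERVER, "-d", DATABASE, "-h", "-1", "-W", "-Q",
         "SET NOCOUNT ON; SELECT STRING_AGG(name, '|') "
         "WITHIN GROUP (ORDER BY column_id) FROM sys.columns "
         f"WHERE object_id = OBJECT_ID('{table}');"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # -c = character mode, -t| = pipe delimiter, -T = trusted connection.
    subprocess.run(
        ["bcp", f"SELECT * FROM {table}", "queryout", str(data_file),
         "-S", SERVER, "-d", DATABASE, "-c", "-t|", "-T"],
        check=True,
    )
    final_file.write_text(header + "\n" + data_file.read_text(),
                          encoding="utf-8")

for t in TABLES:
    export_table(t)
```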

Comments closed

Power BI Semantic Model Monthly Refresh via Fabric Data Pipelines

Chris Webb has another way to schedule refreshes:

I’m sure you already know how to configure scheduled refresh for your semantic models in Power BI. While the options you have for controlling when refresh takes place are generally good enough – you can configure daily or weekly refreshes and set up to eight refresh times a day – there are some scenarios they don’t cover, such as monthly refreshes. Up to now the workaround has been to use Power Automate to trigger refreshes (see here for an example) or to call the refresh API from another application. Now, with Fabric, you have a much better option for scheduling refreshes: Data Pipelines.

Click through for the demonstration.
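
For context, the "call the refresh API from another application" workaround looks roughly like this in Python (workspace and dataset IDs are placeholders); the pain is that you still need something external to run it on the first of the month, which is exactly what a Data Pipeline schedule now handles for you:

```python
import requests

# Placeholders -- supply your own workspace (group) ID, dataset ID, and token.
GROUP_ID = "<workspace-guid>"
DATASET_ID = "<dataset-guid>"
TOKEN = "<aad-access-token>"

# Power BI REST API: trigger an on-demand refresh of a semantic model.
url = (f"https://api.powerbi.com/v1.0/myorg/groups/{GROUP_ID}"
       f"/datasets/{DATASET_ID}/refreshes")
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"notifyOption": "MailOnFailure"},
)
resp.raise_for_status()  # 202 Accepted means the refresh was queued
```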

Comments closed

Dealing with Schema Drift in Azure Data Factory

Rayis Imayev deals with change:

I will jump straight to the problem statement without a “boring” introduction, which, in a sense, already feels like an opening statement.

Moving data between two or more endpoints is a common task. Sometimes it’s as simple as migrating data from one place to another. Other times, it’s a request to copy specific documents from source environments. In more complex cases, you might need to consolidate multiple data files into the same destination, such as loading several separate files into a single database table.

This was the bête noire of SSIS’s existence. Minor metadata changes would cause the entire system to break down.

Comments closed

Controlling Execution Flow in Fabric Data Pipelines

Reza Rad has everything under control:

In Microsoft Fabric, the Data Factory is the workload for ETL and data integration, and the Data Pipeline is a component in that workload for orchestrating the execution flow. There are activities in the pipeline, and you can define in which order you want the activities to run. In this article and video, you will learn about the execution order and output states in Data Pipeline and how they can be used in real-world scenarios of data integration.

The mechanisms here are fundamentally similar to what we’ve had in Azure Data Factory (obviously) and SQL Server Integration Services.
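
If you peek at the JSON definition behind a pipeline (ADF's format, which Fabric pipelines closely follow), the execution order is just dependency conditions attached to each activity. A stripped-down fragment with made-up activity names, showing a success path and a failure path off the same upstream activity:

```json
[
  {
    "name": "Load Warehouse",
    "type": "Copy",
    "dependsOn": [
      { "activity": "Stage Files", "dependencyConditions": [ "Succeeded" ] }
    ]
  },
  {
    "name": "Notify On Failure",
    "type": "Office365Outlook",
    "dependsOn": [
      { "activity": "Stage Files", "dependencyConditions": [ "Failed" ] }
    ]
  }
]
```

The output states an activity can expose are Succeeded, Failed, Completed, and Skipped, and a downstream activity only runs once all of its dependency conditions are met.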

Comments closed

Notes on Change Tracking for Warehouse Incremental Loads

Meagan Longoria shares some hard-earned experience:

I have a few clients that incrementally load tables from a SQL Server source into their data warehouse or lakehouse by using change tracking. Lately, they encountered some issues with changes to the configuration and the data in the source database, so I decided to share some things you can check before using change tracking as part of your ETL load or when troubleshooting your data load.

Click through for three common issues you may run into while using change tracking.
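
As one illustration of the kind of check involved (a sketch under my own assumptions, not necessarily one of Meagan's three): before an incremental load, confirm that the sync version you stored last time is still within the retention window, or fall back to a full reload. In Python with pyodbc against a hypothetical dbo.Orders table:

```python
import pyodbc

# Hypothetical connection string, table, and stored sync version.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=MyServer;DATABASE=SourceDb;Trusted_Connection=yes;"
)
last_sync_version = 41000  # persisted after the previous incremental load

min_valid, current = conn.execute(
    "SELECT CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('dbo.Orders')), "
    "CHANGE_TRACKING_CURRENT_VERSION();"
).fetchone()

if min_valid is None:
    raise RuntimeError("Change tracking is not enabled on dbo.Orders")
if last_sync_version < min_valid:
    # Retention has purged changes we never read; incremental is unsafe.
    print("Stored version too old -- do a full reload and re-baseline")
else:
    rows = conn.execute(
        "SELECT ct.OrderID, ct.SYS_CHANGE_OPERATION "
        "FROM CHANGETABLE(CHANGES dbo.Orders, ?) AS ct;",
        last_sync_version,
    ).fetchall()
    # ...apply inserts/updates/deletes, then persist `current` as the new version
```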

Comments closed

Converting Excel Dates and Times to SQL

Kristyna Ferris marks the date:

Hey data friends! This one comes from my personal vault (aka my backlog of drafts I’ve been needing to write up) and is a really simple bit of code that I always forget how to write. So, to save us all some ChatGPT-ing, here’s my tried-and-true way of converting Excel Date & Time fields to a true Date & Time in SQL.

Click through for an example of the process.
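
The underlying arithmetic, for reference: Excel stores a date/time as a serial number of days since its epoch, with the time of day as the fractional part, and thanks to an inherited Lotus 1-2-3 leap-year bug the effective epoch is 1899-12-30. Kristyna does the conversion in T-SQL; the same arithmetic sketched in Python:

```python
from datetime import datetime, timedelta

# Excel's day zero is effectively 1899-12-30: serial 1 is 1900-01-01, and
# the fictional 1900-02-29 (a Lotus 1-2-3 bug Excel preserves) shifts the
# epoch back one more day for all modern dates.
EXCEL_EPOCH = datetime(1899, 12, 30)

def from_excel_serial(serial: float) -> datetime:
    """Convert an Excel serial such as 44927.75 to a real datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)

print(from_excel_serial(44927.75))  # 2023-01-01 18:00:00
```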

Comments closed

Automatically Refreshing a Power BI Semantic Model after Dataflow Loads

Reza Rad refreshes a model:

Although this seems like a simple thing to do, it is not a function that you can just turn on or off. If you have a Dataflow that does the ETL and transforms and prepares the data, then to get the most up-to-date data into the report, you will need to refresh the Power BI semantic model after it; only upon a successful refresh of both the dataflow and the semantic model will the report have up-to-date data. Fortunately, in Fabric, this is a straightforward setup. In this article and video, I’ll explain how this is possible.

Click through for the video and the blog post. Granted, this feature is in preview, but using it is pretty straightforward.
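
If you are not on Fabric yet, you can approximate the same chain with the Power BI REST API: trigger the dataflow refresh, poll its transactions until it finishes, and only then refresh the semantic model. A rough Python sketch; the IDs are placeholders, error handling is minimal, and the status strings are from memory, so verify them against the API docs:

```python
import time
import requests

BASE = "https://api.powerbi.com/v1.0/myorg/groups/<workspace-guid>"
HEADERS = {"Authorization": "Bearer <aad-access-token>"}

# 1. Trigger the dataflow refresh.
requests.post(f"{BASE}/dataflows/<dataflow-guid>/refreshes",
              headers=HEADERS,
              json={"notifyOption": "NoNotification"}).raise_for_status()

# 2. Poll the most recent refresh transaction until it finishes.
while True:
    tx = requests.get(f"{BASE}/dataflows/<dataflow-guid>/transactions",
                      headers=HEADERS).json()["value"][0]
    if tx["status"] != "InProgress":
        break
    time.sleep(30)

# 3. Only refresh the semantic model if the dataflow succeeded.
if tx["status"] == "Success":
    requests.post(f"{BASE}/datasets/<dataset-guid>/refreshes",
                  headers=HEADERS).raise_for_status()
```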

Comments closed