Press "Enter" to skip to content

Category: ETL / ELT

Trying out the Databricks For-Each Task

Chen Hirsh goes in a loop:

Databricks recently added a for-each task to their workflow capability. Workflows are Databricks jobs, like Data factory pipelines, or SQL server jobs, a pipeline that you can schedule, that include a number of tasks that together complete some business logic.

Theoretically, the long-awaited for-each task should make the run of multiple processes easier. for example, one of the things I often do is run a list of notebooks, each processing a different table, with no dependencies between them. At the moment I use parallel notebooks – https://docs.databricks.com/en/notebooks/notebook-workflows.html#run-multiple-notebooks-concurrently

As you will see later, this use case is not supported yet. But before that, let’s see what we can do with the for-each task.

Read on to see what it currently can do, and what it cannot.

Comments closed

Data Ingestion with Microsoft Fabric Copy Jobs

Reitse Eskens spends a bunch of time at the copier:

The copy job is essentially an abstraction of a pipeline reading data from the source system and writing the data into either a Lakehouse or a Warehouse. It really is ingesting data and nothing else. In my opinion that what copy data flows are meant to do and are very good at too.

The big challenge we all keep facing is how to create incremental loads. We have to build some sort of metadata database where we keep the latest ID, data or other column we use to discern the increment on. In our flow, we need to get that value, compare it against the source system and get the differences. The biggest task is to find out if records are deleted.

With the Copy Job, a large part of this task is taken out of your hands. The Copy Job has a configuration GUI (or wizard) that helps you out quite quickly. So let’s not waste anymore characters and dig in!

Read on to see how it works and its capabilities and limitations. The key question, as always, is whether your workload fits into the wheelhouse. If so, this sounds really useful. If not, it’s a proper struggle.

Comments closed

Cosmos DB HTAP into Azure Synapse Analytics and Microsoft Fabric

Paul Hernandez doesn’t want to write ETL jobs:

In the ever-evolving landscape of data management and analytics, choosing the right tools and approaches is crucial for optimizing performance and achieving business goals. Two prominent solutions that have gained traction are Azure Synapse Link for Azure Cosmos DB and Mirroring in Microsoft Fabric. Both offer unique benefits and cater to different needs, making it essential to understand their differences and use cases.

Read on to see how each of these works, as well as a quick demonstration of efficacy.

Comments closed

Type 2 SCDs in Microsoft Fabric

Reza Rad has changes to make:

In the previous article, I explained SCD (Slowly Changing Dimension) and its different types. In this article, I’ll show you how to implement SCD Type 2 (one of the most common types) using Microsoft Fabric and Power BI. This article includes using Lakehouse, Dataflow, Warehouse, Data Pipeline, SQL Stored Procedures, Power BI Semantic model, and report in Microsoft Fabric.

Click through to learn more about the structure of a type-2 slowly changing dimension, as well as how you can store and load this information to track changes over time.

Comments closed

Cross-Workspace Data Transfer in Microsoft Fabric

Reitse Eskens moves some data around:

When you open Fabric, the first thing you need to do is choose a so-called workspace. This serves as a container for all your Fabric items. You can have one or more workspaces and the design is entirely up to you. From one workspace to rule them all to one workspace for each set of items (Lakehouse, Warehouse, Semantic model and Report, Pipeline, Notebook etc). Until yesterday (the day this blogpost came online) it was impossible to use a pipeline to get data across different workspaces.

You could work around it with tricks like shortcuts, but it feels more natural (or maybe I’m just old ;)) to be able to read data from workspace 1 and write it into workspace 2.
So let’s see how this works and, where capacity is used!

Click through to see it in action.

Comments closed

Configuring Microsoft Fabric Data Mirroring for Snowflake

Koen Verbeeck copies some data:

We have a couple of Snowflake databases and would like to have that data available in Microsoft Fabric as well. Is there an easy solution to get the data quickly in Fabric? We don’t have many technical people on staff, so writing complex ETL is not an option.

Read on for more information on how it works. Mind you, you’re probably still writing the T and some of the L after using mirroring.

Comments closed

Debugging Failed Function Calls in ADF

Andy Brownsword troubleshoots a problem:

I recently ran into an issue when trying to call a function from an ADF pipeline. The function returned a generic Internal Server Error with no details exposed. Here we’ll look at how to dig into the logs to identify the true cause of the failure.

In this instance the function was performing PGP encryption but this could apply to any function. Let’s start with the problem.

Click through for the very generic error message and how you can get the real details.

Comments closed

Bringing SQL Server Data into Microsoft Fabric

Nikola Ilic shows us the current options:

Options, options, options…Having the possibility to perform a certain task in multiple different ways is usually a great “problem” to have, although very often not each option is equally effective. And, Microsoft Fabric is all about “options”…You want to ingest the data? No problem, you can use notebooks, pipelines, Dataflows, or T-SQL. Data transformation needed? No worries at all – again, you may leverage notebooks, T-SQL, Dataflows…Data processing, you asked? Lakehouse (Spark), Warehouse (SQL), Real-Time Intelligence (KQL), Power BI…The choice is yours again.

In a nutshell, almost every single task in Microsoft Fabric can be completed in multiple ways, and there is no “right” or “wrong” tool, as long as it gets the job done (of course, as efficiently as possible).

Nikola lays out two pre-requisites and then shows us two options we can currently use, and three potential options we currently cannot use.

Comments closed

Dynamic Warehouse and Lakehouse Connections in Data Pipelines

Koen Verbeeck doesn’t want to hard-code the connection string:

When you develop data pipelines in Microsoft Fabric (the Azure Data Factory equivalent in Fabric, not to be confused with deployment pipelines), you will most likely have some activities with a connection to a warehouse, a lakehouse or a KQL database (for the remainder of the blog post I’ll talk about a warehouse, but it can be any of those three data stores). For example, in a Script, Lookup, or Copy activity. When you deploy your data pipeline to another workspace – using, you might’ve guessed it, deployment pipelines – the pipeline itself is copied to the other workspace. E.g., we deploy a pipeline from the development workspace to the test workspace.

Read on to see what this means for warehouse connections and how you can work around the existing messiness.

Comments closed

Dataverse and Microsoft Fabric Gotchas

Marc Lelijveld shares some advice:

Recently, I architected a solution for a client for their Microsoft Fabric data platform. The client works with Dynamics Finance & Operations as one of their main ERP system. Fabric offers easy ways to bring data from various standard Microsoft services into the platform, however it is not always as easy as it looks like. In this blog I will elaborate on the gotcha’s encountered in architecting this solution.

Read on for the challenges that Marc ran into along the way.

Comments closed