Category: ETL / ELT

Query Folding and Staging in Fabric Dataflows Gen2

Chris Webb goes digging:

A few years ago I wrote this post on the subject of staging in Fabric Dataflows Gen2. In it I explained what staging is, how you can enable it for a query inside a Dataflow, and discussed the pros and cons of using it. However one thing I never got round to doing until this week is looking at how you can tell if query folding is happening on staged data inside a Dataflow – which turns out to be harder to do than you might think.

Read on to learn more, and also check out the comment describing an alternative approach to part of Chris’s solution.

Partitioned Compute and Fabric Dataflow Performance

Chris Webb performs a test:

Partitioned Compute is a new feature in Fabric Dataflows that allows you to run certain operations inside a Dataflow query in parallel and therefore improve performance. While UI support is limited at the moment it can be used in any Dataflow by adding a single line of fairly simple M code and checking a box in the Options dialog. But as with a lot of performance optimisation features (and this is particularly true of Dataflows) it can sometimes result in worse performance rather than better performance – you need to know how and when to use it. And so, in order to understand when this feature should and shouldn’t be used, I decided to do some tests and share the results here.

Click through for the test, the result, and an open door for subsequent analysis.
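
To make the general idea of partition-parallel work concrete, here is a minimal Python sketch. It is not how Fabric implements Partitioned Compute and it does not show the M code Chris mentions; the function names, the `region` key, and the sample rows are all hypothetical. It simply splits rows by a key, processes each partition on its own worker, and combines the results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by(rows, key):
    """Group rows into partitions keyed by the value of `key` (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def summarise(rows):
    """The per-partition operation: here, just a sum of the 'amount' field."""
    return sum(row["amount"] for row in rows)


def run_partitioned(rows, key, max_workers=4):
    """Run `summarise` on each partition in parallel and return {partition: total}."""
    partitions = partition_by(rows, key)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        totals = pool.map(summarise, partitions.values())
        # Consume the results while the pool is still open.
        return dict(zip(partitions.keys(), totals))


# Illustrative data only.
sample_rows = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": 5},
    {"region": "north", "amount": 7},
    {"region": "south", "amount": 1},
]
totals_by_region = run_partitioned(sample_rows, "region")
```

Even in a toy like this, splitting the data and scheduling workers has a cost, so a small input can easily run slower in parallel than serially. That is the kind of trade-off the tests in the post are meant to explore.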

Microsoft Fabric Mirroring and SQL Server 2025

Meagan Longoria takes a peek at mirroring in Microsoft Fabric:

Mirroring of SQL Server databases in Microsoft Fabric was first released in public preview in March 2024. Mirrored databases promise near-real-time replication without the need to manage and orchestrate pipelines, copy jobs, or notebooks. John Sterrett blogged about them last year here. But since that initial release, the mechanism under the hood has evolved significantly.

Read on to see how this behaves for versions of SQL Server prior to 2025, and how it changes in 2025.

An Introduction to pg_duckpipe

Yuwei Xiao needs a way to move data:

When we released pg_ducklake, it brought a columnar lakehouse storage layer to PostgreSQL: DuckDB-powered analytical tables backed by Parquet, with metadata living in PostgreSQL’s own catalog. One question kept coming up: how do I keep these analytical tables in sync with my transactional tables automatically?

This is a real problem. If you manage DuckLake tables by hand, running periodic ETL jobs or batch inserts, you end up with stale data, extra scripts to maintain, and an operational surface area that grows with every table. For teams that want fresh analytical views of their OLTP data, this quickly becomes painful.

pg_duckpipe addresses this. It is a PostgreSQL extension (and optionally a standalone daemon) that streams changes from regular heap tables into DuckLake columnar tables in real time. No Kafka, no Debezium, no external orchestrator. Just PostgreSQL.

Click through to learn more about how it works.

Preview-Only Steps in Microsoft Fabric Dataflows

Chris Webb covers a new feature:

I have been spending a lot of time recently investigating the new performance-related features that have rolled out in Fabric Dataflows over the last few months, so expect a lot of blog posts on this subject in the near future. Probably my favourite of these features is Preview-Only steps: they make such a big difference to my quality of life as a Dataflows developer.

The basic idea (which you can read about in the very detailed docs here) is that you can add steps to a query inside a Dataflow that are only executed when you are editing the query and looking at data in the preview pane; when the Dataflow is refreshed these steps are ignored. This means you can do things like add filters, remove columns or summarise data while you’re editing the Dataflow in order to make the performance of the editor faster or debug data problems. It’s all very straightforward and works well.

First up, that feature is pretty interesting, though I could see things break if you only do your testing in the preview pane, because a refresh skips those steps. Second, what Chris does with it is well worth reading.
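
The behavior described in the excerpt can be modeled in a few lines of Python. This is a conceptual sketch only, not Dataflows or M code: the `Step` class, the `preview_only` flag, and the `run_query` function are hypothetical names chosen for illustration. Steps flagged as preview-only run in "preview" mode and are skipped in "refresh" mode.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, int]


@dataclass
class Step:
    """One transformation in a query; preview_only steps are skipped on refresh."""
    name: str
    apply: Callable[[List[Row]], List[Row]]
    preview_only: bool = False


def run_query(rows: List[Row], steps: List[Step], mode: str) -> List[Row]:
    """Apply steps in order; mode is 'preview' (editing) or 'refresh' (full run)."""
    if mode not in ("preview", "refresh"):
        raise ValueError(f"unknown mode: {mode}")
    result = list(rows)
    for step in steps:
        if step.preview_only and mode == "refresh":
            continue  # ignored when the dataflow is refreshed
        result = step.apply(result)
    return result


# Illustrative data and steps only.
source_rows = [{"id": i, "amount": i * 10} for i in range(1, 1001)]
steps = [
    Step("Keep first 100 rows", lambda rows: rows[:100], preview_only=True),
    Step("Add tax", lambda rows: [{**r, "tax": r["amount"] // 5} for r in rows]),
]

preview_rows = run_query(source_rows, steps, "preview")
refresh_rows = run_query(source_rows, steps, "refresh")
```

The preview run sees 100 rows and the refresh sees all 1,000, which is exactly why testing only in the preview pane can hide problems that live in the rows the filter removed.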

Troubleshooting Bad Request in ADF Pipelines

Koen Verbeeck said something bad:

A while ago I blogged about a use case where a pipeline fails during debugging with a BadRequest error, even though it validates successfully. If you’re wondering, this is the helpful error message that you get:

Click through for an image of the 400 Bad Request message, how Koen fixed it originally, and then a different scenario in which that 400 message popped up.

Ultimately, a 400 Bad Request comes down to “You sent me information that doesn’t make sense and I can’t fulfill your request, so fix it, dummy.” 4xx status codes are very rude and insulting. Especially 418: that thing has a mouth like a sailor’s.

Creating Fabric Linked Service Parameters for ADO Deployment

Koen Verbeeck glues together several technologies:

Quite the title, so let me set the stage first. You have an Azure Data Factory instance (or Azure Synapse Pipelines) and you have a couple of linked services that point to Fabric artifacts such as a lakehouse or a warehouse. You want to deploy your ADF instance with an Azure DevOps build/release pipeline to another environment (e.g. acceptance or production) and this means the linked services need to change as well because in those environments the lakehouse or warehouse are in a different workspace (and also have different object IDs).

When you want to deploy ADF, you typically use the ARM template that ADF automatically creates when you publish (when your instance is linked with a git repo). More information about this setup can be found in the documentation. To parameterize certain properties of a linked service, you can use custom parameterization of the ARM template. Anyway, long story short, I tried to parameterize the properties of the Fabric linked service. 

Read on to see how that went, as well as what you need to do to solve this issue.

Alerting People in Microsoft Teams from Data Factory Pipelines

Andy Brownsword sends a message:

Whether running Data Factory, Synapse, or Fabric pipelines, things go wrong – and the de facto response is to send an email. We’ve looked at sending emails from pipelines before, but at scale they can become noise and are easy to ignore.

A more effective option is to surface alerts where collaboration already exists, such as Teams.

In this post we’re going to start looking at using Teams and consolidate notifications into a channel. This functionality gives team members visibility, the ability to update in threads, and the option to tag people for a tighter response loop than typical emails bring.

Click through for the process.
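
For a rough feel of what posting an alert to a Teams channel involves, here is a minimal Python sketch using only the standard library. It is not Andy's approach, which builds the call inside the pipeline itself. The function names and message fields are hypothetical, and the simple `{"text": ...}` payload is an assumption about what the receiving webhook accepts; a webhook created through Workflows may expect an Adaptive Card payload instead.

```python
import json
import urllib.request


def build_teams_alert(pipeline_name: str, run_id: str, error_message: str) -> dict:
    """Build a plain-text alert payload (assumed shape: a single 'text' field)."""
    text = (
        f"Pipeline failed: {pipeline_name}\n\n"
        f"Run ID: {run_id}\n\n"
        f"Error: {error_message}"
    )
    return {"text": text}


def send_teams_alert(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload as JSON to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Building the payload needs no network access; values here are illustrative.
example_payload = build_teams_alert("LoadSales", "run-001", "Copy activity timed out")
```

Sending it would then be a call such as `send_teams_alert(webhook_url, example_payload)`, where `webhook_url` is the URL generated for your own channel. Keeping payload construction separate from the HTTP call makes the message format easy to check without touching Teams.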
