
Category: ETL / ELT

Apache Airflow Jobs in Fabric Data Factory

Mark Kromer makes an announcement:

The world of data integration is rapidly evolving, and staying up to date with the latest technologies is crucial for organizations seeking to make the most of their data assets. Available now are the newest innovations in Fabric Data Factory pipelines and Apache Airflow job orchestration, designed to empower data engineers, architects, and analytics professionals with greater efficiency, flexibility, and scalability.

Read on to see what’s newly available, including some preview functionality.
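Airflow's core abstraction is a DAG of tasks executed in dependency order. As a toy, library-free Python sketch of that idea (this is not actual Airflow or Fabric API code, and the task names are invented for illustration):

```python
from graphlib import TopologicalSorter

# A tiny stand-in for an orchestration DAG: each task maps to the set of
# tasks it depends on, mirroring Airflow's upstream/downstream model.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

def run_dag(dag):
    """Execute tasks in an order that respects every dependency."""
    order = list(TopologicalSorter(dag).static_order())
    results = []
    for task in order:
        # In a real orchestrator each task would run as an operator on a
        # worker; here we just record the execution order.
        results.append(task)
    return results

print(run_dag(dag))  # ['extract', 'transform', 'load', 'notify']
```

A real Airflow job adds scheduling, retries, and operators on top of this ordering, but the dependency graph is the part that pipelines and Airflow jobs share.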

Generating Excel Reports via Fabric Dataflows Gen2

Chris Webb builds a report:

So many cool Fabric features get announced at Fabcon that it’s easy to miss some of them. The fact that you can now not only generate Excel files from Fabric Dataflows Gen2, but that you have so much control over the format that you can use this feature to build simple reports rather than plain old data dumps, is a great example: it was only mentioned halfway through this blog post on new stuff in Dataflows Gen2. Nonetheless, it was the Fabcon feature announcement that got me most excited. This is because it shows how Fabric Dataflows Gen2 have gone beyond being just a way to bring data into Fabric and are now a proper self-service ETL tool where you can extract data from a lot of different sources, transform it using Power Query, and load it to a variety of destinations both inside Fabric and outside it (such as CSV files, Snowflake and, yes, Excel).

Click through for an example.

Batch versus Stream for Data Processing

Nikola Ilic answers a question and then the follow-up question:

If you’ve spent any time in the data engineering world, you’ve likely encountered this debate at least once. Maybe twice. OK, probably a dozen times: “Should we process our data in batches or in real time?” And if you’re anything like me, you’ve noticed that the answer usually starts with: “Well, it depends…”

Which is true. It does depend. But “it depends” is only useful if you actually know what it depends on. And that’s the gap I want to fill with this article. Not another theoretical comparison of batch vs. stream processing (I hope you already know the basics). Instead, I want to give you a practical framework for deciding which approach makes sense for your specific scenario, and then show you how both paths look when implemented in Microsoft Fabric.

Read on to learn why both are viable patterns and how you can work with both in Microsoft Fabric.
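To make the distinction concrete, here is a toy Python sketch (invented data, not Fabric code) of the same aggregation done both ways: the batch version recomputes over the full dataset, while the streaming version folds each event into running state as it arrives:

```python
# Toy comparison of batch vs. stream processing of the same events.
events = [3, 7, 2, 8, 5]

# Batch: periodically recompute the aggregate over everything collected so far.
def batch_total(all_events):
    return sum(all_events)

# Stream: update running state the moment each event arrives.
class StreamTotal:
    def __init__(self):
        self.total = 0

    def on_event(self, value):
        self.total += value
        return self.total  # the result is always current, event by event

stream = StreamTotal()
running = [stream.on_event(e) for e in events]

print(batch_total(events))  # 25
print(running)              # [3, 10, 12, 20, 25]
```

Both paths end at the same answer; the trade-off is freshness and operational complexity versus simplicity and throughput, which is exactly the "it depends" the article unpacks.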

dbt and Microsoft Fabric

Pradeep Srikakolapu and Abhishek Narain dig into dbt:

Modern analytics teams are adopting open, SQL-first data transformation, robust CI/CD and governance, and seamless integration across lakehouse and warehouse platforms. dbt is now the standard for analytics engineering, while Microsoft Fabric unifies data engineering, science, warehousing, and BI in OneLake.

By investing in dbt + Microsoft Fabric integration, Microsoft empowers customers with a unified, enterprise-grade analytics platform that supports native dbt workflows—future-proofing analytics engineering on Fabric.

I’ll be interested to see if this retains corporate investment longer than some of their open-source collaborations. That’s been a consistent issue over the years: announce some neat integrations with a popular technology, release a couple of versions, and then quietly deprecate it a year or two later. This sounds like it’s less likely to end up in that boat, simply based on how the Fabric team is collaborating compared to, say, the various Spark on .NET efforts over the years.
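For a sense of what "native dbt workflows" on Fabric look like in practice, here is a sketch of a `profiles.yml` entry in the shape the dbt-fabric adapter uses. Field names can differ by adapter version, and the server, database, and project names here are placeholders, not real values:

```yaml
# profiles.yml sketch for a Fabric warehouse target (field names may vary
# by dbt-fabric adapter version; all values below are placeholders)
my_fabric_project:
  target: dev
  outputs:
    dev:
      type: fabric
      driver: "ODBC Driver 18 for SQL Server"   # ODBC driver the adapter uses
      server: "<workspace>.datawarehouse.fabric.microsoft.com"
      database: "my_warehouse"
      schema: "dbo"
      authentication: "CLI"                     # e.g. Azure CLI-based auth
```

With a profile like this in place, the usual `dbt run` / `dbt test` workflow targets the Fabric warehouse the same way it would any other adapter.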

Query Folding and Staging in Fabric Dataflows Gen2

Chris Webb goes digging:

A few years ago I wrote this post on the subject of staging in Fabric Dataflows Gen2. In it I explained what staging is, how you can enable it for a query inside a Dataflow, and discussed the pros and cons of using it. However, one thing I never got round to doing until this week is looking at how you can tell if query folding is happening on staged data inside a Dataflow – which turns out to be harder to do than you might think.

Read on to learn more, and also check out the comment describing an alternative approach to part of Chris’s solution.

Partitioned Compute and Fabric Dataflow Performance

Chris Webb performs a test:

Partitioned Compute is a new feature in Fabric Dataflows that allows you to run certain operations inside a Dataflow query in parallel and therefore improve performance. While UI support is limited at the moment, it can be used in any Dataflow by adding a single line of fairly simple M code and checking a box in the Options dialog. But as with a lot of performance optimisation features (and this is particularly true of Dataflows), it can sometimes result in worse performance rather than better performance – you need to know how and when to use it. And so, in order to understand when this feature should and shouldn’t be used, I decided to do some tests and share the results here.

Click through for the test, the result, and an open door for subsequent analysis.
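The underlying idea – split a query's work into partitions and evaluate them in parallel, at the cost of some per-partition overhead – can be sketched in plain Python. This illustrates the concept only, not the M feature itself:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))

def process_partition(chunk):
    # Stand-in for the per-partition work (e.g. an aggregation over one slice).
    return sum(x * x for x in chunk)

def partitioned_compute(data, partitions):
    # Split the input into roughly equal slices, one per partition.
    size = -(-len(data) // partitions)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = pool.map(process_partition, chunks)
    return sum(partials)  # combine per-partition results

# Same answer either way; parallelism only pays off when the per-partition
# work dwarfs the splitting and scheduling overhead.
assert partitioned_compute(data, 4) == process_partition(data)
```

That last caveat is the point of Chris's tests: the split/combine overhead is real, so partitioning can make a cheap query slower even while it speeds up an expensive one.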

Microsoft Fabric Mirroring and SQL Server 2025

Meagan Longoria takes a peek at mirroring in Microsoft Fabric:

Mirroring of SQL Server databases in Microsoft Fabric was first released in public preview in March 2024. Mirrored databases promise near-real-time replication without the need to manage and orchestrate pipelines, copy jobs, or notebooks. John Sterrett blogged about them last year here. But since that initial release, the mechanism under the hood has evolved significantly.

Read on to see how this behaves for versions of SQL Server prior to 2025, and how it changes in 2025.

An Introduction to pg_duckpipe

Yuwei Xiao needs a way to move data:

When we released pg_ducklake, it brought a columnar lakehouse storage layer to PostgreSQL: DuckDB-powered analytical tables backed by Parquet, with metadata living in PostgreSQL’s own catalog. One question kept coming up: how do I keep these analytical tables in sync with my transactional tables automatically?

This is a real problem. If you manage DuckLake tables by hand, running periodic ETL jobs or batch inserts, you end up with stale data, extra scripts to maintain, and an operational surface area that grows with every table. For teams that want fresh analytical views of their OLTP data, this quickly becomes painful.

pg_duckpipe addresses this. It is a PostgreSQL extension (and optionally a standalone daemon) that streams changes from regular heap tables into DuckLake columnar tables in real time. No Kafka, no Debezium, no external orchestrator. Just PostgreSQL.

Click through to learn more about how it works.

Preview-Only Steps in Microsoft Fabric Dataflows

Chris Webb covers a new feature:

I have been spending a lot of time recently investigating the new performance-related features that have rolled out in Fabric Dataflows over the last few months, so expect a lot of blog posts on this subject in the near future. Probably my favourite of these features is Preview-Only steps: they make such a big difference to my quality of life as a Dataflows developer.

The basic idea (which you can read about in the very detailed docs here) is that you can add steps to a query inside a Dataflow that are only executed when you are editing the query and looking at data in the preview pane; when the Dataflow is refreshed these steps are ignored. This means you can do things like add filters, remove columns or summarise data while you’re editing the Dataflow in order to make the performance of the editor faster or debug data problems. It’s all very straightforward and works well.

First up, that feature is pretty interesting, though I could see things breaking if you only do your testing in the preview pane. Second, what Chris does with this is quite clever.
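The pattern itself is easy to mimic: mark some pipeline steps as preview-only and skip them at refresh time. A toy Python sketch (invented step names and data; not Dataflows M code) also shows the testing pitfall, since preview and refresh results genuinely differ:

```python
# Each step is (function, preview_only). Preview-only steps run while you're
# editing (to keep the preview fast or to debug) but are skipped on refresh.
steps = [
    (lambda rows: [r for r in rows if r["ok"]], False),              # real filter
    (lambda rows: rows[:2], True),                                   # preview-only sample
    (lambda rows: [{**r, "v": r["v"] * 10} for r in rows], False),   # real transform
]

def run(rows, preview):
    for step, preview_only in steps:
        if preview_only and not preview:
            continue  # ignored at refresh time, like a Preview-Only step
        rows = step(rows)
    return rows

data = [{"ok": True, "v": 1}, {"ok": True, "v": 2},
        {"ok": False, "v": 3}, {"ok": True, "v": 4}]

print(run(data, preview=True))   # 2 sampled rows, transformed
print(run(data, preview=False))  # all 3 surviving rows, transformed
```

Here the preview pane would only ever show two rows, which is exactly why testing solely in the preview can hide problems that appear on a full refresh.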
