Category: ETL / ELT

The Downside of Zero-Copy Integration between Kafka and Iceberg

Jack Vanlightly lays out an argument:

Over the past few months, I’ve seen a growing number of posts on social media promoting the idea of a “zero-copy” integration between Apache Kafka and Apache Iceberg. The idea is that Kafka topics could live directly as Iceberg tables. On the surface it sounds efficient: one copy of the data, unified access for both streaming and analytics. But from a systems point of view, I think this is the wrong direction for the Apache Kafka project. In this post, I’ll explain why. 

Read on for an explanation of what “zero-copy” means here, as well as Jack’s position on the matter. I think it’s a solid argument and worth the read.

Updates to Microsoft Fabric Dataflows Gen2

Nikola Ilic digs into some announcements:

In the ocean of announcements from the recent FabCon Europe in Vienna, one that may have gone under the radar was about the enhancements in performance and cost optimization for Dataflows Gen2.

Before we delve into explaining how these enhancements impact your current Dataflows setup, let’s take a step back and provide a brief overview of Dataflows. For those of you who are new to Microsoft Fabric – a Dataflow Gen2 is the no-code/low-code Fabric item used to extract, transform, and load the data (ETL).

It sounds like these changes move Dataflows Gen2 from a “never choose this” option to one that is viable in at least some circumstances.

Migrating from Apache Airflow 2 to 3 on Amazon MWAA

Anurag Srivastava, et al., perform a migration:

Apache Airflow 3.x on Amazon MWAA introduces architectural improvements such as API-based task execution that provides enhanced security and isolation. Other major updates include a redesigned UI for better user experience, scheduler-based backfills for improved performance, and support for Python 3.12. Unlike in-place minor Airflow version upgrades in Amazon MWAA, upgrading to Airflow 3 from Airflow 2 requires careful planning and execution through a migration approach due to fundamental breaking changes.

This migration presents an opportunity to embrace next-generation workflow orchestration capabilities while providing business continuity. However, it’s more than a simple upgrade. Organizations migrating to Airflow 3.x on Amazon MWAA must understand key breaking changes, including the removal of direct metadata database access from workers, deprecation of SubDAGs, changes to default scheduling behavior, and library dependency updates. This post provides best practices and a streamlined approach to successfully navigate this critical migration, providing minimal disruption to your mission-critical data pipelines while maximizing the enhanced capabilities of Airflow 3.

Read on to see what has changed between these two major versions of Airflow, recommendations on what to look out for, and a step-by-step migration guide.
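
As a rough illustration of two of the breaking changes named in the excerpt (SubDAGs and direct metadata database access from workers), here is a minimal sketch of a pre-migration scan over DAG source files. It is not taken from the linked post. The pattern list is an assumption: `SubDagOperator`, `provide_session`, `create_session`, and `settings.Session` are common Airflow 2 names for these features, but a real DAG folder may touch them in other ways, and the function names `scan_source` and `scan_dag_folder` are hypothetical.

```python
import re
from pathlib import Path

# Assumed, non-exhaustive patterns for illustration only; not from the linked post.
PATTERNS = {
    "SubDAG usage": re.compile(r"\bSubDagOperator\b"),
    "direct metadata DB access": re.compile(
        r"\b(provide_session|create_session|settings\.Session)\b"
    ),
}


def scan_source(source: str) -> list[tuple[int, str]]:
    """Return (line_number, label) pairs for lines that match a flagged pattern."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # Skip full-line comments so commented-out code is not reported.
        if line.strip().startswith("#"):
            continue
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


def scan_dag_folder(folder: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every .py file under a folder and collect findings per file."""
    results = {}
    for path in sorted(Path(folder).rglob("*.py")):
        findings = scan_source(path.read_text(encoding="utf-8"))
        if findings:
            results[str(path)] = findings
    return results
```

Calling `scan_dag_folder("dags/")` (a hypothetical path) returns a mapping from file name to flagged lines. A text scan like this only points at places to review; it does not replace the compatibility checks described in the migration guide.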

Updates to Fabric Data Factory

Abhishek Narain has a list of updates:

Workspace Private Link Support for Data Factory (Preview): Microsoft Fabric enables secure data integration through Private Link support in Dataflows Gen2, Pipelines, and Copy jobs. This ensures that inbound data access remains isolated and compliant within protected workspaces. By leveraging VNet data gateways, organizations can securely connect to data sources across Private Link-enabled environments—eliminating exposure to public networks and reinforcing enterprise-grade security for sensitive data operations.

Most of these are security-related updates, with a mixture of things now GA, things currently in preview, and a pair of items coming soon.

Microsoft Fabric Copy Job Updates

Ye Xu has an update:

Copy job is the go-to solution in Microsoft Fabric Data Factory for simplified data movement. With native support for multiple delivery styles, including bulk copy, incremental copy, and change data capture (CDC) replication, Copy job offers the flexibility to handle a wide range of scenarios—all through an intuitive, easy-to-use experience.

This update introduces several enhancements, including connection parameterization, expanded CDC capabilities, new connectors, and a streamlined Copy Assistant powered by Copy job.

Read on to see what’s new. Some of the items in the list are in preview, and others appear to be GA already.

Troubleshooting Bulk Insertion in SQL Server

Rick Dobson lays out some common issues:

Most SQL bulk insert and SQL Server openrowset tutorials skip file access issues that can stop imports cold. Both the bulk insert statement and openrowset function rely on the SQL Server service account to read a source file. The SQL Server service account must have read permission on the file or its folder. It is also convenient to have read & execute as well as list folder content permissions. Also, non-standard source file locations (e.g., C:\Users\Public\Downloads) may not grant default read access to the SQL Server service account – always verify before use.

Click through for several recommendations, links to additional resources, and a few scripts along the way.

Comparing Microsoft Fabric Consumption for Notebooks and Warehouse SQL Queries

Gilbert Quevauvilliers performs a comparison:

I saw that there was an update where it is now possible to use the Microsoft Fabric Warehouse to copy data directly from OneLake into the Warehouse.

This got me thinking, which would consume more capacity to get the data into the Warehouse table. As well as which one would be faster.

To do this I am going to be running a SQL query in the Warehouse.

Next, I will use a Notebook to copy the data from the OneLake files section to a Warehouse table.

Gilbert’s specific query involves loading data from a variety of CSV files into a lakehouse via notebook, and then into a warehouse table. Read on for the results.

Dataflows Gen2 Tips and Tricks

Jon Vöge provides advice on the least beloved ELT process:

Dataflows Gen2 are frequently (and often rightfully so) bashed for their performance inefficiencies. Especially in comparison with other ingestion and transformation tools in Fabric (Notebooks, Pipelines, Copy Jobs, SPROCs).

The fact remains however, that in the hands of a self-service developer, they are an incredibly powerful tool – if you can spare the compute on your capacity.

In this article, I will highlight tips and tricks to make the most of working with Dataflow Gen2 in Fabric. The list is by no means exhaustive, but simply consists of a bunch of tips which I found useful in the past year, including new and overlooked features, as well as old best practices:

Read on for some things that are new to Dataflows Gen2, working with SharePoint, and making data loads not quite as slow.

Materializing Lake Views in Microsoft Fabric

Sairam Yeturi reduces ETL and ELT requirements:

Organizations often face challenges when trying to scale analytics across large volumes of data stored in centralized SQL databases. As business teams demand faster, more tailored insights, traditional reporting pipelines can become bottlenecks. By adopting Lakehouse architecture with Microsoft Fabric, business groups can mirror their SQL data into OneLake and organize it using the Medallion architecture—Bronze, Silver, and Gold layers. Materialized lake views play a crucial role in this setup, enabling automated, declarative transformations that clean and enrich data in the Silver layer. This empowers teams to build reliable dashboards and AI-driven insights on top of curated data, all while maintaining performance, governance, and security on a scale.

In this post, we will cover how enterprises can use materialized lake views to streamline data orchestration and enhance data quality, monitoring across silver and gold layers, while mirroring their SQL DB tables to Fabric in the Bronze layer.

The best use case for this is a scenario in which your underlying data is already essentially in a star schema or at least easily transformable into one, and you have no interest in modifying the data in the view directly. Do read the limitations before digging in, though, as there are some big ones.

Microsoft Fabric Pipeline Copy Job Activity in Preview

Connie Xu makes an announcement:

We’re thrilled to announce that the Copy job Activity is now in Preview! 

This new orchestration activity brings the simplicity of the Copy job item directly into your Microsoft Fabric Data Factory pipelines, enabling you to manage data movement alongside transformations, notifications, and more; all in one place. 

Read on to learn more about it, including how it differs from the Copy activity and the Copy job item.
