Press "Enter" to skip to content

Category: ETL / ELT

Comparing Microsoft Fabric Consumption for Notebooks and Warehouse SQL Queries

Gilbert Quevauvilliers performs a comparison:

I saw that there was an update where it is now possible to use the Microsoft Fabric Warehouse to copy data directly from OneLake into the Warehouse.

This got me thinking: which would consume more capacity to get the data into the Warehouse table, and which one would be faster?

To do this I am going to be running a SQL query in the Warehouse.

Next, I will use a Notebook to copy the data from the OneLake files section to a Warehouse table.

Gilbert’s specific query involves loading data from a variety of CSV files into a lakehouse via notebook, and then into a warehouse table. Read on for the results.
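To picture the notebook half of the test, here is a minimal PySpark sketch of that kind of load: reading CSV files from the lakehouse Files area and writing them out as a Delta table. The path and table name are placeholders, and Gilbert's actual code (and the Warehouse-side T-SQL he compares it against) may well differ.

```python
# Minimal PySpark sketch, run in a Fabric notebook attached to a lakehouse.
# The Files path and table name below are hypothetical placeholders.

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("Files/sales/*.csv")   # CSV files in the lakehouse Files section
)

# Land the data as a Delta table; the Warehouse side could then ingest it
# with T-SQL (for example, CREATE TABLE AS SELECT against the lakehouse's
# SQL analytics endpoint).
df.write.mode("overwrite").format("delta").saveAsTable("sales_staging")
```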

Dataflows Gen2 Tips and Tricks

Jon Vöge provides advice on the least beloved ELT process:

Dataflows Gen2 are frequently (and often rightfully so) bashed for their performance inefficiencies, especially in comparison with other ingestion and transformation tools in Fabric (Notebooks, Pipelines, Copy Jobs, SPROCs).

The fact remains, however, that in the hands of a self-service developer, they are an incredibly powerful tool – if you can spare the compute on your capacity.

In this article, I will highlight tips and tricks to make the most of working with Dataflow Gen2 in Fabric. The list is by no means exhaustive, but simply consists of a bunch of tips which I found useful in the past year, including new and overlooked features, as well as old best practices.

Read on for some things that are new to Dataflows Gen2, working with SharePoint, and making data loads not quite as slow.

Materializing Lake Views in Microsoft Fabric

Sairam Yeturi reduces ETL and ELT requirements:

Organizations often face challenges when trying to scale analytics across large volumes of data stored in centralized SQL databases. As business teams demand faster, more tailored insights, traditional reporting pipelines can become bottlenecks. By adopting Lakehouse architecture with Microsoft Fabric, business groups can mirror their SQL data into OneLake and organize it using the Medallion architecture—Bronze, Silver, and Gold layers. Materialized lake views play a crucial role in this setup, enabling automated, declarative transformations that clean and enrich data in the Silver layer. This empowers teams to build reliable dashboards and AI-driven insights on top of curated data, all while maintaining performance, governance, and security at scale.

In this post, we will cover how enterprises can use materialized lake views to streamline data orchestration and enhance data quality and monitoring across the Silver and Gold layers, while mirroring their SQL DB tables to Fabric in the Bronze layer.

The best use case for this is a scenario in which your underlying data is already essentially in a star schema or at least easily transformable into one, and you have no interest in modifying the data in the view directly. Do read the limitations before digging in, though, as there are some big ones.
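As a rough illustration of what a declarative Silver-layer transformation looks like, here is a hedged Spark SQL sketch of a materialized lake view run from a Fabric notebook. The schema, table, and column names are invented, and the exact syntax may differ from the preview documentation, so treat it as a shape rather than a reference.

```python
# Hedged sketch: defining a materialized lake view via Spark SQL in a Fabric
# notebook. Schema, table, and column names are hypothetical; check the
# preview docs for the authoritative syntax.
spark.sql("""
    CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS silver.customers_clean
    AS
    SELECT
        customer_id,
        TRIM(UPPER(customer_name)) AS customer_name,
        CAST(signup_date AS DATE)  AS signup_date
    FROM bronze.customers
    WHERE customer_id IS NOT NULL
""")
```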

Microsoft Fabric Pipeline Copy Job Activity in Preview

Connie Xu makes an announcement:

We’re thrilled to announce that the Copy job Activity is now in Preview! 

This new orchestration activity brings the simplicity of the Copy job item directly into your Microsoft Fabric Data Factory pipelines, enabling you to manage data movement alongside transformations, notifications, and more, all in one place.

Read on to learn more about it, including how it differs from the Copy activity and the Copy job item.

Scheduling Copy Jobs in Microsoft Fabric

Ye Xu can run more than once:

Copy Job is the go-to solution in Microsoft Fabric Data Factory for simplified data movement. With native support for multiple delivery styles, including bulk copy, incremental copy, and change data capture (CDC) replication, Copy job offers the flexibility to handle a wide range of scenarios—all through an intuitive, easy-to-use experience.

In this update, we’re excited to announce a powerful new enhancement: multiple scheduler support. This gives you even greater control over when your data moves.

Click through for a screenshot showing how you can set up multiple schedules for a specific copy job. Based on the screenshot, it seems that there is a limit to the number of schedules you can create, though that number (20) is large enough that I couldn’t imagine it being a major impediment for most people.

Loading Data from Network-Protected Storage Accounts into OneLake

Matt Basile grabs some data:

AzCopy is a powerful and performant tool for copying data between Azure Storage and Microsoft OneLake, and is the preferred tool for large-scale data movement due to its ease of use and built-in performance optimizations. AzCopy now supports copying data from firewall-enabled Azure Storage accounts into OneLake using trusted workspace access. Now you can use AzCopy to load data from even network-protected storage accounts, letting you effortlessly load data into OneLake without compromising on security or performance.

Click through for an explanation of trusted workspace access, followed by the steps to try it out for yourself.
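For reference, the copy itself is a single azcopy command; the sketch below wraps it in Python only for consistency with the other examples on this page. The storage account, container, workspace, and lakehouse names are placeholders, and the trusted workspace access configuration on the storage account still has to be in place, as Matt describes.

```python
import subprocess

# Hypothetical URLs -- substitute your own storage account, container,
# Fabric workspace, and lakehouse. The source account can be firewall-enabled
# as long as trusted workspace access is configured for the target workspace.
source = "https://contosostorage.blob.core.windows.net/raw/sales/*"
destination = (
    "https://onelake.blob.fabric.microsoft.com/"
    "SalesWorkspace/SalesLakehouse.Lakehouse/Files/raw/sales/"
)

# Assumes azcopy is installed and already authenticated (e.g. `azcopy login`).
subprocess.run(["azcopy", "copy", source, destination, "--recursive"], check=True)
```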

Tips for Solving SSIS Package Bottlenecks

Andy Brownsword has some advice:

Last time out we started to look at optimising SSIS packages by showing how to identify bottlenecks with a handy script. This time we’re turning insights into action to solve those pain points.

The solutions are grouped into 3 areas: Data Flows, as they do a lot of heavy lifting; the Execute SQL task, which can also be used for transformation and calculations; and finally everything else (because the first two are usually the issue).

Andy has some good advice and plenty of links to prior content around optimizing SSIS performance. One small thing I’d add is architectural: think about whether you can solve the slow part inside SQL Server. If you’re grabbing a huge amount of data from a SQL Server instance and then narrowing it down with filters in the package, it might be a lot faster to push that filtering into the SQL query with a stronger WHERE clause. Or suppose you’re filtering against some small external file, which seems to force you to pull all of the data out of SQL Server to compare against it. Instead of doing that, or setting up a Lookup component to hit the SQL Server instance for each row in the file, how about loading that file into SQL Server and then writing a query to do the work?

In short, the database engine is typically going to be much better at performance than an integration layer would be.
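To make that last suggestion concrete, here is a minimal sketch of the idea outside of SSIS: bulk-insert the small filter file into the database and let the engine do the join. The table and column names are hypothetical; in a real package this would typically be a Data Flow into a staging table followed by an Execute SQL task.

```python
import csv
import pyodbc

# Hypothetical connection string, table, and file names.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=localhost;"
    "Database=Sales;Trusted_Connection=yes;TrustServerCertificate=yes;"
)
cur = conn.cursor()
cur.fast_executemany = True

# Stage the small filter file in a temp table instead of pulling all of
# dbo.Sales out and filtering it row by row in the integration layer.
cur.execute("CREATE TABLE #FilterKeys (CustomerID int PRIMARY KEY);")
with open("filter_keys.csv", newline="") as f:
    keys = [(int(row["CustomerID"]),) for row in csv.DictReader(f)]
cur.executemany("INSERT INTO #FilterKeys (CustomerID) VALUES (?);", keys)

# Now the database engine does the filtering with a join.
cur.execute("""
    SELECT s.*
    FROM dbo.Sales AS s
    JOIN #FilterKeys AS k ON k.CustomerID = s.CustomerID;
""")
rows = cur.fetchall()
```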

Using a Child Pipeline Variable in a Parent Pipeline in Fabric Data Factory

Justin Bird passes back some information:

I answered a question on the Fabric community about return variables recently and thought I would expand upon it in a blog post. The question was how to use a variable derived in a child pipeline downstream in the parent pipeline. The person was specifically deriving a JSON object and wanted to iterate over its values in the parent pipeline.

Click through for the solution.

Optimizing Multi-Notebook Jobs in Microsoft Fabric and AWS Glue

Daniel Janik flips a switch:

Are your Microsoft Fabric pipelines with multiple notebooks running slower than you’d like? Are you paying for more Spark compute time than you should be? The culprit might be a simple setting that’s easy to miss. In this blog post, we’ll dive into the “For pipeline running multiple notebooks” setting in Microsoft Fabric and explain why enabling it can significantly improve your pipeline’s performance and reduce your costs.

Click through for this, as well as a comparison with AWS Glue and ways to perform something similar there.

Data Cleansing Tips in Pandas

Jayita Gulati shares some tips:

Data preparation is one of the most time-consuming parts of any data science or analytics project, but it doesn’t have to be. With the proper techniques, Pandas can help you quickly transform messy and complex datasets into clean, ready-to-analyze formats. From handling missing data to reshaping and optimizing your DataFrames, a few tricks can save you hours of work.

In this article, you will discover seven practical Pandas tips that can speed up your data prep process and help you focus more on analysis and less on cleanup.

Two of the tips are basically “use functional programming techniques,” and I’m okay with that.
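As a small example of that functional style, here is a hedged sketch using DataFrame.pipe to chain cleaning steps; the column names and rules are invented for illustration, not taken from the article.

```python
import pandas as pd

def drop_unusable(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with no order id plus exact duplicates."""
    return df.dropna(subset=["order_id"]).drop_duplicates()

def normalize_types(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce dates and amounts, turning bad values into NaT/NaN."""
    return df.assign(
        order_date=pd.to_datetime(df["order_date"], errors="coerce"),
        amount=pd.to_numeric(df["amount"], errors="coerce"),
    )

raw = pd.DataFrame({
    "order_id": [1, 2, 2, None],
    "order_date": ["2025-01-05", "2025-01-06", "2025-01-06", "bad date"],
    "amount": ["19.99", "5.00", "5.00", "oops"],
})

# Each step is a plain function, so the pipeline reads top to bottom.
clean = raw.pipe(drop_unusable).pipe(normalize_types)
print(clean)
```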
