Press "Enter" to skip to content

Category: ETL / ELT

MDX vs DAX for Bulk Data Extraction from Power BI

Chris Webb performs a test but gives us a warning first:

This is a post I’ve avoided writing for many years, and before I carry on let me make one thing clear:

Doing bulk extracts of data from a Power BI semantic model is a **really** bad idea

My colleague Matthew Roche wrote a great post on this topic a couple of years ago that is still relevant: using Power BI (or Analysis Services) as a data source for other systems, including other Power BI Import mode semantic models, is an anti-pattern. Power BI is optimised for small, analytical queries that return the amount of data that can be visualised on a single page. It is not optimised for queries that return millions of rows.

After the warning, Chris still gives a performance breakdown of extracting data from a semantic model into Excel, comparing auto-generated MDX and DAX queries.
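
If you want to see what such an extraction looks like in practice, here is a minimal sketch using the semantic-link (sempy) library available in (or pip-installable into) Fabric notebooks. The model and table names are hypothetical, and per the warning above, this is for testing, not production:

```python
# A minimal sketch of bulk-extracting from a semantic model via DAX,
# using the semantic-link (sempy) library in a Fabric notebook.
# "Sales Model" and 'Fact Sales' are hypothetical names.
import sempy.fabric as fabric

# Even capped with TOPN, this is the kind of bulk query the post warns
# against -- fine for a test, an anti-pattern in production.
df = fabric.evaluate_dax(
    dataset="Sales Model",
    dax_string="EVALUATE TOPN(100000, 'Fact Sales')",
)
print(df.shape)
```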


Three Incremental Load Patterns with Azure Data Factory

Temidayo Omoniyi likes a good pattern:

This article is divided into three major sections—each showing the different abilities and use cases of performing incremental load with Azure Data Factory. This process can also be done in an Azure Synapse Pipeline and Fabric Pipeline.

The document contains the following:

Section 1: Copy Data Based on Last Modified Date or Latest File

Section 2: Incremental Copy Using Dataflow

Section 3: Incremental Copy Using Lookup and Stored Procedure Activities

Click through for each of these three patterns, with plenty of screenshots and step-by-step instructions.
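
For a sense of what the third pattern does under the covers, here is a minimal Python sketch of the watermark logic, with hypothetical table and column names; in the actual pipeline, steps 1 through 3 map to the Lookup, Copy Data, and Stored Procedure activities:

```python
# A minimal sketch of the watermark pattern behind Section 3, assuming
# pyodbc and hypothetical names (etl.watermark, dbo.orders, modified_date).
import pyodbc

SRC = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=src-server;DATABASE=sales;Trusted_Connection=yes;"
DST = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=dst-server;DATABASE=dw;Trusted_Connection=yes;"

with pyodbc.connect(SRC) as src, pyodbc.connect(DST) as dst:
    # 1. Lookup activity: read the last high-water mark from the target.
    last_mark = dst.execute(
        "SELECT last_modified FROM etl.watermark WHERE table_name = ?",
        "dbo.orders",
    ).fetchone()[0]

    # 2. Copy Data activity: pull only rows changed since that mark.
    rows = src.execute(
        "SELECT * FROM dbo.orders WHERE modified_date > ?", last_mark
    ).fetchall()
    # ... bulk-insert `rows` into a staging table on the destination ...

    # 3. Stored Procedure activity: advance the watermark to the new max,
    #    so the next run picks up where this one left off.
    if rows:
        new_mark = max(r.modified_date for r in rows)
        dst.execute(
            "UPDATE etl.watermark SET last_modified = ? WHERE table_name = ?",
            new_mark, "dbo.orders",
        )
    dst.commit()
```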


Fabric Benchmarking: Moving CSV Files

Eugene Meidinger breaks out the abacus:

First, a disclaimer: I am not a data engineer, and I have never worked with Fabric in a professional capacity. With the announcement of Fabric SQL DBs, there’s been some discussion on whether they are better for Power BI import than Lakehouses. I was hoping to do some tests, but along the way I ended up on an extensive Yak Shaving expedition.

I have likely done some of these tests inefficiently. I have posted as much detail and source code as I can and if there is a better way for any of these, I’m happy to redo the tests and update the results.

Part one focuses on loading CSV files to the files portion of a lakehouse. Future benchmarks look at CSV to delta and PBI imports.

I think Eugene did a fine job documenting everything in the process, and it was interesting to see the relative price differences between techniques for uploading a very large CSV file.
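
As a flavor of one of the techniques under test, here is a minimal sketch of uploading a CSV over the OneLake DFS endpoint with the azure-storage-file-datalake SDK; the workspace, lakehouse, and file names are hypothetical placeholders:

```python
# A minimal sketch of uploading a local CSV to a lakehouse's Files area
# over the OneLake DFS endpoint. Workspace, lakehouse, and file names
# are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)
# In OneLake, the "file system" is the workspace and paths start at the item.
fs = service.get_file_system_client("MyWorkspace")
file = fs.get_file_client("MyLakehouse.Lakehouse/Files/raw/big_file.csv")

with open("big_file.csv", "rb") as data:
    file.upload_data(data, overwrite=True)
```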


The Showdown: Spark vs DuckDB vs Polars in Microsoft Fabric

Miles Cole puts together a benchmark:

There’s been a lot of excitement lately about single-machine compute engines like DuckDB and Polars. With the recent release of pure Python Notebooks in Microsoft Fabric, the excitement about these lightweight native engines has risen to a new high. Out with Spark and in with the new and cool animal-themed engines — is it time to finally migrate your small and medium workloads off of Spark?

Before writing this blog post, honestly, I couldn’t have answered with anything besides a gut feeling largely based on having a confirmation bias towards Spark. With folks in the community recently posting their own benchmarks highlighting the power of these lightweight engines, I felt it was finally time to roll up my sleeves and explore whether or not I should abandon everything I know and become a DuckDB and/or Polars convert.

Read on for the method and results from several thoughtful tests.
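
To give a flavor of the comparison, here is a minimal sketch of the same aggregation in DuckDB and in Polars, with a hypothetical Parquet file and columns:

```python
# A minimal sketch of the kind of head-to-head the post runs: the same
# aggregation in DuckDB and in Polars. File and column names are
# hypothetical.
import duckdb
import polars as pl

# DuckDB: SQL directly over the file, no Spark session or cluster needed.
duck = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM 'sales.parquet'
    GROUP BY customer_id
""").pl()  # materialize the result as a Polars DataFrame

# Polars: the same query via the lazy API, optimized before execution.
pol = (
    pl.scan_parquet("sales.parquet")
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total"))
    .collect()
)
```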


Ways to Land Data into Microsoft Fabric OneLake

James Serra puts on a cape and takes on an iconic laugh:

Microsoft Fabric is rapidly gaining popularity as a unified data platform, leveraging OneLake as its central data storage hub for all Fabric-integrated products. A variety of tools and methods are available for copying data into OneLake, catering to diverse data ingestion needs. Below is an overview of what I believe are the key options:

Read on for a baker’s dozen methods.


A Gentle Introduction to dbt

Koen Verbeeck makes a change:

In a typical data warehouse project, data is loaded, transformed and stored into a data store using an ETL (extract – transform – load) process. Recently, there has been a shift to ELT processing, where data is first loaded into a data store (this can be a database, but also a data lake or something similar) and then transformed in-place.

Over the past few years, the tool dbt – short for data build tool – has become quite popular in the data engineering world for handling such an ELT process. dbt takes on the role of the “T”, meaning it’s responsible for transforming the data in a given data store. dbt is not meant for transferring data from one store to another; for that you’ll need another tool, such as Azure Data Factory.

Read on to see how it works.
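
dbt models are normally SQL SELECT statements, but on adapters with a Python runtime (e.g., Databricks, Snowflake, BigQuery) dbt also accepts Python models. Here is a minimal sketch of the “T” step as a Python model, with a hypothetical raw_orders input:

```python
# models/completed_orders.py -- a minimal dbt Python model sketch.
# The raw_orders input and its status column are hypothetical.
def model(dbt, session):
    dbt.config(materialized="table")
    orders = dbt.ref("raw_orders")  # another model, seed, or source
    # The "T" in ELT: transform in place; dbt writes the result back to
    # the warehouse as the completed_orders table.
    return orders.filter(orders.status == "completed")
```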


Data Transformation with Dataflows Gen2

Boniface Muchendu provides an overview of Dataflows Gen2 in Microsoft Fabric:

Welcome to a journey into the world of data automation! Imagine working in an organization bustling with data scientists and analysts. In such an environment, you often need to gather and combine data from various sources for further analysis. You could do this manually, but why not leverage automation? In this blog, we’ll explore how to apply automation on data transformations using Dataflows Gen2 in Microsoft Fabric.

I’ll admit that I am not the primary audience for Dataflows Gen2, but I’d still much rather write a Spark notebook and call it a day.
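
For comparison, here is a minimal sketch of the kind of combine-and-transform step a Dataflow Gen2 performs, done instead in a Fabric Spark notebook; table and column names are hypothetical:

```python
# A sketch of a combine-and-transform step in a Fabric Spark notebook,
# assuming the notebook's ambient `spark` session and hypothetical tables.
from pyspark.sql import functions as F

orders = spark.read.table("bronze_orders")
customers = spark.read.table("bronze_customers")

(
    orders.join(customers, "customer_id")
    .withColumn("order_month", F.date_trunc("month", F.col("order_date")))
    .write.mode("overwrite")
    .saveAsTable("silver_orders_enriched")
)
```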


Metadata-Driven Spark Clusters in Azure Databricks

Matt Collins ties the room together with a bit of metadata:

In this article, we will discuss some options for improving interoperability between Azure Orchestration tools, like Data Factory, and Databricks Spark Compute. By using some simple metadata, we will show how to dynamically configure pipelines with appropriately sized clusters for all your orchestration and transformation needs as part of a data analytics platform.

Click through for an explanation of the challenge, followed by the how-to.
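
The core idea can be sketched in a few lines of Python: look up a workload’s size in a metadata store and emit a Databricks Jobs API new_cluster block for the orchestrator to pass along. The sizes and node types below are hypothetical:

```python
# A minimal sketch of metadata-driven cluster sizing: map a workload's
# "t-shirt size" to a Databricks Jobs API new_cluster spec that an
# orchestrator (e.g., an ADF activity) can pass to the Jobs API.
# The sizes and node types here are hypothetical.
CLUSTER_SIZES = {
    "small":  {"node_type_id": "Standard_DS3_v2", "num_workers": 2},
    "medium": {"node_type_id": "Standard_DS4_v2", "num_workers": 4},
    "large":  {"node_type_id": "Standard_DS5_v2", "num_workers": 8},
}

def cluster_spec(workload_size: str, spark_version: str = "15.4.x-scala2.12") -> dict:
    """Build a Jobs API new_cluster block from a metadata lookup."""
    size = CLUSTER_SIZES[workload_size]
    return {
        "spark_version": spark_version,
        "node_type_id": size["node_type_id"],
        "num_workers": size["num_workers"],
    }

# e.g., for a pipeline tagged "medium" in the metadata table:
print(cluster_spec("medium"))
```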


Mounting ADF Instances in Microsoft Fabric

Koen Verbeeck has an existing Azure Data Factory:

We recently started using Microsoft Fabric for our cloud data platform. However, we already have quite an estate of Azure data services running in our company, including a huge number of Azure Data Factory (ADF) pipelines. It seems cumbersome to migrate all those pipelines to Microsoft Fabric, especially because some features are not supported yet and ADF is the mature choice at the moment. We like the concept of Microsoft Fabric’s centralization, where everything is managed in one platform. Is there an option to manage ADF in Fabric?

Read on for the answer, but make sure to check out its limitations as well.
