ETL / ELT – Page 4 – Curated SQL

Converting Excel Dates and Times to SQL

Published 2025-01-30 by Kevin Feasel

Hey data friends! This one comes from my personal vault (aka backlog of drafts I’ve been needing to write up) and is a really simple code that I always forget how to do. So, to save us all some ChatGPT-ing, here’s my tried-and-true way of converting Excel Date & Time fields to a true Date & Time in SQL.

Click through for an example of the process.

Comments closed

Automatically Refreshing a Power BI Semantic Model after Dataflow Loads

Published 2025-01-28 by Kevin Feasel

Reza Rad refreshes a model:

Although this seems to be a simple thing to do, it is not a function that you can turn on or off. If you have a Dataflow that does the ETL and transforms and prepares the data, then to get the most up-to-date data into the report, you will need to refresh the Power BI semantic model after that, only upon successful refresh of both dataflow and semantic model is when you will have the up-to-date data into the report. Fortunately, in Fabric, this is a straightforward setup. In this article and video, I’ll explain how this is possible.

Click through for the video and the blog post. Granted, this feature is in preview, but using it is pretty straightforward.

Comments closed

MDX vs DAX for Bulk Data Extraction from Power BI

Published 2025-01-27 by Kevin Feasel

Chris Webb performs a test but gives us a warning first:

This is a post I’ve avoided writing for many years, and before I carry on let me make one thing clear:

Doing bulk extracts of data from a Power BI semantic model is a **really** bad idea

My colleague Matthew Roche wrote a great post on this topic a couple of years ago that is still relevant: using Power BI (or Analysis Services) as a data source for other systems, including other Power BI Import mode semantic models, is an anti-pattern. Power BI is optimised for small, analytical queries that return the amount of data that can be visualised on a single page. It is not optimised for queries that return millions of rows.

After the warning, Chris still gives a performance breakdown for extracting data from a semantic model in Excel, using automated MDX and DAX.

Comments closed

Three Incremental Load Patterns with Azure Data Factory

Published 2025-01-22 by Kevin Feasel

Temidayo Omoniyi likes a good pattern:

This article is divided into three major sections—each showing the different abilities and use cases of performing incremental load with Azure Data Factory. This process can also be done in an Azure Synapse Pipeline and Fabric Pipeline.

The document contains the following:

Section 1: Copy Data Based on Last Modified Date or Latest File

Section 2: Incremental Copy Using Dataflow

Section 3: Incremental Copy Using Lookup and Stored Procedure Activities

Click through for each of these three patterns, with plenty of screenshots and step-by-step instructions.

Comments closed

Methods to Copy On-Premises SQL Server Data into Microsoft Fabric

Published 2025-01-09 by Kevin Feasel

Gilbert Quevauvilliers runs a test:

In this blog post I am going to determine which item workload uses the least amount of Capacity Units when copying the same data from an On-Premises SQL Server.

The item workloads that I can use to copy data are Dataflow Gen1, Dataflow Gen2 and Pipelines.

Read on for the results, as well as one caveat about them.

Comments closed

Fabric Benchmarking: Moving CSV Files

Published 2024-12-17 by Kevin Feasel

Eugene Meidinger breaks out the abacus:

First, a disclaimer: I am not a data engineer, and I have never worked with Fabric in a professional capacity. With the announcement of Fabric SQL DBs, there’s been some discussion on whether they are better for Power BI import than Lakehouses. I was hoping to do some tests, but along the way I ended up on an extensive Yak Shaving expedition.

I have likely done some of these tests inefficiently. I have posted as much detail and source code as I can and if there is a better way for any of these, I’m happy to redo the tests and update the results.

Part one focuses on loading CSV files to the files portion of a lakehouse. Future benchmarks look at CSV to delta and PBI imports.

I think Eugene did a fine job documenting everything in the process, and it was interesting to see relative price differences between different techniques for uploading a very large CSV file.

Comments closed

The Showdown: Spark vs DuckDB vs Polars in Microsoft Fabric

Published 2024-12-13 by Kevin Feasel

Miles Cole puts together a benchmark:

There’s been a lot of excitement lately about single-machine compute engines like DuckDB and Polars. With the recent release of pure Python Notebooks in Microsoft Fabric, the excitement about these lightweight native engines has risen to a new high. Out with Spark and in with the new and cool animal-themed engines— is it time to finally migrate your small and medium workloads off of Spark?

Before writing this blog post, honestly, I couldn’t have answered with anything besides a gut feeling largely based on having a confirmation bias towards Spark. With recent folks in the community posting their own benchmarks highlighting the power of these lightweight engines, I felt it was finally time to pull up my sleeves and explore whether or not I should abandon everything I know and become a DuckDB and/or Polars convert.

Read on for the method and results from several thoughtful tests.

Comments closed

Ways to Land Data into Microsoft Fabric OneLake

Published 2024-12-13 by Kevin Feasel

James Serra puts on a cape and takes on an iconic laugh:

Microsoft Fabric is rapidly gaining popularity as a unified data platform, leveraging OneLake as its central data storage hub for all Fabric-integrated products. A variety of tools and methods are available for copying data into OneLake, catering to diverse data ingestion needs. Below is an overview of what I believe are the key options:

Read on for a baker’s dozen methods.

Comments closed

A Gentle Introduction to dbt

Published 2024-12-12 by Kevin Feasel

Koen Verbeeck makes a change:

In a typical data warehouse project, data is loaded, transformed and stored into a data store using an ETL (extract – transform – load) process. Recently, there has been a shift to ELT processing, where data is first loaded into a data store (this can be a database, but also a data lake or something similar) and then transformed in-place.

Over the past years, the tool dbt – short for data build tool – has become quite popular in the data engineering world for handling such an ELT process. dbt takes on the role of the “T”, meaning it’s responsible for transforming the data in a certain data store. dbt is not meant for transferring data from one store to another, you’ll need another tool for this, such as Azure Data Factory for example.

Read on to see how it works.

Comments closed

Data Transformation with Dataflows Gen2

Published 2024-12-10 by Kevin Feasel

Boniface Muchendu provides an overview of Dataflows Gen2 in Microsoft Fabric:

Welcome to a journey into the world of data automation! Imagine working in an organization bustling with data scientists and analysts. In such an environment, you often need to gather and combine data from various sources for further analysis. You could do this manually, but why not leverage automation? In this blog, we’ll explore how to apply automation on data transformations using Dataflows Gen2 in Microsoft Fabric.

Admitting that I am not the primary audience for Dataflows Gen2, I’d still much rather write a Spark notebook and call it a day.

Comments closed

Category: ETL / ELT