ETL / ELT – Page 2 – Curated SQL

Shortcut Caching in Microsoft Fabric now GA

Published 2025-05-12 by Kevin Feasel

Trevor Olson announces a feature has become generally available:

Shortcuts in OneLake allow you to quickly and easily source data from external cloud providers and use it across all Fabric workloads such as Power BI reports, SQL, Spark and Kusto. However, each time these workloads read data from cross-cloud sources, the source provider (AWS, GCP) charges additional egress fees on the data. Thankfully, shortcut caching allows the data to only be sourced once and then used across all Fabric workloads without additional egress fees.

This is useful for data that hardly ever changes, and Trevor also shows how you can control the cache length and reset the cache. In addition, the on-premises gateway for shortcuts is now generally available, so you can take shortcuts of certain on-prem file systems.
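To make the egress argument concrete, here is a minimal arithmetic sketch. The function name, the data size, the number of reads, and the price per GB are all hypothetical and are not taken from Trevor's post or from any provider's price list. The model also ignores cache expiry, so it assumes every read falls inside the cache retention window.

```python
def egress_cost_cents(reads: int, size_gb: int, price_cents_per_gb: int, cached: bool) -> int:
    """Estimate cross-cloud egress charges, in cents, for repeated reads of one dataset.

    Without caching, every read pulls the data from the source provider again.
    With caching, only the first read is billed by the source provider
    (assuming the cache does not expire between reads).
    """
    billable_reads = min(reads, 1) if cached else reads
    return billable_reads * size_gb * price_cents_per_gb


# Hypothetical workload: 50 reads of a 10 GB dataset at 9 cents per GB.
without_cache = egress_cost_cents(reads=50, size_gb=10, price_cents_per_gb=9, cached=False)
with_cache = egress_cost_cents(reads=50, size_gb=10, price_cents_per_gb=9, cached=True)
print(f"without cache: {without_cache} cents, with cache: {with_cache} cents")
```

The gap grows linearly with the number of reads, which is why the feature pays off most for data that is read often and rarely changes.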


Kafka Data Exploration with Tableflow

Published 2025-04-29 by Kevin Feasel

Robin Moffatt does some exploratory data analysis:

One of the challenges that I’d always had when it came to building streaming data pipelines is that once data is in a Kafka topic, it becomes trickier to query. Whether limited by the available tools to do this or the speed of access, querying Kafka is just not a smooth experience.

This blog post will show you a really nice way of exploring and validating data in Apache Kafka®. We’ll use Tableflow to expose the Kafka topics as Apache Iceberg™️ tables and then query them using standard SQL tools.

Click through for the demonstration using a real dataset.


Troubleshooting a Slow Mapping Data Flow in Azure Synapse Analytics

Published 2025-04-21 by Kevin Feasel

Reitse Eskens has the need for speed:

The issue was quite straightforward. The client has a mapping data flow in Synapse that processes a few hundred to a few thousand rows but takes 15 minutes to complete. The low number of rows compared to the time necessary is a cause for concern.

The data extraction needs a staging storage account where the data is written into TXT files. The second step of the mapping data flow reads the TXT files and writes them out in delta format, which is Parquet files.

The source is an S4Hana CDC table, and the target is a regular Azure storage account.

Read on for Reitse’s summarization of the troubleshooting and testing process, as well as what ended up working for this customer.


400 Bad Request when Debugging a Data Factory Pipeline

Published 2025-04-07 by Kevin Feasel

Koen Verbeeck runs into a problem:

I recently had a new pipeline fail. It was actually a copy of an old pipeline where I had made some adjustments as part of a database migration. When triggered during an execution run, it failed saying some expression could not be parsed. When I went into the pipeline and triggered a debug, it immediately failed with the following helpful error message:

Click through for the error message and how Koen was able to fix the issue.


Calling a Microsoft Fabric REST API via Azure Data Factory

Published 2025-04-02 by Kevin Feasel

Koen Verbeeck makes the call:

Suppose you want to call a certain Microsoft Fabric REST API endpoint from Azure Data Factory (or Synapse Pipelines). This can be done using a Web Activity, and most Fabric APIs now support service principals or managed identities. Let’s illustrate with an example. I’m going to call the REST API endpoint to create a new lakehouse.

Click through for the instructions.
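As a rough illustration of the HTTP call that such a Web Activity would issue, the sketch below builds (but does not send) the request in Python. The base URL, the `/workspaces/{workspaceId}/lakehouses` path, and the `displayName` body field are assumptions based on the public Fabric REST API and should be checked against the current documentation. The function name, workspace ID, and token are placeholders, and this is not Koen's pipeline definition.

```python
import json
import urllib.request

# Assumed base URL for the Fabric REST API; verify against the official docs.
FABRIC_API_BASE = "https://api.fabric.microsoft.com/v1"


def build_create_lakehouse_request(
    workspace_id: str, display_name: str, token: str
) -> urllib.request.Request:
    """Build a POST request asking Fabric to create a lakehouse (nothing is sent)."""
    url = f"{FABRIC_API_BASE}/workspaces/{workspace_id}/lakehouses"
    body = json.dumps({"displayName": display_name}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_create_lakehouse_request(
    workspace_id="00000000-0000-0000-0000-000000000000",  # placeholder workspace ID
    display_name="demo_lakehouse",                         # hypothetical lakehouse name
    token="<access-token>",                                # placeholder, not a real token
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In a Web Activity, the same URL, method, and body would go into the activity settings, and the service principal or managed identity would supply the bearer token rather than your own code.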


The Power of TABLOCK for Bulk Insertion

Published 2025-03-25 by Kevin Feasel

Mehdi Ghapanvari explains why TABLOCK can be useful for bulk inserts:

Does bulk insert performance improve when you use the TABLOCK hint? In some cases, YES! Let’s take a look at this in action to see how this hint could improve insert performance when using SQL bulk insert with TABLOCK.

Of course, standard TABLOCK concurrency rules (specifically, the lack of concurrency) apply.
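For reference, here is a minimal sketch of what a bulk insert with the hint looks like, written as a small Python helper that only generates the T-SQL text. The helper name, table name, file path, and terminator options are hypothetical and are not taken from Mehdi's post. The table name is interpolated as-is, so it should only come from trusted input.

```python
def build_bulk_insert(table: str, data_file: str, use_tablock: bool = True) -> str:
    """Return a T-SQL BULK INSERT statement, optionally including the TABLOCK hint."""
    # Hypothetical file layout: comma-separated values with one header row.
    options = ["FIELDTERMINATOR = ','", "ROWTERMINATOR = '\\n'", "FIRSTROW = 2"]
    if use_tablock:
        # TABLOCK takes a table-level lock for the duration of the load.
        options.insert(0, "TABLOCK")
    option_list = ", ".join(options)
    # Double any single quotes so the path is a valid T-SQL string literal.
    escaped_path = data_file.replace("'", "''")
    return f"BULK INSERT {table}\nFROM '{escaped_path}'\nWITH ({option_list});"


print(build_bulk_insert("dbo.SalesStaging", r"C:\data\sales.csv"))
```

Generating the statement with and without the hint makes it easy to time both variants against the same file and see whether the hint helps in your environment.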


Speeding up Dataflow Validation and Publish Times

Published 2025-03-24 by Kevin Feasel

Chris Webb doesn’t want to wait:

If you’re working with slow data sources in Power BI/Fabric dataflows then you’re probably aware that validating (for Gen1 dataflows) or publishing (for Gen2 dataflows) them can sometimes take a long time. If you’re working with very slow data sources then you may run into the 10 minute timeout on validation/publishing that is documented here. For a Gen1 dataflow you’ll see the following error message if you try to save your dataflow and validation takes more than 10 minutes:

Click through for that common error message, as well as some tips to avoid this issue. There was also an interesting approach in the comments section that circumvented the problem.


COPY and \COPY in PostgreSQL

Published 2025-03-20 by Kevin Feasel

Dave Stokes runs two commands:

PostgreSQL is equivalent to a Swiss Army Knife in the database world. There are things in PostgreSQL that are very simple to use, while in another database, they take many more steps to accomplish. But sometimes, the knife has too many blades, which can cause confusion. This is one of those cases.

Read on to understand what the difference is between these two commands.


Improving the Microsoft Fabric Copy Job

Published 2025-03-19 by Kevin Feasel

Krishnakumar Rukmangathan makes a copy:

Copy Job has been a go-to tool for simplified data ingestion in Microsoft Fabric, offering a seamless data movement experience from any source to any destination. Whether you need batch or incremental copying, it provides the flexibility to meet diverse data needs while maintaining a simple and intuitive workflow.

We continuously refine Copy Job based on customer feedback, enhancing both functionality and user experience. In this update, we’re introducing three key UX improvements designed to streamline your workflow and boost efficiency.

Read on for those three improvements.


Writing Data into a Microsoft Fabric Lakehouse via Notebook

Published 2025-03-12 by Kevin Feasel

Stepan Resl writes some code:

Since Lakehouse is one of the key items within Microsoft Fabric, it is important to know how to write data into it in various formats and using different tools. One of the most common tools is notebooks, as they provide great flexibility and speed for development and testing with graphical outputs. In this article, I want to focus primarily on the following types of notebooks:

PySpark

Python

Click through to see how it works in both notebook types.



Category: ETL / ELT