Press "Enter" to skip to content

Day: February 27, 2026

A Primer on Apache Iceberg

Brendan Tierney provides an introduction to Apache Iceberg:

Modern data platforms increasingly separate compute from storage, using object stores as durable data lakes while scaling processing engines. Traditional “data lakes” built on Parquet files and Hive-style partitioning have limitations around atomicity, schema evolution, metadata scalability, and multi-engine interoperability. Apache Iceberg addresses these challenges by defining a high-performance table format with transactional guarantees, scalable metadata structures, and engine-agnostic semantics.

Apache Iceberg, an open-source table format that has become the industry standard for data sharing in modern data architectures. Let’s have a look at some of the key features, some of its limitations and a brief look at some of the alternatives.

Brendan explains where Iceberg fits in relation to data formats (e.g., Parquet, ORC, and Avro), as well as competitors like Delta Lake and Hudi.

Leave a Comment

JSONB Data in Postgres and Performance Due to TOAST

Paul Ramsey lays out the facts and the data:

Working with APIs and arrays in the jsonb type has become increasingly popular recently, and storing pieces of application data using jsonb has become a common design pattern.

But why shred a JSON object into rows and columns and then rehydrate it later to send it back to the client?

The answer is efficiency. Postgres is most efficient when working with rows and columns, and hiding data structure inside JSON makes it difficult for the engine to go as fast as it might.

Read on to learn how Postgres manages to store arbitrary-sized JSONB data within the limitations of 8KB pages, and the performance implications of doing so.

Leave a Comment

Pain Points around Direct Lake

Teo Lachev describes a pair of problems:

I’m helping an enterprise client modernize their data analytics estate. As a part of this exercise, a SSAS Multidimensional financial cube must be converted to a Power BI semantic model. The challenge is that business users ask for almost real-time BI during the forecasting period, where a change in the source forecasting system must be quickly propagated to the reporting the layer, so the users don’t sit around waiting to analyze the impact. An important part of this architecture is the Fabric Direct Lake storage to eliminate the refresh latency, but it came up with a couple of gotchas.

Click through for those two problems.

Leave a Comment

An Overview of the Fabric Native Execution Engine

Ankita Victor-Levi introduces a new processing model:

In today’s data landscape, as organizations scale their analytical workloads, the demand for faster, more cost-efficient computation continues to rise. Apache Spark has long been the backbone of largescale data processing with its in‑memory processing and powerful APIs, but today’s workloads demand even better performance.

Microsoft Fabric addresses this challenge with the Native Execution Engine—a vectorized, C++ powered execution layer that accelerates Spark jobs with no code changesreduced runtime, and at no additional compute cost. This blog post will take you behind the scenes to give an overview of how the engine works and how it delivers performance gains while preserving the familiar Spark developer experience users already know and love.

Read on to learn more about its capabilities and current limitations.

Leave a Comment

Building Power BI Reports from the Desktop or Fabric

James Serra clears up some confusion:

If you’re a Power BI report author who’s just getting into Microsoft Fabric, you’ve probably asked the same question I hear over and over: am I supposed to stop using Power BI Desktop now?

It’s a fair question. Power BI Desktop is a Windows app that has traditionally been the place where report authors do everything: get data, transform it, model it, and build the report. Microsoft even describes that “connect, shape/transform, then load” experience as part of how Power BI Desktop works with Power Query.

Fabric changes the feel of that workflow because Power BI is now also a first-class experience in the browser inside the Fabric portal. And that browser experience isn’t just “view and share” anymore. You can edit semantic models in the service, including using Power Query for import models and building reports directly from that same environment.

Read on to see, for a brand new report, which of the two models can make the most sense.

Leave a Comment

Connection Pooling in PostgreSQL vs SQL Server

Haripriya Naidu compares two systems:

If you speak SQL Server as your first language, then you might be aware that connections are thread-based by design. That means each session/connection in SQL Server gets a worker thread. That thread is tied to that session from start to finish of execution.
If there are no available threads, new connections wait in queue until threads become available. This is called a thread-based model.

Postgres is different, it uses a process-based model. Every single connection spawns a separate backend OS process and each of it consumes RAM (>5MB per connection).

It’s interesting that the RDBMS that really “needs” connection pooling doesn’t have it built in, whereas the one that doesn’t “need” connection pooling (but can still benefit greatly from it) does.

Leave a Comment

Combining UNION and UNION ALL

Greg Low crosses the streams:

Until the other day though, I’d never stopped to think about what happens when you mix the two operations. I certainly wouldn’t write code like that myself but for example, without running the code (or reading further ahead yet), what would you expect the output of the following command to be? (Note: The real code read rows from a table but I’ve mocked it up with a VALUES clause to make it easier to see the outcome).

Read on to see what happens.

Leave a Comment