Press "Enter" to skip to content

Author: Kevin Feasel

Microsoft Fabric Spark Connector for SQL Databases

Arshad Ali makes an announcement:

Fabric Spark connector for SQL databases (Azure SQL databases, Azure SQL Managed Instances, Fabric SQL databases and SQL Server in Azure VM) in the Fabric Spark runtime is now available. This connector enables Spark developers and data scientists to access and work with data from SQL database engines using a simplified Spark API. The connector will be included as a default library within the Fabric Runtime, eliminating the need for separate installation.

This is a preview feature and works with Scala and Python code against SQL Server-ish databases in Azure (Azure SQL DB, Azure SQL Managed Instance, and virtual machines running SQL Server in Azure).

Leave a Comment

Value Filter Behavior and SUMMARIZECOLUMNS in DAX

Alberto Ferrari and Marco Russo provide an introduction:

Value filter behavior controls the SUMMARIZECOLUMNS behavior that changes how filters are applied to the measure evaluation. This is mostly relevant when developers use the filter arguments in SUMMARIZECOLUMNS.

The topic is very broad, and we will just be able to scratch the surface here. However, the good news is that most developers do not need to learn the intricacies of value filter behavior. This property has three settings: Automatic, Independent, and Coalesced. The safest and most correct setting is the one introduced in 2025: Independent. Coalesced was the default setting before 2025, whereas Automatic retains Coalesced for older models, and it sets Independent for new models.

Read on to learn more about these behaviors and what they mean for your queries.

Leave a Comment

Lessons Learned from Replicating BigQuery to Microsoft Fabric

Teo Lachev shares some knowledge:

A recent engagement required replicating some DW tables from Google BigQuery to a Fabric Lakehouse. We considered the Fabric mirroring feature (back then in private preview, now publicly available) and learned some lessons along the way:

1. 400 Error during replication configuration – Caused by attempting to use a read-only GBQ dataset that is linked to another GBQ dataset but the link was broken.

Read on for additional tips, including a major one around permissions.

Leave a Comment

Simulating the Monty Hall Problem in R

Jason Bryer takes us through a classic introductory problem to Bayesian statistics:

I find that when teaching statistics (and probability) it is often helpful to simulate data first in order to get an understanding of the problem. The Monty Hall problem recently came up in a class so I implemented a function to play the game.

The Monty Hall problem results from a game show, Let’s Make a Deal, hosted by Monty Hall. In this game, the player picks one of three doors. Behind one is a car, the other two are goats. After picking a door the player is shown the contents of one of the other two doors, which because the host knows the contents, is a goat. The question to the player: Do you switch your choice?

This is one of the biggest “aha!” moments in statistics, in the sense that it is not intuitively obvious and is easy to get wrong, but once you understand why it is true, it makes reasoning over time and knowledge changes easier. H/T R-Bloggers.

Leave a Comment

Accounting for Index Usage over Availability Groups

Aaron Bertrand wants a cross-replica accounting system:

As a part of optimizing performance, I evaluate index usage across many instances and databases. I often find that some indexes aren’t used much or, at least on first glance, appear to go unused. Since we use availability groups (AG), different workloads run against replicas in different roles. All writes happen on the primary, obviously. However, some queries only happen on read-only secondaries (either because read-only routing is in use, or because some processes are manually directed at specific secondaries, or both). Unfortunately, index usage is not rolled up anywhere across all replicas. This means that looking at the primary alone gives an incomplete picture. How do I make sure I account for index activity everywhere, not just on the primary?

This is part one of a series but already has enough to get us started.

Leave a Comment

Postgres Migration via Logical Replication

Elizabeth Christensen makes a move:

Moving a Postgres database isn’t a small task. Typically for Postgres users this is one of the biggest projects you’ll undertake. If you’re migrating for a new Postgres major version or moving to an entirely new platform or host, you have a couple options:

Read on for those three options, when logical replication can possibly work, and the process for upgrade. It’s definitely a bit more fiddly than other options, but it’s hard to beat with respect to downtime.

Leave a Comment

Updates to Fabric Data Factory

Abhishek Narain has a list of updates:

Workspace Private Link Support for Data Factory (Preview): Microsoft Fabric enables secure data integration through Private Link support in Dataflows Gen2, Pipelines, and Copy jobs. This ensures that inbound data access remains isolated and compliant within protected workspaces. By leveraging VNet data gateways, organizations can securely connect to data sources across Private Link-enabled environments—eliminating exposure to public networks and reinforcing enterprise-grade security for sensitive data operations.

Most of these are security-related updates, with a mixture of things now GA, things currently in preview, and a pair of items coming soon.

Leave a Comment

Error Handling in PySpark Jobs

Ram Ghadiyaram adds some error handling logic:

In PySpark, processing massive datasets across distributed clusters is powerful but comes with challenges. A single bad record, missing file, or network glitch can crash an entire job, wasting compute resources and leaving you with stack traces that have many lines. 

Spark’s lazy evaluation, where transformations don’t execute until an action is triggered, makes errors harder to catch early, and debugging them can feel like very, very difficult.

Read on for five patterns that can help with error handling in PySpark.

Leave a Comment

Performance of User-Defined Functions in Fabric Warehouses

Jared Westover shares some findings:

In Part One, we saw that simple scalar user-defined functions (UDFs) perform as well as inline code in a Fabric warehouse. But with a more complex UDF, does performance change? If it drops, is the code-reuse convenience worth the price?

I’m surprised that the performance profile was so good. I had assumed it would perform like T-SQL user-defined functions—namely, worse in general.

Leave a Comment