Month: September 2025

Comparing Spark Application Performance in Microsoft Fabric

Jenny Jiang announces a new capability:

The Spark Applications Comparison feature is now in preview in Microsoft Fabric. This new capability empowers developers and data engineers to analyze, debug, and optimize Spark performance across multiple application runs—whether you’re tracking changes from code updates or data variations to improve performance.

The image in the blog post is pretty small and hard to read, but I do wonder if (or how well) it will capture cases where you’re twiddling your thumbs waiting for a machine to become available so that you can execute your code. That wait can be a big problem at times.

Contrasting Microsoft Fabric, Databricks, and Snowflake

Ron L’Esteve builds a comparison chart:

Databricks and Microsoft Fabric are two of the most innovative Unified Data and Analytics intelligence platforms available on the market today. While similar, each brings their own advantages and limitations. Snowflake joins these two powerhouses when data warehouse decisioning comes into play. Sometimes it is challenging to decide which one to pick for your organization’s needs. This tip will help with uncovering when to choose Databricks vs Fabric vs Snowflake.

When it comes to Spark performance, Databricks is always going to win—they keep most of their optimizations to themselves, so anyone starting from open-source Spark is at a disadvantage. Otherwise, it’s a bit of a slugfest between Fabric and Databricks. At the end, Ron also brings in Snowflake, focusing on the data warehousing side of things for that three-way comparison. I don’t think there’s a clear winner among the three, and on net, that’s probably a good thing, as it forces the groups to continue competing.

More Types of Window Functions in SQL Server

I continue a series on window functions:

In this video, I continue a dive into each category of window function, quickly reviewing the four categories of window function (plus ordered set functions). Then, I cover offset window functions, statistical window functions, and ordered set functions.

This video includes some of the window functions I use most often (LAG(), LEAD()), some of the window functions I use least often, and even a few ordered set functions to boot. Combined, it’s about 45 minutes of content between this video and the prior one.
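For anyone who hasn’t seen offset window functions in action, here is a minimal sketch of LAG() and LEAD(). It is not taken from the video: the daily_sales table and its values are made up, and it runs against an in-memory SQLite database (via Python’s sqlite3 module, assuming a bundled SQLite of 3.25 or later) as a stand-in for SQL Server. The basic LAG/LEAD syntax shown here is the same in T-SQL.

```python
import sqlite3

# Hypothetical sample data: one row per day with a sales amount.
rows = [("2025-09-01", 100), ("2025-09-02", 120), ("2025-09-03", 90)]

# SQLite stands in for SQL Server here; this is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, amount INTEGER)")
conn.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)

# LAG looks one row back and LEAD looks one row ahead, based on the
# ordering in the OVER clause. Rows with no neighbor get NULL.
query = """
SELECT
    sale_date,
    amount,
    LAG(amount)  OVER (ORDER BY sale_date) AS prior_amount,
    LEAD(amount) OVER (ORDER BY sale_date) AS next_amount
FROM daily_sales
ORDER BY sale_date;
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The first row has no prior value and the last row has no next value, so those come back as NULL (None in Python).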

Grouping Options in T-SQL

Chad Callihan rolls up the data:

When learning T-SQL, I’d wager that learning GROUP BY comes up early in the process. What may not be mentioned are the variations that can be added to a GROUP BY clause. Are you familiar with GROUP BY GROUPING SETS, GROUP BY ROLLUP, and GROUP BY CUBE? If you’ve never seen these used, or if you have and want a refresher, read on as we look at an example of each.

Of the three, CUBE is the one that I’ve used the least. I’ve found good instances where ROLLUP gives me exactly what I want for reporting purposes, and GROUPING SETS is powerful enough that I’ve made use of it a fair number of times. But CUBE just returns too many combinations for what I’ve needed.
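To put a number on “too many combinations,” here is a small sketch in Python (not T-SQL, and not from Chad’s post) that enumerates the grouping sets each option implies. The column names are hypothetical. ROLLUP over n columns yields the n + 1 prefixes of the column list, while CUBE yields all 2^n subsets.

```python
from itertools import combinations


def rollup_sets(columns):
    """Grouping sets implied by ROLLUP: every prefix of the column list,
    from the full list down to the empty set (the grand total)."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]


def cube_sets(columns):
    """Grouping sets implied by CUBE: every subset of the column list,
    including the empty set (the grand total)."""
    return [
        combo
        for size in range(len(columns), -1, -1)
        for combo in combinations(columns, size)
    ]


# Hypothetical grouping columns for illustration.
columns = ["region", "product", "year"]

rollup = rollup_sets(columns)
cube = cube_sets(columns)

print(len(rollup), "grouping sets for ROLLUP:", rollup)
print(len(cube), "grouping sets for CUBE:", cube)
```

With three columns that is 4 grouping sets for ROLLUP versus 8 for CUBE, and the gap doubles with each added column. GROUPING SETS lets you name just the ones you want.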

Bitmap Indexes and Deadlocks in Oracle

David Fitzjarrell looks at bitmap indexes:

Bitmap indexes can be very useful, especially when NULL columns are present, as a bitmap index will include such values when btree indexes may not, such as entirely null index keys. Unfortunately bitmap indexes do not behave well with concurrent transactions, where deadlocks may arise because of the bitmap index.

Oracle will trap, report and “resolve” deadlocks by assessing the situation, determining which session created the deadlock and killing the ‘offending’ session, with no manual intervention required. The trace file generated reports this as an issue with application coding and/or logic and in many cases this is the likely cause. Enter the bitmap index and a concurrent transaction and, mysteriously, a deadlock may appear, confounding the developer and the DBA.

Read on to learn more about how bitmap indexes can provide a (potentially) strange source of deadlocks.

RAISERROR vs THROW

Andy Brownsword looks at the two ways to bubble up an error in SQL Server:

I don’t use RAISERROR often – I usually forget which severity code to use. After looking at a sprinkling of them recently I decided it was time for a refresher, so come along for the ride.

If you check out the online documentation it states that “New applications should use THROW instead”. It also sounds like it’s used to raise ‘RROR’s (whatever they are?). Neither is quite the whole story though. Let’s get into it.

My general rule of thumb is that I tend to use THROW most of the time, but RAISERROR in loops so that I can print out how far along in the process something is, as there is no WITH NOWAIT equivalent for THROW. Andy mentions using THROW; without additional parameters, and that’s very helpful when you want to maintain the original error message rather than wrapping your own around it. It’s not quite as useful as a re-throw in a language like C#, where you keep stack trace information, but it helps with troubleshooting.

As for not doubling the letter if it is the last letter of the first word and first letter of the second word (raise error or help protect), it was the fashion at the time, like wearing a yellow onion on your belt. I suppose the intent was to prevent typos or make it look slightly better, but I’ve never been a fan.

Prereqs for using Power BI’s Analyze in Excel Capability

Nicky van Vroenhoven lays out the rules:

I think I now got this question 4 times in the last months, so I thought I’d write it down so I can reference it later, and point people to it.

What are the requirements so (a group of) colleagues can start using Analyze in Excel?

Good question, let me break it down. 
In general, I think it’s also better to use Analyze in Excel than Export to Excel!
Reza Rad also wrote about why that’s important earlier.

Click through for the list of prerequisites and a few things to keep in mind.

Cross-Validation and Time Series Data

Vlad Johnson takes us through a technique to test time series results:

Time series modeling, compared to traditional nontemporal modeling, presents unique challenges in ensuring that models generalize well to future, unseen data. One key methodology to address these challenges is cross-validation.

Time series data inherently contains temporal dependencies — observations are ordered in time, and future values may depend on past trends. This structure makes it challenging to estimate how well a model will perform on new, unseen data.

Click through for an explanation of cross-validation, why this becomes challenging when you have time series data (or other serially correlated data), and tips to resolve this challenge.
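As a rough illustration of why ordinary shuffled k-fold doesn’t fit here, below is a minimal sketch of one common alternative: expanding-window (rolling-origin) splits, where each test block comes strictly after its training data. I’m not claiming this is the exact technique Vlad lands on. The function name and parameters are my own for illustration.

```python
def expanding_window_splits(n_obs, n_folds, test_size):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Each fold trains on everything before its test block, so the model
    never sees observations from the future of what it is tested on.
    """
    first_test_start = n_obs - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough observations for the requested folds")

    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


# Hypothetical example: 10 time-ordered observations, 3 folds of 2 each.
splits = list(expanding_window_splits(n_obs=10, n_folds=3, test_size=2))

for train, test in splits:
    print("train:", train, "test:", test)
```

The training window grows with each fold while the test block slides forward, which mimics how the model would actually be used: fit on the past, predict the next stretch.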

Natural Language Querying in SQL Server

Hadi Fadlallah shells out to an API:

Data is usually the most important asset in organizations, but only SQL developers can frequently access that data. Technical teams often write queries for non-technical users. This restricts agility, slows decision-making, and creates a bottleneck in data accessibility. One possible remedy is natural language processing (NLP), which enables users to ask questions in simple English and receive answers without knowing any code. Still, the majority of NLP-to-SQL solutions are cloud-based, which raises issues with cost and privacy.

This particular solution has nothing to do with the embedding features in SQL Server 2025. Instead, it essentially shells out to an Ollama API and runs the resulting SQL query. It’s reasonably neat, but I’d have so many qualms about putting anything like this into production.
