Press "Enter" to skip to content

Curated SQL Posts

The Consequences of Hitting Semantic Model Guardrails

Chris Webb smashes into a wall:

Direct Lake mode in Power BI allows you to build semantic models on very large volumes of data, but because it is still an in-memory database engine, there are limits on how much data it can work with. As a result, it has rules, called guardrails, that it uses to check whether you are trying to build a semantic model that is too large. But what happens when you hit those guardrails? This week one of my colleagues, Gaurav Agarwal, showed me the results of some tests he ran, which I thought I would share here.

Click through to see what happens when you go past one of those guardrails.

Load Testing SQL Server with HammerDB and Docker

Anthony Nocentino announces a new tool:

I’m excited to announce the release of a new open-source project that fully automates HammerDB benchmarking for SQL Server using Docker. If you’ve ever needed to run TPC-C or TPC-H benchmarks multiple times, you know how time-consuming the manual setup can be. This project removes the hassle and gets you up and running with a single command: ./loadtest.sh.

Click through to learn more about the project and how you can grab the code.

Batching Large Data Operations via Key Ranges

Andy Brownsword updates or deletes a batch of rows:

Effective batching in general helps us by:

  • Reducing transaction length and minimising blocking
  • Avoiding unnecessary re-checking of the same rows
  • Introducing graceful pacing to reduce impact on busy environments or data replication

I’m not the biggest fan of the OFFSET/FETCH combination there, at least if your key column is fairly well packed—like, say, 99+% of the rows are contiguous and you occasionally have a jump of a few thousand rows. Also, that batch size of 100K might be a little high, although that will certainly depend on what the operation is. Batch updating a column based on some fairly straightforward calculation? You can probably get away with 100K, though I’d still prefer 10K. But as you add more complexities (deleting rows, very high server throughput, triggers, limited hardware, etc.), that number should edge downward.
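
To make the key-range idea concrete, here is a minimal sketch of a batched delete driven from Python via pyodbc. The table dbo.Orders, its integer key OrderID, the IsArchived filter, and the DSN are all hypothetical:

```python
import time
import pyodbc

# autocommit keeps each DELETE in its own short transaction
conn = pyodbc.connect("DSN=MyServer", autocommit=True)  # hypothetical DSN
cur = conn.cursor()

BATCH = 10_000  # smaller than 100K, per the commentary above

lo = cur.execute("SELECT MIN(OrderID) FROM dbo.Orders WHERE IsArchived = 1").fetchval()
hi = cur.execute("SELECT MAX(OrderID) FROM dbo.Orders WHERE IsArchived = 1").fetchval()

while lo is not None and lo <= hi:
    # The range predicate seeks on the key, so each batch touches only
    # its own slice of rows instead of re-scanning rows already processed.
    cur.execute(
        "DELETE FROM dbo.Orders "
        "WHERE OrderID >= ? AND OrderID < ? AND IsArchived = 1",
        lo, lo + BATCH,
    )
    lo += BATCH
    time.sleep(0.1)  # graceful pacing for busy servers and replication
```

Because the loop walks key ranges rather than using OFFSET/FETCH, sparse gaps in the key simply make some batches smaller; they don't force the engine to count past rows it has already skipped.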

What’s New in Apache Kafka 4.1.0

Mickael Maison lays out some changes:

The Apache Kafka community is proud to announce the release of Apache Kafka® 4.1.0. This blog post highlights the many new features and improvements included in this release. For a full list of changes, be sure to check the release notes.

Queues for Kafka (KIP-932) is now in preview. It’s still not ready for production, but you can start evaluating and testing it. See the preview release notes for more details.

This release also introduces a new Streams Rebalance Protocol (KIP-1071) in early access. It is based on the new consumer group protocol (KIP-848).

Read on for another 15 or so completed items.

A Primer on Markdown

Mike Robbins introduces Markdown:

Markdown is the standard for writing technical documentation at Microsoft and many other organizations. Its simplicity, readability, and compatibility with other tools make it an ideal choice for blogging, documenting software, procedures, APIs, and more. Whether you’re authoring a user guide, README, or knowledge base article, Markdown enables you to focus on content without getting bogged down in formatting.

As a technical writer, you’re expected to deliver clear, maintainable documentation that works across platforms. Markdown helps you do exactly that, with minimal friction.

The biggest challenge I experience with Markdown is figuring out what’s actually supported in any given implementation. Most of the basics are the same everywhere, but as soon as you get into things like nested lists and images, support varies significantly.
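
As a small illustration (not tied to any particular parser), nested lists are a classic spot where flavors disagree:

```markdown
1. An ordered item
   - A nested bullet indented three spaces: fine in CommonMark,
     but the original Markdown.pl expected four
2. Another ordered item
```

When a document has to render in several tools, it's worth testing exactly these constructs in each one.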

Defending Kubernetes

Joey D’Antoni defends the defensible:

I’ve seen a couple of posts (of course they were chock full of AI slop images) on LinkedIn in the last couple of weeks talking about how challenging it is to implement Kubernetes. The most recent post I saw stated that “it took 5 months for our CEO to implement Kubernetes for our app,” to which I would ask: why the hell is your CEO configuring your clusters? I designed and implemented the Kubernetes infrastructure on my current project, which I’ve worked on for a while, so of course I felt the need to share my opinions on the matter.

As far as Kubernetes on-premises goes, there are perfectly valid reasons to run it there. Yeah, it’s easier to host in AKS or EKS, but that’s not always possible. Regardless of whether you’re hosting on-prem or in a cloud provider, Kubernetes requires solid knowledge across several areas, including networking, storage administration, systems administration, and CI/CD, not to mention the development skills needed for containerization.

I think Joey downplays the skill level required, but I don’t want to err in the opposite direction by overstating the challenge. Still, if you want anything beyond a bog-standard deployment of AKS/EKS, the “You must be this tall to ride the ride” sign is significantly higher than for other containerized solutions like Azure App Service/Container Apps or Elastic Container Service.

Building Storage Tiers with Pure Storage in PowerShell

Anthony Nocentino creates a medallion storage layout:

In modern IT environments, not all workloads require the same level of storage performance, protection, or cost. Some applications need high performance with aggressive data protection, while others are perfectly fine with lower performance in exchange for cost savings. This tiered approach to storage service delivery is fundamental to efficient infrastructure management.

In my previous post on Fusion, I took an application-centric approach, showing how to deploy SQL Servers using Fusion. Let’s switch gears now and learn how to define a storage service catalog. In this post, I’ll demonstrate how to build a complete storage service catalog using Pure Storage Fusion Presets, offering Bronze, Silver, and Gold tiers with optional replication. We’ll see how to leverage different array types (FlashArray //X and FlashArray //C) to optimize both performance and cost across your fleet.

Read on for a link to the code, as well as more information on how it works.

Join Strategies in Apache Spark

Ram Ghadiyaram looks at three join strategies in Apache Spark:

In this article, we are going to discuss three essential join strategies in Apache Spark.

Joining data frames or tables is one of the most common data transformations in Apache Spark. A developer can use joins to merge two or more data frames according to specific (sortable) keys. Writing a join operation has a straightforward syntax, but occasionally the inner workings are obscured: Spark’s internal API has several join algorithms available and selects one for each join. A basic join operation can become costly if you do not know what these core algorithms are or which one Spark uses.

This is not a comprehensive list, but it does cover three of the more common strategies when dealing with larger datasets.
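
As a rough sketch of how you can see these strategies in action from PySpark (the data frame names are hypothetical), the physical plan reveals which algorithm Spark picked:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-strategies").getOrCreate()

facts = spark.range(10_000_000).withColumnRenamed("id", "k")
dim = spark.range(1_000).withColumnRenamed("id", "k")

# Broadcast hash join: ship the small side to every executor,
# avoiding a shuffle of the large side.
facts.join(broadcast(dim), "k").explain()

# Merge hint: nudge Spark toward a shuffle-based sort-merge join
# on the sortable key instead.
facts.join(dim.hint("merge"), "k").explain()
```

Look for BroadcastHashJoin versus SortMergeJoin in the explain output; with no hints at all, Spark’s optimizer makes the choice for you based on size estimates such as spark.sql.autoBroadcastJoinThreshold.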

What’s New in Microsoft Fabric Data Warehouse

Sowmya Sivaraman has an update:

Welcome to the August 2025 edition of What’s New in Fabric Warehouse. Although August is typically a slower month as summer winds down, our team continued to deliver meaningful updates. We shipped several new features focused on enhancing data ingestion, improving data management, and streamlining security. At the same time, much of our energy is going into preparing exciting announcements for FabCon Vienna, so stay tuned for what’s coming next. Whether you’re optimizing workloads, building with SQL, or exploring new integrations, this roundup highlights improvements we think you’ll find valuable.

Click through for a list of changes.
