Month: May 2025

Combining DISTINCT and UNION

Published 2025-05-27 by Kevin Feasel

Louis Davidson gives it the college try:

When I was perusing my LinkedIn feed the other day, I came across this thread about using SELECT *. In one of the replies, Aaron Cutshall noted that: “Another real performance killer is SELECT DISTINCT especially when combined with UNION. I have a whole list of commonly used hidden performance killers!”

To which started my brain thinking… What does happen when you use these together? And when you use UNION on a set with non-distinct rows, what happens. So for the next few hours I started writing.

Read on for Louis’s findings.
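As a quick way to see the semantics in question, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine. The tables `a` and `b` and their contents are made up for illustration and are not taken from Louis's post. It only shows that UNION already removes duplicates, so adding DISTINCT to each side does not change the result. It says nothing about how SQL Server's optimizer treats the extra DISTINCT, which is the performance question Louis digs into.

```python
import sqlite3


def run(conn, sql):
    """Return the single-column result of a query as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


def demo():
    # Hypothetical tables with duplicate rows on both sides.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE a (val INTEGER);
        CREATE TABLE b (val INTEGER);
        INSERT INTO a VALUES (1), (1), (2);
        INSERT INTO b VALUES (2), (3), (3);
        """
    )
    results = {
        # UNION removes duplicates across the combined set.
        "union": run(conn, "SELECT val FROM a UNION SELECT val FROM b"),
        # DISTINCT on each side is logically redundant under UNION.
        "distinct_union": run(
            conn,
            "SELECT DISTINCT val FROM a UNION SELECT DISTINCT val FROM b",
        ),
        # UNION ALL keeps every row, duplicates included.
        "union_all": run(conn, "SELECT val FROM a UNION ALL SELECT val FROM b"),
    }
    conn.close()
    return results


RESULTS = demo()
for name, rows in RESULTS.items():
    print(name, rows)
```

The first two queries return the same three values, while UNION ALL returns all six rows.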

Comments closed

Trying out Microsoft Fabric Data Agents

Published 2025-05-27 by Kevin Feasel

Wolfgang Strasser gives a generative AI solution built into Microsoft Fabric a try:

Today, I wanted to give the new Fabric Data Agents a try. According to the documentation, a Fabric Data Agent is defined as follows:

Data agent in Microsoft Fabric is a new Microsoft Fabric feature that allows you to build your own conversational Q&A systems using generative AI. A Fabric data agent makes data insights more accessible and actionable for everyone in your organization. With a Fabric data agent, your team can have conversations, with plain English-language questions, about the data that your organization stored in Fabric OneLake and then receive relevant answers. This way, even people without technical expertise in AI or a deep understanding of the data structure can receive precise and context-rich answers.

Let’s give it a try and build our first Data Agent.

Click through for the pre-requisites, the setup process, and how everything looked for Wolfgang.


sqlcmd in SQL Server 2025 and Certificate Chain Not Trusted

Published 2025-05-27 by Kevin Feasel

Vlad Drumea points out a new thing to keep an eye on:

SQL Server 2025 provides ODBC sqlcmd version 17 which enforces an encrypted connection.

If you’re trying to use it to connect to instances that don’t have a CA-signed certificate or where TLS encryption was never properly configured, sqlcmd will throw the famous “certificate chain not trusted” error message:

Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : SSL Provider: The certificate chain was issued by an authority that is not trusted.
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection.

The proper answer to this is to get trusted certificates. The workaround is what Vlad describes, so click through for that.


Preventing Injection Attacks in Shiny

Published 2025-05-23 by Kevin Feasel

Arthur Breant shares some advice:

Code injection is a common security vulnerability that involves injecting malicious code into a page or application. This code is then executed, creating the security breach. There are several ways to inject code into an application, and Shiny is unfortunately not immune to these risks.

Click through for a quick overview of the three most common types of injection attack. There’s nothing special about Shiny here—any system that executes code based on user input is potentially vulnerable to injection attacks—so it is good to keep these tips in mind. H/T R-Bloggers.
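To illustrate the point that this is not specific to Shiny, here is a small sketch in Python rather than R. It uses SQL injection as one familiar instance of the problem and does not assume this matches the examples in Arthur's post. The `users` table, the function names, and the payload are all hypothetical.

```python
import sqlite3


def make_db():
    # Hypothetical table used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn, name):
    # Vulnerable: user input is pasted directly into the SQL text.
    sql = f"SELECT secret FROM users WHERE name = '{name}'"
    return sorted(row[0] for row in conn.execute(sql))


def lookup_safe(conn, name):
    # Safer: the driver binds the value, so it is never parsed as SQL.
    sql = "SELECT secret FROM users WHERE name = ?"
    return sorted(row[0] for row in conn.execute(sql, (name,)))


conn = make_db()
payload = "nobody' OR '1'='1"  # input crafted to alter the WHERE clause
print("unsafe:", lookup_unsafe(conn, payload))
print("safe:  ", lookup_safe(conn, payload))
```

With the crafted input, the concatenated query returns every secret in the table, while the parameterized query returns nothing because no user has that literal name.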


Building an ML-Friendly Data Lake with Apache Iceberg

Published 2025-05-23 by Kevin Feasel

Anant Kumar designs a data lake:

As companies collect massive amounts of data to fuel their artificial intelligence and machine learning initiatives, finding the right data architecture for storing, managing, and accessing such data is crucial. Traditional data storage practices are likely to fall short to meet the scale, variety, and velocity required by modern AI/ML workflows. Apache Iceberg steps in as a strong open-source table format to build solid and efficient data lakes for AI and ML.

Click through for a primer on Iceberg, how to set up a fairly simple data lake, and some functionality that can help in model training.


Loading JSON into a Microsoft Fabric Eventhouse

Published 2025-05-23 by Kevin Feasel

Christopher Schmidt loads some data:

In the era of big data, efficiently parsing and analyzing JSON data is critical for gaining actionable insights. Leveraging Kusto, a powerful query engine developed by Microsoft, enhances the efficiency of handling JSON data, making it simpler and faster to derive meaningful patterns and trends. Perhaps more importantly, Kusto’s ability to easily parse simple or nested JSON makes it easier than ever to extract meaningful insights from this data. The purpose of this blog post is to walk through ways that JSON data can be loaded into Eventhouse in Microsoft Fabric, where you can then leverage Kusto’s powerful capabilities for this. I’ve tried this a few different ways, and the below approach is the fastest, most efficient low-code way to ingest the data into the Eventhouse. As JSON inherently supports different schemas in a single file, the expectation here is that we have a JSON file with varying schemas within a single file, and we would like to load this into our Eventhouse for efficient parsing with KQL.

Read on for the process.


Interpreting V$ and GV$ Views in Oracle RAC

Published 2025-05-23 by Kevin Feasel

Kellyn Gorman continues a series on Oracle Real Application Clusters:

Furthering on our Oracle Real Application Clusters (RAC) knowledge, we’re going to go deeper into what we watch for a RAC database that may be different than a single instance. RAC is built for scale and instance resilience, distributing workloads across multiple nodes. At the same time, what gives it strength introduces monitoring complexity, especially when you’re not just watching a single instance but multiple, interconnected ones. To manage performance effectively in RAC, you need to understand the difference between V$ and GV$ views, what they show you, and how to interpret cluster-level wait events. Along with performance, the overall health of the RAC cluster and interconnect must be known, too.

Click through for Kellyn’s explanation.


Paste a List of Values into a Power BI Slicer

Published 2025-05-23 by Kevin Feasel

Dan English doesn’t want to click over and over:

Have you ever wanted to take a list of values from say an Excel spreadsheet and paste those into a Power BI slicer to filter the list? Like say you are only interested in a particular set of items, but the list of items is long and filtering through a list of say a thousand values can take a while. I bet you have and this has been an item that has been requested for a very long time going back to 2017!

Well believe it or not, I just found out this week during a meeting with the product team it has been released!!

Click through for the limitations, as well as a demo of how it works.


Optional Parameter Plan Optimization in SQL Server 2025

Published 2025-05-23 by Kevin Feasel

Brent Ozar is down with OPP(O):

SQL Server 2025 improved PSPO to handle multiple predicates that might have parameter sensitivity, and that’s great! I love it when Microsoft ships a v1 feature, and then gradually iterates over to make it better. Adaptive Memory Grants were a similar investment that got improved over time, and today they’re fantastic.

SQL Server 2025 introduces another feature to mitigate parameter sniffing problems: Optional Parameter Plan Optimization (OPPO). It ain’t perfect today – in fact, it’s pretty doggone limited, like PSPO was when it first shipped, but I have hopes that SQL Server vNext will make it actually usable. Let’s discuss what we’ve got today first.

Okay, I really had to stretch the truth to make my lead-in work, but I’m too proud of it to change anything. Click through to see where OPPO is today. Even with just one optional parameter working well, there is still a class of stored procedures that this can help: the “get by one ID, or get me all of them” type.
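For anyone who has not run into that type of procedure, here is a rough sketch of the query shape, again using Python's sqlite3 module as a stand-in. SQLite has nothing like OPPO, so this shows only the "one ID or all of them" pattern and not the feature itself. The `orders` table and the `get_orders` function are hypothetical. In T-SQL, the equivalent predicate would look like `WHERE (@CustomerID IS NULL OR CustomerID = @CustomerID)`.

```python
import sqlite3


def make_db():
    # Hypothetical orders table for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
    conn.executemany(
        "INSERT INTO orders (id, customer_id) VALUES (?, ?)",
        [(1, 10), (2, 10), (3, 20), (4, 30)],
    )
    return conn


def get_orders(conn, customer_id=None):
    """Return order IDs for one customer, or for everyone when no ID is given."""
    # One query serves both cases. A single cached plan for this shape is
    # rarely ideal for both "one customer" and "all rows".
    sql = """
        SELECT id
        FROM orders
        WHERE (:cid IS NULL OR customer_id = :cid)
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql, {"cid": customer_id})]


conn = make_db()
print(get_orders(conn, 10))  # orders for one customer
print(get_orders(conn))      # no parameter supplied: every order
```

The trouble with this shape is that the best plan for a single ID (a seek) and the best plan for everything (a scan) are different, yet both calls share one statement.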


Handling Large Delete Operations in TimescaleDB

Published 2025-05-23 by Kevin Feasel

Semab Tariq deletes a significant amount of data:

In today’s blog, we will discuss another crucial aspect of time-series data management: massive delete operations.

As your data grows over time, older records often lose their relevance but continue to occupy valuable disk space, potentially increasing storage costs and might degrade the performance if not managed well.

Let’s walk through some strategies to clean up or downsample aged data in TimescaleDB, helping you maintain a lean, efficient, and cost-effective database.

The “or downsample” is huge, by the way. As a simple example, suppose you collect one record every millisecond, or 1000 per second. Say that each record holds a date+time and a few floating point numbers that add up to 40 bytes. If we have a year of data at that grain, we have 40 bytes/record * 1000 records/second * 3600 seconds/hour * 24 hours/day * 365.25 days/year, or 1,262,304,000,000 bytes/year. That’s ~1.15 terabytes of data per year (counting 1 terabyte as 2^40 bytes), assuming no compression (which there actually is, but whatever). By contrast, if you keep millisecond-level data for a week, second-level for 3 weeks, and minute-level for the rest of the year, you have:

40 bytes/record * 1000 records/second * 3600 seconds/hour * 24 hours/day * 7 days/week * 1 week = 22.53 gigabytes
40 bytes/record * 1 record/second * 3600 seconds/hour * 24 hours/day * 7 days/week * 3 weeks = 69 megabytes
40 bytes/record * 1 record/minute * 60 minutes/hour * 24 hours/day * 337.25 days = 18.5 megabytes

And for most cases, we only need the lowest level of granularity for a relatively short amount of time. After that, we typically care more about how the current data looks versus older data, for the purposes of trending.
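If you want to play with the retention windows yourself, the arithmetic above can be sketched in a few lines of Python. The 40-byte record size and the three tiers come from the example. The function and variable names are made up here, and compression is ignored, as it is in the figures above.

```python
BYTES_PER_RECORD = 40          # from the example: a timestamp plus a few floats
SECONDS_PER_DAY = 3600 * 24
MINUTES_PER_DAY = 60 * 24


def tier_bytes(records_per_day, days):
    """Uncompressed size of one retention tier, in bytes."""
    return BYTES_PER_RECORD * records_per_day * days


# Everything kept at millisecond grain for a full year.
full_year = tier_bytes(1000 * SECONDS_PER_DAY, 365.25)

# Tiered retention: (label, records per day, days kept at that grain).
tiers = [
    ("millisecond, 1 week", 1000 * SECONDS_PER_DAY, 7),
    ("second, 3 weeks", SECONDS_PER_DAY, 21),
    ("minute, rest of year", MINUTES_PER_DAY, 365.25 - 28),
]
tier_sizes = {label: tier_bytes(rate, days) for label, rate, days in tiers}
tiered_total = sum(tier_sizes.values())

print(f"full year:    {full_year / 2**40:.2f} TiB")
for label, size in tier_sizes.items():
    print(f"{label:22s}{size / 2**20:12.1f} MiB")
print(f"tiered total: {tiered_total / 2**30:.2f} GiB")
print(f"reduction:    {full_year / tiered_total:.0f}x")
```

Under these assumptions, the tiered layout comes to a little under 23 GiB, roughly a fiftieth of the full-grain year.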
