Press "Enter" to skip to content

Author: Kevin Feasel

When R^2 Misleads

Holger von Jouanne-Diedrich explains a common quality metric for regression analyses:

A high R^2 can make a regression model look impressively accurate — but this number can be deceptive. If you want to understand why a high R^2 is not always a sign of a good model, read on!

Click through for that explanation. This post does a fantastic job of explaining the technical reasons why a high R^2 might not be indicative of a good model specification. But I'd add one other piece to the puzzle: what constitutes a high R^2 depends very much on the domain. For example, if you are performing a regression of some process in physics, an R^2 of 0.90 is probably so low as to indicate you've made a horrible mistake somewhere.

By contrast, an R^2 of 0.90 in a social sciences analysis would get you laughed out of the room for obviously faking the data or misspecifying the model to get a number that high.
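As a quick refresher on what the metric actually measures: R^2 = 1 - (SS_res / SS_tot), where SS_res is the sum of squared residuals from your model and SS_tot is the total sum of squares around the mean of the dependent variable. It tells you the share of variance your model explains, and nothing about whether the model is specified correctly.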


Metadata-Driven Frameworks for Change Detection in Microsoft Fabric

Kevin Chant builds a table:

I had various options for this month's contribution due to my experience with various change detection solutions, including Azure Synapse Link for SQL Server 2022, which I covered in previous posts (including one that covered some excessive file tests for Azure Synapse Link for SQL Server 2022).

In the end, I decided to cover developing metadata-driven frameworks for Microsoft Fabric, because it is such a hot topic for multiple reasons, one of which is the growing availability of open-source, metadata-driven frameworks for Microsoft Fabric.

Read on for three such frameworks and some advice on how to use them.


Creating Better Scatterplots

Ruben Van de Voorde embraces the second dimension:

Scatterplots are in a weird place in Power BI reports. They’re incredibly good at their core business: showing how two metrics relate across many things, like products, customers, or suppliers.

But they can miss the landing in a few ways. Sometimes the relationship itself matters but the chart asks the reader to do too much inference: “Why should I care about a product’s Gross Margin % vs. Shipping Weight?” Other times, the reader can’t tell what the dots actually are. A reader asking “What does one dot represent?” is the clearest tell, sometimes followed by “Can’t this be a table instead of these dots?”

Click through for musings about scatterplots, their bubble plot cousins, and what’s available in DAX and Power BI to make them work for you.


Tracking Data Changes

Louis Davidson shows a technique:

A few months ago, I wrote a post about comparing sets of data in SQL. This focused on a type of challenge that is often a one-off challenge. The techniques listed (along with a tool like Redgate's SQL Data Compare) aren't the point of this post, but they are related, because when you do change detection from a source, it is super important to check your results occasionally. Having a copy of the complete source to compare to your destination (even if it is just checksums of the data) is important.

The techniques that Meagan is asking about this month are more ETL-related, where you watch a stream of data and sync changes as they are made. Typically, you don't want to compare all the rows in a set, but just the ones that have changed: at some frequency, give me the changes to one set of data and use them to keep another set up to date.

I’m still partial to using HASHBYTES() on the fields I care about because I don’t trust modified dates unless I know the only way to access that table is via stored procedure, and all of the stored procedures handle updating the modified date correctly.

For overall row comparison, I'd also look at EXCEPT. You do need to do it in both directions, so the pattern is more like (A EXCEPT B) UNION ALL (B EXCEPT A), but it works great and natively handles any NULLs along the way.
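A minimal sketch of that pattern, with hypothetical table and column names:

-- Rows in the source that are missing from (or differ in) the target
(
    SELECT CustomerID, CustomerName, Email
    FROM dbo.CustomerSource
    EXCEPT
    SELECT CustomerID, CustomerName, Email
    FROM dbo.CustomerTarget
)
UNION ALL
-- Rows in the target that are missing from (or differ in) the source
(
    SELECT CustomerID, CustomerName, Email
    FROM dbo.CustomerTarget
    EXCEPT
    SELECT CustomerID, CustomerName, Email
    FROM dbo.CustomerSource
);

The parentheses matter: set operators evaluate left to right, so without them the second EXCEPT would apply to the result of the UNION ALL rather than to the second pair of tables. Any row the query returns exists or differs on one side.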


Semantic Sonar: Canary Testing Power BI Semantic Models

Jens Vestergaard has a new tool:

Quarterly sales season always brings the same ritual: a business user opens Power BI, clicks into the sales dashboard, and waits for the numbers to load. It has been a while since anyone looked at this particular report. The numbers look off. Someone hits refresh. The spinner turns, but the data does not budge. The last refresh was weeks ago.

That is when the emails start. “Can you check why the sales model hasn’t updated?” The support thread grows. IT investigates. The answer is usually not in the data. The refresh failed at 3 AM. Or the data source changed its connection string last Tuesday. Or someone revoked the service principal without telling anyone.

By the time the root cause is found, the report has been open in meetings for hours, and the numbers have been wrong for days.

Read on for a free tool that can prevent this sort of issue.

I think the biggest IT superpower is the ability to consistently troubleshoot a problem from limited information. The second-biggest superpower is the ability to discover issues before the people who rely on your services do. Then you get the issues fixed and nobody ever knows there was even a potential problem.


Using Fabric Data Wrangler for Testing

Kristina Mishra checks out some data:

Data Wrangler has been available for a while now, but I'll be honest: it's not something we've been actively using. We've been heads-down on time-sensitive projects for over a year and, needless to say, our cup runneth over. Recently we've had a bit of respite, and I decided to see how we could use Data Wrangler within the context of our current Microsoft Fabric data warehouse (i.e., medallion layer lakehouses).

Data Wrangler has a lot of cool features that will give you code snippets for what you want to do, but I wanted to use it a different way. I wanted an easy way to do a quick check of dimension tables. I also wanted an easy-peasy way for others, some of whom are not developers, to be able to do a quick sanity check of the data.

Click through to see how it works.


Lessons Learned from Change Data Capture

Deborah Melkin shares some lessons from working with Change Data Capture:

This has definitely been something that I've had some experience with recently. It inspired my “Change Tracking in SQL Server 2025: Exploring Change Event Streaming vs CDC” session that I presented at SQLCON 26. We had been having problems troubleshooting various issues around Change Data Capture (CDC), and someone suggested that I take a look at Change Event Streaming, which is new for SQL Server 2025. One of the great things about putting together sessions like this is that I was able to create a very simple POC to understand how it all works. It was also helpful for me to understand some key takeaways with both of these.

The Change Event Streaming lessons are fairly limited (for good reason), but Deb shares some nice tips on working with CDC.
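If you've not worked with CDC before, enabling it is a two-step process: once at the database level and then once per table. A minimal sketch, with hypothetical database and table names:

-- Step 1: enable CDC for the database (requires sysadmin)
USE YourDatabase;
EXEC sys.sp_cdc_enable_db;

-- Step 2: enable CDC for a specific table (requires db_owner)
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;  -- NULL means no gating role is required to read changes

From there, SQL Server generates a capture instance and you read changes through the cdc.fn_cdc_get_all_changes_* table-valued functions.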


Updating the Documentation on Forced Parameterization

Brent Ozar shares an update:

I love me some documentation.

For years, I've pointed folks to the SQL Server 2008 documentation on Forced Parameterization, a really useful tool for reducing plan cache bloat, getting more accurate reusable query plans, and enabling SQL Server 2019 and 2022's Intelligent Query Processing features, many of which rely on the same query text coming in repeatedly over time in order to tune it. That documentation was pretty sparse, though.
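If you just want the knob itself, it's a single database-level setting (database name hypothetical):

-- Switch from simple (the default) to forced parameterization
ALTER DATABASE YourDatabase SET PARAMETERIZATION FORCED;

-- And back, if it causes plan regressions
ALTER DATABASE YourDatabase SET PARAMETERIZATION SIMPLE;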

Click through for a link to the new documentation and what it entails. Some of us gripe about the documentation and some of us fix it.

Me? I sit up in the balcony with Waldorf and laugh.


Tracking Record Changes in SQL

Andy Brownsword builds a hash key:

The issue: there was no indicator of which records had been modified, and as a result the process took way too long and downstream reporting wasn't available on time.

After reviewing and stepping through the process, it became clear that the vast majority of data didn't change. This was a daily process handling 12 months of data, yet over 99% had no changes at all. However, the process ingested the whole dataset (~250m records) and processed it in SQL.

Click through for an architectural-level discussion. In practice, HASHBYTES() works really well, especially when you use CONCAT() or CONCAT_WS() to put together the columns you care about.
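In case you've not seen the pattern, here's a minimal sketch with invented table and column names:

-- Compute a per-row hash over just the columns that matter for change detection.
-- The separator keeps values like ('ab', 'c') and ('a', 'bc') from hashing identically.
-- Note that CONCAT_WS skips NULL inputs, so wrap columns in ISNULL() if you need
-- to distinguish NULL from an empty string.
SELECT
    ProductID,
    HASHBYTES('SHA2_256',
        CONCAT_WS('|', ProductName, ListPrice, CategoryID)) AS RowHash
FROM dbo.Product;

Store RowHash alongside each destination row and, on the next load, only process the rows whose hash no longer matches.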


Tips for Reading an Explain Plan

Jamal Hansen has a primer on explain plans for Python developers:

We talked early in this series about SQL being a declarative language. You tell the database what you want, and it figures out how to get it. But we’ve also seen that SQL gives you the freedom to do things in many ways, and some of those ways are more efficient than others.

Sometimes, a slow query means you didn’t choose the most efficient approach. Other times, your data has simply outgrown the default way the database finds records, and you need to give it a little help.

And in a judo move, if you already understand how explain plans work, you can figure out how to perform code profiling in Python.
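The syntax varies by engine; if you want to try the same exercise on SQL Server, the rough equivalent of EXPLAIN is to capture the execution plan (table name hypothetical):

-- Return the actual execution plan (as XML) alongside the query results
SET STATISTICS XML ON;

SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE CustomerID = 42;

SET STATISTICS XML OFF;

From there, you look at which operators the engine chose; a scan over a large table where you expected a seek is the classic first clue.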
