Press "Enter" to skip to content

Day: April 14, 2025

Data Splitting and Cross-Validation in R

Nick Han has a pair of articles. First up is on data splitting and pre-processing:

Data preprocessing is a crucial step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In this blog post, we’ll walk through the process of splitting and preprocessing data in R, using the rsample package for data splitting and saving the results for future use.

H/T R-Bloggers for that one.

The second involves using cross-validation via the caret package in R:

Cross-validation is a resampling technique used to assess the performance and generalizability of machine learning models. It helps address issues like overfitting and ensures that the model’s performance is consistent across different subsets of the data. By splitting the data into multiple folds and repeating the process, cross-validation provides a robust estimate of model performance.

H/T R-Bloggers for that as well.

Leave a Comment

An Introduction to Temporal Tables

Stephen Planck covers a feature in SQL Server:

Have you ever wanted to see exactly how a row looked in your SQL Server database a week or a month ago—without writing complicated auditing code or triggers? With temporal tables (also known as system-versioned tables), you can do precisely that. Temporal tables automatically track historical data changes by maintaining multiple versions of each row in a separate history table. Let’s explore how they work, why you might use them, how to set them up, and what best practices to follow.

Click through for a good overview of the feature. I have mixed feelings on the feature because I think it’s halfway-finished and has been since 2016. But there are two things that I think the feature really needs to shine.

The first is user-defined versioning. SQL Server only offers system time for version tracking, meaning it keeps track of when you insert the row into the table. That’s fine for certain historical operations, but terrible if you want to use temporal tables for type-2 slowly changing dimensions, where you often care more about when a record became effective rather than when you inserted the row into the dimension.

The second is efficient historical slicing the same way you can do AS OF operations. AS OF lets you ask questions about what the data looked like at a specific point in time. For warehousing scenarios, we also want to look at the history of changes for a particular keyed record, so you might see all of the changes to a customer or an employee. You can do this with a UNION ALL operation, but that query logic can get complex.

Leave a Comment

Estimating SQL Server Backup Sizes

Rebecca Lewis gives us a rule of thumb:

How big will my backup file be?  Very good question.  And of course, it depends… especially if you’re using compression.

I was working something today that required a backup beforehand for safety measures, and the server had limited storage.  So, I needed to estimate the backup file size beforehand to confirm whether I had enough space for the bak file.

Click through for the script. It does, of course, include some simplifications and your actual numbers can turn out a bit different, but for a quick and dirty estimate of disk space needed, this isn’t bad at all.

Leave a Comment

Why Use XACT_ABORT?

Josephine Bush answers a team member’s question:

This came up one day at my work when a developer was using it. I hadn’t used it before and thought I’d better check it out. It’s off by default in SQL server, but why would you use it?

Click through for the tl;dr version, followed by a more thorough explanation. I wouldn’t set XACT_ABORT on by default, but there have been plenty of instances in which I’ve wanted to handle a lengthier series of operations as an all-or-nothing job, and this is a great way to do it.

Leave a Comment

Creating a Microsoft Fabric Capacity

Boniface Muchendu builds out some capacity:

To begin, we need to head over to the Azure portal. You might wonder why we are starting here. Well, Microsoft Fabric is now an Azure resource, which means all initial setups must be done in the Azure environment.

Click through for step-by-step instructions. Microsoft has also been really good about letting people create (and re-create and re-create) trial capacities, so if you’re just futzing about with the product to get an idea of what it can do, see if you can use that rather than shelling out the cash.

Leave a Comment

Feeding Language Models Bad Advice

Louis Davidson begins an experiment:

So, I got this idea to test out a few LLM, ChatGPT and the Web and Office Copilot at the very least and see how they handle a load of bad advice. So I put out a question on X, asking:

“A request! Send me your most realistic, but worst, SQL Server management advice. I want to test (and write an article) about using AI to fact check writing.”

And if there’s anything this community is good at, it’s providing bad advice for purposes of lampooning.

We are going to need to wait a week to see Louis’s results, but you can check out some of the terrible advice a variety of X users proffered.

Leave a Comment

Snowflake Query Tags in Power BI

Chris Webb takes some of the shine off of things:

Since the November 2024 Power BI release blog post announced that queries sent to Snowflake by Power BI include a query tag I’ve had a lot of questions from people who couldn’t see this happening or wanted to know what the query tags contained, so in this blog I thought I would outline the current status.

It turns out that the query tag isn’t as far along as the blog post indicated, and there are some pretty big limitations in the cases in which there actually is tagging.

Leave a Comment