Author: Kevin Feasel

Choosing between Data Scalers in a Data Science Project

Bala Priya C performs a comparison:

In this article, you will learn how MinMaxScaler, StandardScaler, and RobustScaler transform skewed, outlier-heavy data, and how to pick the right one for your modeling pipeline.

Topics we will cover include:

  • How each scaler works and where it breaks on skewed or outlier-rich data
  • A realistic synthetic dataset to stress-test the scalers
  • A practical, code-ready heuristic for choosing a scaler

Read on to learn more about each of these three scalers and the use cases that best fit each, and to find a flow chart at the end.
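To make the outlier problem concrete, here is a minimal sketch using only the Python standard library. It is not the article's code: the three helper functions are hypothetical stand-ins that follow the same default formulas as scikit-learn's MinMaxScaler (range 0 to 1), StandardScaler (mean and population standard deviation), and RobustScaler (median and interquartile range), and the five-point dataset is made up for illustration.

```python
import statistics


def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Made-up data: four ordinary points and one extreme outlier.
data = [1, 2, 3, 4, 100]

for name, scaler in [
    ("min-max", min_max_scale),
    ("standard", standard_scale),
    ("robust", robust_scale),
]:
    print(name, [round(v, 3) for v in scaler(data)])
```

With this data, the single outlier squeezes the four ordinary points into roughly the bottom 3% of the min-max range, while median and IQR scaling keeps them spread between -1 and 0.5 and leaves the outlier far away from them.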

A Primer on Principal Component Analysis

Harris Amjad explains the basics of principal component analysis:

In this series of tips, we will delve into the unsupervised learning branch of Machine Learning. Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction, but its mathematical foundation involving eigenvalues and eigenvectors can be intimidating. This tip aims to demystify PCA, explaining its purpose, how it works, and its use in visualizing high-dimensional data.

Click through to learn how it works. This is a solid primer.
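For a feel of what the eigenvalue and eigenvector step is doing, here is a small sketch of PCA on two-dimensional data in plain Python. It is not taken from Harris's tip: the function name pca_2d and the sample points are assumptions for illustration, and the closed-form eigenvalue formula only works for a 2x2 covariance matrix. Real work would use a linear algebra library.

```python
import math


def pca_2d(points):
    """Return (first principal direction, projected scores, explained variance ratio)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    gap = math.sqrt(max(trace * trace / 4 - det, 0.0))
    largest, smallest = trace / 2 + gap, trace / 2 - gap

    # Eigenvector for the largest eigenvalue: solves (sxx - largest) * vx + sxy * vy = 0.
    if sxy != 0:
        vx, vy = sxy, largest - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    # Project each centered point onto the first principal direction.
    scores = [x * vx + y * vy for x, y in centered]
    explained = largest / (largest + smallest) if trace > 0 else 0.0
    return (vx, vy), scores, explained


# Made-up points lying exactly on the line y = 2x.
direction, scores, explained = pca_2d([(1, 2), (2, 4), (3, 6), (4, 8)])
print(direction, explained)
```

Because the sample points sit on a single line, one component captures essentially all of the variance, which is the dimensionality reduction idea in miniature: two columns in, one meaningful column out.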

VARCHAR or NVARCHAR

Brent Ozar asks a question:

You’re building a new table or adding a column, and you wanna know which datatype to use: VARCHAR or NVARCHAR?

If you need to store Unicode data, the choice is made for you: NVARCHAR says it’s gonna be me.

But if you’re not sure, maybe you think, “I should use VARCHAR because it takes half the storage space.” I know I certainly felt that way, but a ton of commenters called me out on it when I posted an Office Hours answer about how I default to VARCHAR. One developer after another told me I was wrong, and that in 2025, it’s time to default to NVARCHAR instead. Let’s run an experiment!

This is going back a long way (June of 2020) but one of my earliest YouTube videos is entitled NVARCHAR Everywhere. I’ve gotten a lot better at presentation skills since then (and have a much nicer camera), but I still stand by the arguments.
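As a rough way to see where the "half the storage space" intuition comes from, here is a Python illustration rather than a SQL Server measurement. It assumes VARCHAR under a single-byte code page (Windows-1252 here) and NVARCHAR stored as UTF-16, and it ignores row overhead, length prefixes, and compression. The helper name payload_bytes is made up for this sketch.

```python
def payload_bytes(text):
    """Return (single-byte code page size, UTF-16 size) in bytes for text.

    The first element is None when the text cannot be represented in
    Windows-1252, which is the case where a single-byte VARCHAR would lose data.
    """
    try:
        single_byte = len(text.encode("cp1252"))
    except UnicodeEncodeError:
        single_byte = None
    utf16 = len(text.encode("utf-16-le"))
    return single_byte, utf16


for sample in ["Curated SQL", "Zoë", "日本語"]:
    print(repr(sample), payload_bytes(sample))
```

For plain Western European text the UTF-16 size is exactly double, and for text outside the code page there is no single-byte size to compare at all, which is the heart of the argument on both sides.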

Tracking Memory Consumption in Fabric SQL Database

Lance Wright tracks memory utilization:

SQL Database in Fabric continues its commitment to providing you with robust tools for database management, performance monitoring, and optimization. Earlier this year, we released a performance dashboard to help you monitor and improve the performance of your SQL Database in Fabric. We’ve improved upon those performance monitoring capabilities with the ability to track memory consumption. This new capability delivers real-time, actionable data regarding the memory utilization of all database queries to help you make more informed decisions and manage SQL Database resources more efficiently.

Read on to see what you can do with this.

Narrowing down Slowdown Causes in SQL Server

Kevin Hill continues a series on solving the age-old “The server is slow!” problem:

At this point you’ve:

  • Defined what “slow” means and built a timeline (Part 1).
  • Checked things outside SQL Server like network, storage, and VM noise (Part 2).

Now it’s time to open the hood on SQL Server itself.

I think Kevin’s checklist is a pretty solid one for the type of client he often deals with: one without an in-house DBA or the expertise to stay on top of server problems.

Row Expansion in T-SQL

Louis Davidson solves a problem:

On LinkedIn a few days ago, there was a question that I found interesting about what was purported to be an interview question. The gist was “say you have a set that looks like this:

OrderId Item Quantity
------- ---- ---------
O1      A1   5
O2      A2   1
O3      A3   3

and you need to expand it to be one row based on the value in Quantity

Admittedly, this kind of problem is fairly uncommon in the business world, but it is exactly the sort of thing a tally table can solve, and that’s what Louis uses. Louis also gets brownie points for praising CROSS APPLY along with tally tables.
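Here is a minimal sketch of the tally table idea, run against an in-memory SQLite database from Python so it is self-contained. This is not Louis's T-SQL solution: the table names OrderItem and Tally and the ItemNumber alias are assumptions, the tally table is capped at 100 rows for illustration, and SQLite has no CROSS APPLY, so a plain join on the quantity stands in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Sample rows from the question.
conn.execute("CREATE TABLE OrderItem (OrderId TEXT, Item TEXT, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO OrderItem (OrderId, Item, Quantity) VALUES (?, ?, ?)",
    [("O1", "A1", 5), ("O2", "A2", 1), ("O3", "A3", 3)],
)

# A tally (numbers) table holding 1..100; it must cover the largest Quantity.
conn.execute("CREATE TABLE Tally (n INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO Tally (n) VALUES (?)", [(i,) for i in range(1, 101)])

# Each order row matches every tally number up to its Quantity,
# which yields one output row per unit ordered.
expanded = conn.execute(
    """
    SELECT o.OrderId, o.Item, t.n AS ItemNumber
    FROM OrderItem AS o
    JOIN Tally AS t
        ON t.n <= o.Quantity
    ORDER BY o.OrderId, t.n
    """
).fetchall()

for row in expanded:
    print(row)
```

The join predicate does all the work: an order with a Quantity of 5 pairs with tally values 1 through 5, so it appears five times, and no loop or cursor is needed.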

Scaling Kafka Streams Applications

The Confluent employee mines have a new article:

As the adoption of real-time data processing accelerates, the ability to scale stream processing applications to handle high-volume traffic is paramount. Apache Kafka®, the de facto standard for distributed event streaming, provides a powerful and scalable library in Kafka Streams for building such applications. 

Scaling a Kafka Streams application effectively involves a multi-faceted approach that encompasses architectural design, configuration tuning, and diligent monitoring. This guide will walk you through the essential strategies and best practices to ensure your Kafka Streams applications can gracefully handle massive throughput.

The post gets into some details around the kinds of limits you’ll hit during scaling, scale-up versus scale-out, and configuration settings to help with that scale.

Use Cases for Window Functions

I have a new video:

In this video, I take you through a variety of use cases for window functions, showing how you can solve common (and sometimes uncommon) business problems efficiently and effectively.

This video builds on the prior two, which showed what the different window functions are and how they work. This one focuses primarily on solving business problems in sometimes-clever ways.

Tag-Based Masking in Snowflake

Kevin Wilkie gets tagging:

If you’ve followed our site for a while, you would have seen in a previous post how powerful tag-based masking policies are in Snowflake. They let you enforce consistent data masking rules across columns without constantly rewriting logic. But Snowflake hasn’t stopped there—recent enhancements now make it even easier to classify, tag, and mask data at scale. In this post, we’ll recap the essentials of tag-based masking, highlight the new functionality, and share some practical tips for rolling it out in your environment.

Kevin has a new blog theme and everything.

Adding a Drillthrough Button in Power BI

Elena Drakulevska adds a button:

If you’ve been building Power BI reports, you probably know about drillthrough.

In short: drillthrough lets users move from a summary view to a detail page focused on one data point. For example, you can right-click on Austria in a sales chart and jump straight to a page showing visuals and metrics only about Austria.

Sounds powerful, right?

The catch: most users don’t even know it’s been implemented.

The other catch: those of us sad souls using Power BI Report Server don’t get drillthrough at all.
