Month: May 2025

Using Multiple Scales with ggplot2 and ggnewscale

Zhenguo Zhang resets the scale:

In one ggplot figure, normally you can only use one scale for each aesthetic mapping. For example, if you use scale_color_manual() to set the color scale for a layer, you cannot use another scale_color_manual() for another layer, or set the color scale more than once in the function aes(). However, you can use the new_scale_color() function from the ggnewscale package to add a new scale for the same aesthetic mapping in different layers.

In this post, I will showcase how to use the new_scale_color() function to add two different color scales in a ggplot figure. The first scale will be for a discrete variable (e.g., number of cylinders), and the second scale will be for a continuous variable (e.g., density level).

Click through for the code and a demonstration.

Git Branching for Small Teams

Adron Hall takes us through a branching strategy:

Git. It’s the tool that makes some of us developers wonder why they didn’t become a carpenter. But let’s face it: Git is here to stay. And for a small team—like, say, 3-4 developers working on the same codebase—getting your branching strategy right can be the difference between smooth sailing and a storm of merge conflicts that will make you question every decision you’ve ever made in life.

So let’s dive into a “simple” strategy for keeping Git under control. No complex workflows, no corporate jargon—just a few solid, time-tested practices to keep you from drowning in source control hell. Because seriously, git is actually super easy and a thousand times better than all the garbage attempts at source control that came before.

Click through for Adron’s advice. Feature branches start making sense once you have more than 2 or maybe 3 developers working in the same repo.

Partitioning in PostgreSQL

Umair Shahid takes us into partitioning strategies in PostgreSQL:

My recommended methodology for performance improvement of PostgreSQL starts with query optimization. The second step is architectural improvements, part of which is the partitioning of large tables.

Partitioning in PostgreSQL is one of those advanced features that can be a powerful performance booster. If your PostgreSQL tables are becoming very large and sluggish, partitioning might be the cure. 

It’s interesting to compare this against SQL Server, where partitioning is not a strategy for query performance improvements.

Shortcut Caching in Microsoft Fabric now GA

Trevor Olson announces a feature has become generally available:

Shortcuts in OneLake allow you to quickly and easily source data from external cloud providers and use it across all Fabric workloads such as Power BI reports, SQL, Spark and Kusto.  However, each time these workloads read data from cross-cloud sources, the source provider (AWS, GCP) charges additional egress fees on the data. Thankfully, shortcut caching allows the data to only be sourced once and then used across all Fabric workloads without additional egress fees.

This is useful for data that hardly ever changes, and Trevor also shows you how to control the cache length and reset the cache. In addition, the on-premises gateway for shortcuts is now generally available, so you can take shortcuts of certain on-prem file systems.

Set-Based Comparisons for Data Validation

Jeffry Schwartz looks for exceptions:

Given the complexity, I realized that validating all intermediate and final result sets was essential to ensure that tuning changes did not alter any report results.  To support this validation, I saved interim and final result sets into tables for direct comparison.

For these comparisons, the EXCEPT and INTERSECT operators proved invaluable. 

Click through for the full story. I’ve always liked using these set operations in ETL jobs because they automatically know how to handle NULL, so this approach is more robust than rigging your own comparisons.
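
As a rough illustration of the NULL-handling point, here is a minimal sketch that compares two saved result sets with EXCEPT and INTERSECT. It uses SQLite through Python's built-in sqlite3 module purely as a stand-in because it needs no setup; the table names (before_tuning, after_tuning), columns, and data are all hypothetical and not taken from Jeffry's post.

```python
import sqlite3

# In-memory database; tables and rows are made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE before_tuning (id INTEGER, region TEXT, total REAL);
    CREATE TABLE after_tuning  (id INTEGER, region TEXT, total REAL);
    INSERT INTO before_tuning VALUES (1, 'East', 10.0), (2, NULL, 20.0), (3, 'West', 30.0);
    INSERT INTO after_tuning  VALUES (1, 'East', 10.0), (2, NULL, 20.0), (3, 'West', 31.0);
    """
)

# Rows in the old result set that the new one no longer produces.
missing_after = conn.execute(
    "SELECT * FROM before_tuning EXCEPT SELECT * FROM after_tuning ORDER BY 1"
).fetchall()

# Rows the new result set produces that the old one did not.
extra_after = conn.execute(
    "SELECT * FROM after_tuning EXCEPT SELECT * FROM before_tuning ORDER BY 1"
).fetchall()

# Rows common to both. The row with a NULL region counts as a match here.
matching = conn.execute(
    "SELECT * FROM before_tuning INTERSECT SELECT * FROM after_tuning ORDER BY 1"
).fetchall()

# A hand-rolled equality join, for contrast: NULL = NULL is not true,
# so the row with the NULL region silently drops out of the match list.
joined_ids = conn.execute(
    """
    SELECT b.id
    FROM before_tuning AS b
    JOIN after_tuning AS a
      ON a.id = b.id AND a.region = b.region AND a.total = b.total
    ORDER BY b.id
    """
).fetchall()

print("missing after tuning:", missing_after)
print("extra after tuning:  ", extra_after)
print("matching rows:       ", matching)
print("ids matched by join: ", joined_ids)
```

If both EXCEPT queries come back empty, the two result sets contain the same distinct rows. Note that these operators work on distinct rows, so a check like this will not notice a change in duplicate counts.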

sudo in Windows

Patrick Gruenauer elevates our access:

Sudo for Windows is a new way for users to execute commands with elevated privileges (as an administrator) directly from a non-elevated console session on Windows.

The following requirements apply to the use of sudo in Windows:

  1. Windows 11 24H2
  2. Sudo needs to be enabled

Click through to see how to activate sudo. The English-language header reads “System > For Developers” and the exact setting is at the bottom of the first section and has the name “Enable sudo” with a toggle switch. The number of times I’ve run a command just to see it error out because I needed to be in an administrative command prompt or PowerShell terminal is high enough that I immediately turned it on.

But importantly, this is different from Linux, in that it opens up a new command prompt or PowerShell terminal rather than executing the command with elevated permissions in the same prompt. This is important because that new prompt goes away after the command finishes, so you lose the output. In other words, if you run sudo ipconfig in a command prompt, it will hit you with a UAC request (depending on how you’ve configured your PC) and then run ipconfig in a new command prompt, which disappears as soon as the command finishes. You don’t get to keep what was in stdout. I think this limits some of the capability of the option, unfortunately.

Parameterization and Mocking in Python Tests

Aida Gjoka and Russ Hyde show off some capabilities in the pytest library:

Writing tests is one of the best ways to keep your code reliable and reproducible. This post builds on our previous blog about Python testing with pytest Part 1, and explores some of the more advanced features it offers. From parametrised fixtures to mocking and other useful pytest plugins, we will show how to make your tests more reproducible, easier to manage and demonstrate how writing simple tests can save you time in the long run.

Click through to learn more. I’m a huge fan of parameterization in pytest—it’s really easy to do. Mocks are a bit harder to pull off in practice, though quite useful.
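
To give a flavor of both ideas, here is a minimal sketch, assuming pytest is installed. The functions under test (fahrenheit_to_celsius, describe_weather) and the client object are hypothetical examples of mine, not code from the linked post. The first test uses pytest.mark.parametrize to run one test body over several inputs; the second uses Mock from the standard library's unittest.mock to stand in for an external service.

```python
from unittest.mock import Mock

import pytest


def fahrenheit_to_celsius(temp_f):
    """Hypothetical function under test."""
    return (temp_f - 32) * 5 / 9


def describe_weather(client, city):
    """Hypothetical function that depends on an external service client."""
    temp_f = client.get_temperature(city)
    return f"{city}: {fahrenheit_to_celsius(temp_f):.1f} C"


# One test body, three cases: pytest reports each tuple as its own test.
@pytest.mark.parametrize(
    "temp_f, expected_c",
    [(32, 0.0), (212, 100.0), (-40, -40.0)],
)
def test_fahrenheit_to_celsius(temp_f, expected_c):
    assert fahrenheit_to_celsius(temp_f) == pytest.approx(expected_c)


# The mock replaces the real client, so no network call is made.
def test_describe_weather_uses_client():
    client = Mock()
    client.get_temperature.return_value = 212
    assert describe_weather(client, "Oslo") == "Oslo: 100.0 C"
    client.get_temperature.assert_called_once_with("Oslo")
```

Saved in a file whose name starts with test_, this runs with the pytest command. The harder part of mocking in practice tends to be deciding where to patch when the dependency is imported rather than passed in as an argument, which this sketch sidesteps.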

Real-Time Data Streaming in Snowflake

Anil Kumar Moka streams some data:

Real-time data ingestion has become essential for modern analytics and operational intelligence. Organizations across industries need to process data streams from IoT sensors, financial transactions, and application events with minimal latency. Snowflake offers two robust approaches to meet these real-time data needs: Snowpipe for near-real-time file-based streaming and Direct Streaming via Snowpark API for true real-time data integration.

This guide explores both options in depth, providing detailed implementations with explanation of code parameters, performance comparisons, and practical recommendations to help you choose the right approach for your specific use case.

Click through to see how it works. I’ll only make one semi-snarky comment that ‘real-time’ doesn’t mean “takes several seconds” but I realize I’m the one tilting at windmills here.

Checking Key Vault Access in Microsoft Fabric Spark Notebooks

Marc Lelijveld has clearance:

Working with sensitive data in Microsoft Fabric requires careful handling of secrets, especially when collaborating externally. In a recent customer engagement, I needed to validate access to Azure Key Vault from within a Fabric Notebook, without ever exposing the actual secret values. With only read access granted and no need to manage or update secrets, I focused on confirming that the connection was working as expected.

In this blog, I’ll walk you through the approach, including the setup, code snippets, and logic behind this quick but crucial verification step.

Click through for the full story.
