Python – Curated SQL

Versatile, interpretable, and effective for a variety of use cases, decision trees have been among the most well-established machine learning techniques for decades, widely used for classification and regression tasks. Yet, they are still widely used — whether as standalone models or as components of more powerful ensemble methods like random forests and gradient boosting machines.

And there is one more attractive feature that pushes the boundaries of their versatility even further: they can accommodate data in diverse formats, beyond just fully structured, tabular data. This article examines this facet of decision trees from a balanced theoretical and practical approach.

Click through for an example.

From Pandas to Polars

Published 2025-06-19 by Kevin Feasel

Ivan Palomares Carrascosa provides an introduction to the polars library:

Polars is currently one of the fastest open-source libraries for data manipulation and processing on a single machine, featuring an intuitive and user-friendly API. Natively built in Rust, it is designed to optimize low memory consumption and speed while working with DataFrames.

This article takes a tour of Polars library in Python and illustrates how it can be seamlessly used similarly to Pandas to efficiently manipulate large datasets.

My experience with polars is that it’s not a 1:1 replacement for pandas, but the interfaces are similar enough that a lot of code can swap over without much effort. And yes, it’s typically faster.

Comments closed

Multithreading and Multiprocessing in Python

Published 2025-06-17 by Kevin Feasel

Jessica Wachtel explains how two systems work in Python:

Let’s use a simple example to understand them: a mechanics shop. Concurrency happens when one mechanic works on several cars by switching between them. For example, the mechanic changes the oil in one car while waiting for a part for another. They don’t finish one car before starting the next, but they can’t do two tasks at exactly the same time. The tasks overlap in time but don’t happen simultaneously.

Click through for the analogy, how it applies to Python, and tips and tricks around each.

Comments closed

Finding Power BI Report History in Microsoft Fabric

Published 2025-06-17 by Kevin Feasel

Meagan Longoria wants to know who viewed these reports:

Lately I have been building scripts to help clients audit their Fabric environment. The latest request was to get a list of reports and what users have viewed them recently. This is a quick job for a Fabric notebook and the Semantic Link Labs library.

Click through for the notebook, as well as an explanation of how it all works.

Comments closed

Loading Data into Snowflake via Python

Published 2025-06-13 by Kevin Feasel

Anil Kumar Moka does a bit of data loading:

In our ongoing exploration of Snowflake data loading strategies, we’ve previously examined how to use pandas with SQLAlchemy to efficiently move data into Snowflake tables. That approach leverages pandas’ intuitive DataFrame handling and works well for many common scenarios where you’re already manipulating data in Python before loading it to Snowflake.

In this article, we’re diving deeper into the Snowflake toolbox by exploring the native Snowflake Connector for Python. While pandas offers simplicity and familiarity, the native connector provides a different set of capabilities focused on precision control and Snowflake-specific optimizations. This article explains you when and how to use this more direct approach for everything from small CSV files to massive datasets that would overwhelm pandas.

Click through for the full article.

Comments closed

Handling Imbalanced Data in Python

Published 2025-06-13 by Kevin Feasel

Ivan Palomares Carrascosa gives three ways to deal with imbalanced data:

Here’s the catch: having imbalanced data usually makes analysis processes more difficult, especially for machine learning models that can easily get biased toward the majority class as a result of dealing with data with a remarkably unequal class distribution, thereby ending up becoming an almost “dummy classifier” that assigns the same class to virtually everything — in the most extreme case.

This article shows several strategies to navigate and handle imbalanced datasets using two of Python’s most stellar libraries for “all things data”: Pandas and Scikit-learn.

Click through for those ways, including sample code.

Comments closed

Custom Libraries in Microsoft Fabric Data Engineering

Published 2025-06-13 by Kevin Feasel

Gerhard Brueckl isn’t content with the defaults:

When working with Spark or data engineering in general in Microsoft Fabric, you will sooner or later come to the point where you need to reuse some of the code that you have already written in another notebook. Best practice is to put these code pieces into a central place from where it can be referenced and reused. This way you can make sure all notebooks always use the very same code and it is also easy to develop, update and test the common functions.

As Gerhard mentions, having common notebooks with utilities is fine for when you’re getting started with development, but being able to centralize functions in proper libraries can make that code a lot more useful, not just in the context of the single notebook.

I believe that this does allow for arbitrary code execution, so someone with sufficient permissions to create a notebook and import code from arbitrary locations would be able to execute that code. I think there are ways of limiting this risk (such as not allowing your Fabric hosts to connect to any remote servers other than ones you explicitly allow), but it’s something I’d have to puzzle through.

Comments closed

Vector Search from Scratch

Published 2025-06-11 by Kevin Feasel

Kanwai Mehreen does a bit of searching:

In this article, I’ll walk you through every step from generating vector representations to searching using cosine similarity, and we’ll even visualize what’s happening behind the scenes. By the end, you’ll not only understand how vector search works but also have a working implementation you can build on. So, let’s get started.

It’s kind of funny how simple this is, but it is. A lot of the complexity is around data quality operations, as well as optimizing the search process.

Comments closed

Debugging Fabric UDFs in Visual Studio Code

Published 2025-06-10 by Kevin Feasel

Sunitha Muthukrishna takes us through a debugging exercise:

Debugging your code is important to identify issues and mitigate them when you’re working with user data functions in Microsoft Fabric. You want to make sure everything works as it should and that’s where local debugging lets you catch problems in your code without messing with the live environment. In this blog post, I will walk you through the steps to make local debugging easier and faster.

Click through to see what you’ll need, as well as the process to debug a function locally.

Comments closed

Advanced Imputation Techniques via scikit-learn

Published 2025-06-09 by Kevin Feasel

Ivan Palomares Carrascosa isn’t just using the median:

Missing values appear more often than not in many real-world datasets. There can be instances with missing values in one or several of their attributes for various reasons, such as human error, corrupted data, or incomplete data collection processes, e.g. from surveys with optional fields. While there exist basic strategies to deal with instances or attributes containing missing values, — like removing rows or columns entirely, or imputing missing values with a default value (typically the mean or median of the attribute) — these strategies are sometimes not sufficient.

This article presents some advanced strategies to handle missing data, namely, imputation techniques made possible through a combined use of Pandas and Scikit-learn libraries in Python.

Click through for three such techniques, including an example of how to use the technique and under which circumstances to avoid that technique.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Python

Decision Trees and Non-Tabular Data

From Pandas to Polars

Multithreading and Multiprocessing in Python

Finding Power BI Report History in Microsoft Fabric

Loading Data into Snowflake via Python

Handling Imbalanced Data in Python

Custom Libraries in Microsoft Fabric Data Engineering

Vector Search from Scratch

Debugging Fabric UDFs in Visual Studio Code

Advanced Imputation Techniques via scikit-learn