Python – Page 2 – Curated SQL

Loading Data into Snowflake via Python

Published 2025-06-13 by Kevin Feasel

Anil Kumar Moka does a bit of data loading:

In our ongoing exploration of Snowflake data loading strategies, we’ve previously examined how to use pandas with SQLAlchemy to efficiently move data into Snowflake tables. That approach leverages pandas’ intuitive DataFrame handling and works well for many common scenarios where you’re already manipulating data in Python before loading it to Snowflake.

In this article, we’re diving deeper into the Snowflake toolbox by exploring the native Snowflake Connector for Python. While pandas offers simplicity and familiarity, the native connector provides a different set of capabilities focused on precision control and Snowflake-specific optimizations. This article explains when and how to use this more direct approach for everything from small CSV files to massive datasets that would overwhelm pandas.

Click through for the full article.

Handling Imbalanced Data in Python

Published 2025-06-13 by Kevin Feasel

Ivan Palomares Carrascosa gives three ways to deal with imbalanced data:

Here’s the catch: imbalanced data usually makes analysis more difficult, especially for machine learning models, which can easily become biased toward the majority class when the class distribution is remarkably unequal. In the most extreme case, the model ends up as an almost “dummy classifier” that assigns the same class to virtually everything.

This article shows several strategies to navigate and handle imbalanced datasets using two of Python’s most stellar libraries for “all things data”: Pandas and Scikit-learn.

Click through for those ways, including sample code.
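
As a rough illustration of the kind of fix involved, here is a minimal sketch of random oversampling, one common way to rebalance classes. It may or may not be one of the three approaches in the article, and it uses only the standard library rather than Pandas or Scikit-learn. The function name, the `fraud` label, and the sample rows are all made up for the example.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Group rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    # Every class is topped up to the size of the majority class.
    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (zero extra rows for the majority).
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical data: 8 legitimate transactions, 2 fraudulent ones.
rows = [{"amount": i, "fraud": 0} for i in range(8)]
rows += [{"amount": 100 + i, "fraud": 1} for i in range(2)]

balanced = oversample_minority(rows, "fraud")
counts = Counter(row["fraud"] for row in balanced)
```

After the call, both classes have eight rows. Oversampling like this should only be applied to the training split, since duplicated rows in a test set would inflate the scores.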

Custom Libraries in Microsoft Fabric Data Engineering

Published 2025-06-13 by Kevin Feasel

Gerhard Brueckl isn’t content with the defaults:

When working with Spark or data engineering in general in Microsoft Fabric, you will sooner or later come to the point where you need to reuse some of the code that you have already written in another notebook. Best practice is to put these code pieces into a central place from where it can be referenced and reused. This way you can make sure all notebooks always use the very same code and it is also easy to develop, update and test the common functions.

As Gerhard mentions, having common notebooks with utilities is fine for when you’re getting started with development, but being able to centralize functions in proper libraries can make that code a lot more useful, not just in the context of the single notebook.

I believe that this does allow for arbitrary code execution, so someone with sufficient permissions to create a notebook and import code from arbitrary locations would be able to execute that code. I think there are ways of limiting this risk (such as not allowing your Fabric hosts to connect to any remote servers other than ones you explicitly allow), but it’s something I’d have to puzzle through.

Vector Search from Scratch

Published 2025-06-11 by Kevin Feasel

Kanwal Mehreen does a bit of searching:

In this article, I’ll walk you through every step from generating vector representations to searching using cosine similarity, and we’ll even visualize what’s happening behind the scenes. By the end, you’ll not only understand how vector search works but also have a working implementation you can build on. So, let’s get started.

It’s kind of funny how simple this is, but it is. A lot of the complexity is around data quality operations, as well as optimizing the search process.
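
To give a sense of how little code the core idea needs, here is a minimal sketch of brute-force vector search using cosine similarity. It is not the article's implementation: the three-dimensional "embeddings" are made up, and the function names are hypothetical. A real system would get its vectors from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def search(query, index, top_k=2):
    """Score every stored vector against the query and return the best matches."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up vectors standing in for real embeddings.
index = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.0],
    "car": [0.0, 0.2, 0.9],
}

results = search([1.0, 0.0, 0.0], index)
```

Here `results` ranks "cat" first and "dog" second. Scanning every vector like this is fine for small collections. The search optimization mentioned above is about avoiding that full scan as the collection grows.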

Debugging Fabric UDFs in Visual Studio Code

Published 2025-06-10 by Kevin Feasel

Sunitha Muthukrishna takes us through a debugging exercise:

Debugging your code is important to identify issues and mitigate them when you’re working with user data functions in Microsoft Fabric. You want to make sure everything works as it should and that’s where local debugging lets you catch problems in your code without messing with the live environment. In this blog post, I will walk you through the steps to make local debugging easier and faster.

Click through to see what you’ll need, as well as the process to debug a function locally.

Advanced Imputation Techniques via scikit-learn

Published 2025-06-09 by Kevin Feasel

Ivan Palomares Carrascosa isn’t just using the median:

Missing values appear more often than not in many real-world datasets. There can be instances with missing values in one or several of their attributes for various reasons, such as human error, corrupted data, or incomplete data collection processes, e.g. from surveys with optional fields. While there exist basic strategies to deal with instances or attributes containing missing values (like removing rows or columns entirely, or imputing missing values with a default value, typically the mean or median of the attribute), these strategies are sometimes not sufficient.

This article presents some advanced strategies to handle missing data, namely, imputation techniques made possible through a combined use of Pandas and Scikit-learn libraries in Python.

Click through for three such techniques, including an example of how to use each technique and the circumstances under which to avoid it.

Writing a Python Data Frame to a Lakehouse Table

Published 2025-06-04 by Kevin Feasel

Gilbert Quevauvilliers continues a series on Python notebooks and DuckDB:

In this blog post I am going to explain how to loop through a data frame to query data and write once to a Lakehouse table.

The example I will use is to loop through a list of dates which I get from my date table, then query an API, append to an existing data frame and finally write once to a Lakehouse table.

Click through for the code, as well as a sample notebook you can use.
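
The general shape of that pattern (loop, accumulate, write once) looks something like the sketch below. This is not Gilbert's code: `fetch_rows_for` and `write_table` are hypothetical stand-ins for the API call and the Lakehouse write, and plain lists of dictionaries take the place of a data frame.

```python
from datetime import date


def fetch_rows_for(day):
    """Hypothetical stand-in for an API call; returns the rows for one date."""
    return [{"date": day.isoformat(), "value": day.day * 10}]


def write_table(rows, sink):
    """Hypothetical stand-in for a single write to a Lakehouse table."""
    sink.append(list(rows))


# In the article these dates come from a date table; here they are hard-coded.
dates = [date(2025, 6, 1), date(2025, 6, 2), date(2025, 6, 3)]

# Accumulate results in memory across all iterations of the loop...
collected = []
for day in dates:
    collected.extend(fetch_rows_for(day))

# ...then write exactly once, rather than once per date.
writes = []
write_table(collected, writes)
```

Writing once at the end keeps the number of table writes at one regardless of how many dates are in the loop, at the cost of holding all of the rows in memory until the write.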

Survival Analysis with Techtonique

Published 2025-06-02 by Kevin Feasel

Thierry Moudiki shows off a survival analysis:

In today’s post, we’ll see how to use rush and the probabilistic survival analysis API provided by techtonique.net (along with R and Python) to plot survival curves. Note that the web app also contains a page for plotting these curves in 1 click. You can also read this post for more Python examples.

Click through for the demo. H/T R-Bloggers.

Querying Multiple Lakehouse Tables via Python Notebook

Published 2025-05-29 by Kevin Feasel

Gilbert Quevauvilliers builds a data frame:

In this blog post I am going to explain how to query multiple Lakehouse tables into a data frame.

The example I am going to use is when you want to load new data into your staging tables, but you need to know the max date from your previous data load.

Read on to see how. The answer, as you might suspect, involves DuckDB.

Writing DAX Query Outputs to Lakehouse Tables

Published 2025-05-15 by Kevin Feasel

Gilbert Quevauvilliers does a bit of writing:

In this blog post I am going to explain how to use a Python Notebook using the Semantic Link module, to run a DAX query and write the output to a Lakehouse table.

I will show you how to install a Python library and then use it within my python notebook.

Read on for a quick primer on Semantic Link Labs, followed by the meat of the article.

Category: Python