Category: Python

Comparing Techniques for Text Featurization in Classification Problems

Ivan Palomares Carrascosa tries a few things:

In this article, you will learn how Bag-of-Words, TF-IDF, and LLM-generated embeddings compare when used as text features for classification and clustering in scikit-learn.

Topics we will cover include:

  • How to generate Bag-of-Words, TF-IDF, and LLM embeddings for the same dataset.
  • How these representations compare on text classification performance and training speed.
  • How they behave differently for unsupervised document clustering.

Click through for results. Granted, the specific embedding model can alter the quality of results, but even so, I do enjoy the comparison of techniques and the reminder that neural networks aren’t the ultimate solution to everything.
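
To make the first two representations concrete, here is a minimal from-scratch sketch of Bag-of-Words and one textbook TF-IDF variant. The toy corpus and helper names are my own assumptions, not taken from the article, and scikit-learn's `TfidfVectorizer` applies IDF smoothing and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines do much more."""
    return text.lower().split()


# Sorted vocabulary so that column order is deterministic.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})


def bag_of_words(doc):
    """One raw count per vocabulary term."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]


def tf_idf(doc):
    """Raw term count times log(N / document frequency)."""
    n_docs = len(docs)
    counts = Counter(tokenize(doc))
    vector = []
    for term in vocab:
        doc_freq = sum(1 for d in docs if term in tokenize(d))
        vector.append(counts[term] * math.log(n_docs / doc_freq))
    return vector


bow = [bag_of_words(d) for d in docs]
tfidf = [tf_idf(d) for d in docs]
```

Terms that appear in fewer documents get a larger weight under TF-IDF, which is the main difference from plain counts.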

Web Scraping with Python

Jason Yousef has a script:

Below is a production-friendly pattern that:

  • Uses a requests.Session with retries, backoff, and a real User-Agent
  • Sets sane timeouts and handles common HTTP errors
  • Respects robots.txt (and tells you if scraping is disallowed)
  • Parses only mailto: links by default to avoid scraping personal data you shouldn’t
  • Handles pagination with a “Next” link when present
  • Exports to CSV
  • Can be run from the command line with arguments

Click through for the code, some explanation of how it works, and a few tips.
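
As a rough illustration of three of those bullet points (the robots.txt check, mailto-only parsing, and CSV export), here is a standard-library sketch. It is not Jason's script: the function names, sample robots.txt, and sample HTML are all hypothetical, and the HTTP side (session, retries, timeouts, pagination) is left out so that the example runs offline.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MailtoParser(HTMLParser):
    """Collects addresses from <a href="mailto:..."> links only."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any ?subject=... query part.
            address = href[len("mailto:"):].split("?", 1)[0].strip()
            if address and address not in self.emails:
                self.emails.append(address)


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt content that was already downloaded."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return robots.can_fetch(user_agent, url)


def emails_to_csv(emails):
    """Render the addresses as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["email"])
    writer.writerows([e] for e in emails)
    return buffer.getvalue()


# Hypothetical inputs standing in for what an HTTP client would fetch.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
PAGE = (
    '<a href="mailto:info@example.com?subject=Hi">Mail us</a> '
    '<a href="/page/2">Next</a>'
)

mailto_parser = MailtoParser()
mailto_parser.feed(PAGE)
csv_text = emails_to_csv(mailto_parser.emails)
```

Keeping the parsing and robots.txt logic separate from the network calls also makes this kind of script easier to test.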

Using the mssql-python Driver

Hristo Hristov tries out a driver:

Programmatic interaction with SQL Server or Azure SQL from a Python script is possible using a driver. A popular driver has been pyodbc, which can be used standalone or with a SQLAlchemy wrapper. SQLAlchemy on its own is the Python SQL toolkit and Object Relational Mapper for developers. At the end of 2025, Microsoft released v1 of their own Python SQL driver called mssql-python. How do you get started using mssql-python for programmatic access to your SQL Server?

Click through to see how it works. Hristo points out a couple of benefits to this driver over the classic pyodbc driver, though I’m curious if there are any performance differences between the two.

The Downsides of Python

Andy Brown writes a companion piece:

Four years ago I wrote a blog on this site explaining why Python is better than C# and, arguably, most other programming languages. To redress the balance, here are 10 reasons why you might want to avoid getting caught up in Python’s oh-so-tempting coils – particularly when building large, long-lived systems.

If this sounds like an attempt to have my cake and eat it, my defense is that I follow in my work what I preach here: I use Python for ad-hoc jobs, at which it is unsurpassed. For larger systems – such as our MV website – I use C#, due to its strengths in maintainability and tooling, as well as the practical consideration that my personal preference for Visual Basic is not shared by the wider team.

Some of it is opinion, some of it is annoying. I’ve grown to appreciate the spacing, though it can be really painful when copying code from somewhere and the spacing gets all messed up. My short version of Python is that it requires you to have more discipline as a developer to prevent messes from occurring, and I think that’s a negative on net. But that same aspect simultaneously makes it so much easier to prototype and rapidly solve problems, so there’s a natural trade-off here.

Fixtures in Pytest

Jason Yousef shows off a capability in Pytest:

Pytest is one of those tools that feels obvious after you’ve used it for a bit. Tests are just functions. Assertions read like normal Python. And when you need context—database sessions, config, mock data—you reach for fixtures instead of duct tape.

Read on to see how they work. Admittedly, I don’t think I’ve used fixtures before in Pytest, but now seems like a good time to try them.
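
For anyone else new to fixtures, here is a minimal sketch of the idea. The `make_config` helper, the `config` fixture, and the settings inside them are hypothetical names of my own, not something from Jason's post.

```python
import pytest


def make_config():
    """Plain helper, so the setup logic is reusable outside pytest too."""
    return {"db_url": "sqlite:///:memory:", "retries": 3}


@pytest.fixture
def config():
    # pytest runs this function and passes its return value to any
    # test that lists `config` as a parameter.
    return make_config()


def test_retries_default(config):
    # The argument name matches the fixture name, so pytest injects it.
    assert config["retries"] == 3
```

Saved in a file such as `test_config.py`, this would be picked up by running `pytest` in the same directory.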

Hosting an ML Model with FastAPI

Kanwal Mehreen hosts a model:

In this article, you will learn how to package a trained machine learning model behind a clean, well-validated HTTP API using FastAPI, from training to local testing and basic production hardening.

Topics we will cover include:

  • Training, saving, and loading a scikit-learn pipeline for inference
  • Building a FastAPI app with strict input validation via Pydantic
  • Exposing, testing, and hardening a prediction endpoint with health checks

Let’s explore these techniques. 

I definitely enjoy how simple it is to use FastAPI.

A Primer on Data Analysis with Python and SQL Server

Eduardo Pivaral shows off a few examples of analysis techniques:

With the rise of cloud, automation and managed services, the role of the Database Administrator has pivoted towards Data Engineering. The focus is to maintain, secure, and cleanse data to support data analysis and decision making by the business.

How can we start using modern data analysis tools with our current SQL Server infrastructure? Further, how can we start providing end users and decision makers with important insights about our data, without spending extra money on enterprise data analysis tools?

Click through for demonstrations of k-means clustering for discerning categorical groups of data, simple demand forecasting, and generating customer segments.

Python Libraries for Advanced Time Series Forecasting

Ivan Palomares Carrascosa has a list:

Fortunately, Python’s ecosystem has evolved to meet this demand. The landscape has shifted from purely statistical packages to a rich array of libraries that integrate deep learning, machine learning pipelines, and classical econometrics. But with so many options, choosing the right framework can be overwhelming.

This article cuts through the noise to focus on 5 powerhouse Python libraries designed specifically for advanced time series forecasting. We move beyond the basics to explore tools capable of handling high-dimensional data, complex seasonality, and exogenous variables. For each library, we provide a high-level overview of its standout features and a concise “Hello World” code snippet to get you started immediately.

Click through for an explanation of each of the five libraries.

Multi-Column Indexes in Pandas DataFrames

Brendan Tierney has a multi-part key:

It’s a little annoying when an API changes the structure of the data it returns and you end up with your code breaking. In my case, I experienced it when a dataframe having a single column index went to having a multi-column index. This was a new experience for me at the time, as I hadn’t really come across it before. The following illustrates a case similar to (but not the same as) one you might encounter. In this test/demo scenario I’ll be using the yfinance API to illustrate how you can remove the multi-column index and go back to having a single column index.

Brendan essentially builds a hierarchy and filters down so that a single key column (here, a date) is relevant.
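
One way to get back to single-level columns is to drop a level from the column index. The sketch below builds a small hypothetical frame by hand rather than calling yfinance, so the ticker, level names, and values are assumptions of mine, and this may not be the exact approach Brendan takes.

```python
import pandas as pd

# Hypothetical frame with two-level column labels: (field, ticker).
columns = pd.MultiIndex.from_tuples(
    [("Close", "ABC"), ("Volume", "ABC")],
    names=["Price", "Ticker"],
)
df = pd.DataFrame(
    [[10.0, 100], [11.0, 150]],
    index=pd.to_datetime(["2025-01-02", "2025-01-03"]),
    columns=columns,
)

# Drop the ticker level so the columns are single-level again.
flat = df.droplevel("Ticker", axis=1)
```

This only makes sense when the dropped level holds a single value, as it does here. With several tickers, dropping the level would leave duplicate column names.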

How Data Leakage Can Hurt Model Performance

Ivan Palomares Carrascosa leaks some data:

In this article, you will learn what data leakage is, how it silently inflates model performance, and practical patterns for preventing it across common workflows.

Topics we will cover include:

  • Identifying target leakage and removing target-derived features.
  • Preventing train–test contamination by ordering preprocessing correctly.
  • Avoiding temporal leakage in time series with proper feature design and splits.

Read on to learn more.
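
The second and third bullet points can be sketched in a few lines of plain Python: split chronologically, fit the preprocessing on the training rows only, and then reuse those parameters for the test rows. The data and variable names below are hypothetical and not taken from the article.

```python
import statistics

# Hypothetical, time-ordered feature values (oldest first).
values = [10.0, 12.0, 11.0, 13.0, 50.0, 55.0]

# Split chronologically: a time series should not be shuffled first.
split = 4
train, test = values[:split], values[split:]

# Fit the scaler on the training rows only...
mean = statistics.fmean(train)
std = statistics.pstdev(train)


def scale(xs):
    """Standardize using parameters learned from the training set."""
    return [(x - mean) / std for x in xs]


# ...then apply the same parameters to both sets.
train_scaled = scale(train)
test_scaled = scale(test)

# The leaky alternative computes statistics over all rows, so
# information from the test period seeps into the training features.
leaky_mean = statistics.fmean(values)
```

Here the leaky mean is pulled upward by the two later values, which is exactly the kind of future information a model should not see during training.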
