
Category: Python

Estimating Probabilities from Unevenly Collected Data

Nina Zumel answers an important question:

In this article, we look at the problem of estimating and comparing probabilities about a population of subjects from unevenly collected observations. Some examples might include:

  • The perceived quality of a movie (how often is a movie positively reviewed) when some movies have far more reviews than others.
  • The effectiveness of various ad campaigns, when some campaigns have had more exposure than others.
  • The efficacy of a certain medical procedure by hospital, when some hospitals have had more cases than others.

For our specific task, we’ll try to estimate the “innate” batting ability (the probability of making a hit when at bat) of major league baseball players in 2023. For the sake of this article, we will take this single season of data as everything that we know about these players and their batting statistics.

It’s an interesting problem because she’s treating 2023 data as a stand-in for each player’s entire career, with the goal of estimating how a player will perform overall from a reasonably sized sample collected over one relatively short period of that career. H/T John Mount.
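
To make the idea concrete, here is a minimal sketch of one common way to handle uneven sample sizes: shrinking each raw rate toward a population-wide Beta prior. This is not necessarily the approach the article takes. The player records are made up, and the `min_at_bats` cutoff and the method-of-moments fit are assumptions chosen to keep the example short.

```python
from statistics import mean, variance

# Hypothetical (hits, at_bats) records; these are NOT real 2023 statistics.
players = {
    "A": (2, 4),      # tiny sample, raw average 0.500
    "B": (45, 150),
    "C": (150, 550),
    "D": (30, 120),
    "E": (130, 520),
}

def fit_beta_prior(records, min_at_bats=100):
    """Method-of-moments Beta(alpha, beta) fit to the raw hit rates.

    Only players with at least `min_at_bats` at-bats are used, so that
    tiny samples do not dominate the spread (an assumption of this sketch).
    """
    rates = [h / ab for h, ab in records if ab >= min_at_bats]
    m, v = mean(rates), variance(rates)
    # The formula is only valid when 0 < v < m * (1 - m).
    k = m * (1 - m) / v - 1
    return m * k, (1 - m) * k

def shrunk_estimate(hits, at_bats, alpha, beta):
    """Posterior mean of the hit probability under a Beta(alpha, beta) prior."""
    return (hits + alpha) / (at_bats + alpha + beta)

alpha, beta = fit_beta_prior(players.values())
prior_mean = alpha / (alpha + beta)
estimates = {
    name: shrunk_estimate(h, ab, alpha, beta)
    for name, (h, ab) in players.items()
}

for name, (h, ab) in players.items():
    print(f"{name}: raw={h / ab:.3f} shrunk={estimates[name]:.3f} (n={ab})")
```

The player with four at-bats is pulled almost all the way to the group average, while players with hundreds of at-bats keep more of their own raw rate. The observed spread of raw rates includes sampling noise, so a fit like this is rough; a fuller treatment would model that noise explicitly.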


A Look at Tabular Foundation Models

Michael Mayer tries out a neural network model:

Tabular data has had a comfortable life for years. Gradient boosting showed up, got very good at its job, and then quietly became the default answer to almost everything with rows and columns.

In very recent years, a new player has arrived: the tabular foundation model or prior fitted neural network, and suddenly tabular data is sounding a lot less sleepy…

I’ve done a bit with TabPFN and come away fairly impressed. I’ll have to give this a go as well. There are definite limitations to data sizes before things fall over, but for moderate sizes (50k or fewer rows), TabPFN at least worked pretty well.


Performing ELT with Python and DuckDB

Jamal Hansen shows off a capable in-memory analytic database:

This is a real-world example of a common data engineering pattern. You may have heard of ETL (Extract, Transform, Load), where data is transformed before it reaches its destination. What we are actually building today is the more modern variant, ELT: Extract, Load, Transform.

Read on for the process. I like DuckDB a lot and this is one of the types of use cases in which it excels.
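
As a rough illustration of the ELT ordering described above, the sketch below lands raw rows untouched in a staging table and only then cleans them with SQL inside the database. It uses the standard library's `sqlite3` as a stand-in for DuckDB so that it runs without extra installs. The CSV content, table names, and function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount,ordered_at
1, alice ,19.99,2024-01-05
2,BOB,5.00,2024-01-05
3,alice,,2024-01-06
"""

def extract(text):
    """Extract: read the source rows without changing them."""
    return list(csv.DictReader(io.StringIO(text)))

def load(conn, rows):
    """Load: land everything as text in a raw staging table."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT, ordered_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders "
        "VALUES (:order_id, :customer, :amount, :ordered_at)",
        rows,
    )

def transform(conn):
    """Transform: clean and aggregate with SQL inside the database."""
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT lower(trim(customer)) AS customer,
               round(sum(CAST(amount AS REAL)), 2) AS total
        FROM raw_orders
        WHERE amount <> ''
        GROUP BY lower(trim(customer))
        """
    )

conn = sqlite3.connect(":memory:")
load(conn, extract(RAW_CSV))
transform(conn)

raw_count = conn.execute("SELECT count(*) FROM raw_orders").fetchone()[0]
totals = dict(conn.execute("SELECT customer, total FROM customer_totals"))
print(raw_count, totals)
```

Because the messy rows stay in `raw_orders`, the transform step can be rewritten and rerun later without going back to the source, which is the main practical difference from transforming before the load.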


Bulk Loading Data with mssql-python

Chad Callihan loads some data:

I’ve had some projects in the past that involved using Python to load data in SQL Server. It wasn’t unbearably slow, but it seemed like a process that could be faster. For that reason, a recent SQL Server blog post about bulk loading data with Python caught my eye. I decided to test out the new mssql-python 1.4.0 mentioned in that post and see how much of an impact it would make on loading speed.

Chad saw about a 10x improvement in performance. I’ve had some similar results in production environments. The mssql-python library is a legitimate improvement over the classic ODBC driver and pyodbc.


Training, Serving, and Deploying Scikit-Learn Models via FastAPI

Abid Ali Awan serves a model:

In this article, you will learn how to train a Scikit-learn classification model, serve it with FastAPI, and deploy it to FastAPI Cloud.

Topics we will cover include:

  • How to structure a simple project and train a Scikit-learn model for inference.
  • How to build and test a FastAPI inference API locally.
  • How to deploy the API to FastAPI Cloud and prepare it for more production-ready usage.

Click through for the process.


Zero-Shot Text Classification in Python

Abid Ali Awan doesn’t have time to train:

In this article, you will learn how zero-shot text classification works and how to apply it using a pretrained transformer model.

Topics we will cover include:

  • The core idea behind zero-shot classification and how it reframes labeling as a reasoning task.
  • How to use a pretrained model to classify text without task-specific training data.
  • Practical techniques such as multi-label classification and hypothesis template tuning.

This typically works best when the set of classes is quite distinct and limited in number. Once you get past a handful of classes, the likelihood of spurious results increases considerably, and that’s when you’re back to training or fine-tuning a model on a sufficient quantity of labeled data.


Comparing Techniques for Text Featurization in Classification Problems

Iván Palomares Carrascosa tries a few things:

In this article, you will learn how Bag-of-Words, TF-IDF, and LLM-generated embeddings compare when used as text features for classification and clustering in scikit-learn.

Topics we will cover include:

  • How to generate Bag-of-Words, TF-IDF, and LLM embeddings for the same dataset.
  • How these representations compare on text classification performance and training speed.
  • How they behave differently for unsupervised document clustering.

Click through for results. Granted, the specific embedding model can alter the quality of results, but even so, I do enjoy the comparison of techniques and the reminder that neural networks aren’t the ultimate solution to everything.
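
For a feel of how the two classical representations differ, here is a small pure-Python sketch of Bag-of-Words counts and a textbook TF-IDF weighting over a made-up corpus. It is a stand-in for illustration and not the article's code. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers would differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def bag_of_words(tokens):
    """Raw term counts, one slot per vocabulary term."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def idf(term):
    """Textbook inverse document frequency: log(N / document frequency)."""
    df = sum(term in doc for doc in tokenized)
    return math.log(len(tokenized) / df)

def tf_idf(tokens):
    """Term count scaled by how rare the term is across the corpus."""
    return [c * idf(t) for c, t in zip(bag_of_words(tokens), vocab)]

bow = [bag_of_words(doc) for doc in tokenized]
tfidf = [tf_idf(doc) for doc in tokenized]

the_idx, cat_idx = vocab.index("the"), vocab.index("cat")
print("counts:", bow[0][the_idx], bow[0][cat_idx])
print("tf-idf:", round(tfidf[0][the_idx], 3), round(tfidf[0][cat_idx], 3))
```

In the first document "the" has the higher raw count, but "cat" ends up with the larger TF-IDF weight because it appears in only one document.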


Web Scraping with Python

Jason Yousef has a script:

Below is a production-friendly pattern that:

  • Uses a requests.Session with retries, backoff, and a real User-Agent
  • Sets sane timeouts and handles common HTTP errors
  • Respects robots.txt (and tells you if scraping is disallowed)
  • Parses only mailto: links by default to avoid scraping personal data you shouldn’t
  • Handles pagination with a “Next” link when present
  • Exports to CSV
  • Can be run from the command line with arguments

Click through for the code, some explanation of how it works, and a few tips.
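
Two of the items in that list, the robots.txt check and the mailto-only parsing, can be sketched with the standard library alone. This is not the author's script: it skips the `requests` session, retries, pagination, and CSV export entirely, and the user agent string, sample robots.txt, and sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper/0.1"  # hypothetical user agent string

def allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching `url`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

class MailtoParser(HTMLParser):
    """Collect addresses from <a href="mailto:..."> links only."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any ?subject=... query part.
            self.emails.append(href[len("mailto:"):].split("?", 1)[0])

# Hypothetical inputs standing in for a fetched robots.txt and page body.
ROBOTS = """User-agent: *
Disallow: /private/
"""
PAGE = (
    '<a href="mailto:team@example.com?subject=Hi">Mail</a> '
    '<a href="/next">Next</a>'
)

parser = MailtoParser()
parser.feed(PAGE)

print(allowed(ROBOTS, "https://example.com/contact"))
print(allowed(ROBOTS, "https://example.com/private/x"))
print(parser.emails)
```

In a real run the robots.txt text and page body would come from HTTP responses, and a disallowed URL would be skipped before any request is made.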


Using the mssql-python Driver

Hristo Hristov tries out a driver:

Programmatic interaction with SQL Server or Azure SQL from a Python script is possible using a driver. A popular driver has been pyodbc, which can be used standalone or with a SQLAlchemy wrapper. SQLAlchemy on its own is the Python SQL toolkit and Object Relational Mapper for developers. At the end of 2025, Microsoft released v1 of their own Python SQL driver called mssql-python. How do you get started using mssql-python for programmatic access to your SQL Server?

Click through to see how it works. Hristo points out a couple of benefits to this driver over the classic pyodbc driver, though I’m curious if there are any performance differences between the two.


The Downsides of Python

Andy Brown writes a companion piece:

Four years ago I wrote a blog on this site explaining why Python is better than C# and, arguably, most other programming languages. To redress the balance, here are 10 reasons why you might want to avoid getting caught up in Python’s oh-so-tempting coils – particularly when building large, long-lived systems.

If this sounds like an attempt to have my cake and eat it, my defense is that I follow in my work what I preach here: I use Python for ad-hoc jobs, at which it is unsurpassed. For larger systems, such as our MV website, I use C#, due to its strengths in maintainability and tooling, as well as the practical consideration that my personal preference for Visual Basic is not shared by the wider team.

Some of it is opinion, some of it is annoying. I’ve grown to appreciate the spacing, though it can be really painful when copying code from somewhere and the spacing gets all messed up. My short version of Python is that it requires you to have more discipline as a developer to prevent messes from occurring, and I think that’s a negative on net. But that same aspect simultaneously makes it so much easier to prototype and rapidly solve problems, so there’s a natural trade-off here.
