Press "Enter" to skip to content

Category: Python

A Primer on Bayesian Modeling

Hristo Hristov is speaking my language:

Multivariate analysis in data science is a type of analysis that tackles multiple input/predictor and output/predicted variables. This tip explores the problem of predicting air pollution measured in particulate matter (PM) concentration based on ambient temperature, humidity, and pressure using a Bayesian Model.

Click through for a detailed code sample and explanation.

Comments closed

Using the Tabular Object Model via Semantic Link Labs

Gilbert Quevauvilliers does a bit of connecting:

In this blog post I am going to show you how to use the powerful Semantic Link Labs library for Tabular Object Model (TOM) for semantic model manipulation.

The goal of this blog post is to give you an understanding of how to connect using TOM, then based on the documentation use one of the functions.

Don’t get me wrong the documentation is great, but when implementing it, it works a little differently and I want others to know how to use it, so it can automate and simplify some repetitive tasks.

Read on for the instructions and some of the things you can do with the Semantic Link Labs library in Microsoft Fabric.

Comments closed

Visualizing ML Model Outcomes with Matplotlib

Matthew Mayo shares a few tips:

Visualizing model performance is an essential piece of the machine learning workflow puzzle. While many practitioners can create basic plots, elevating these from simple charts to insightful, elevated visualizations that can help easily tell the story of your machine leanring model’s interpretations and predictions is a skill that sets great professionals apart. The Matplotlib library, the foundational plotting tool in the scientific and computational Python ecosystem, is packed with features that can help you achieve this.

This tutorial provides 7 practical Matplotlib tricks that will help you better understand, evaluate, and present your machine learning models. We’ll move beyond the default settings to create visualizations that are not only aesthetically pleasing but also rich in information. These techniques are designed to integrate smoothly into your workflow with libraries like NumPy and Scikit-learn.

Click through for those tips.

Comments closed

Text Classification with Decision Trees

Ivan Palomares Carrascosa takes us through a simple natural language processing problem and solution:

It’s no secret that decision tree-based models excel at a wide range of classification and regression tasks, often based on structured, tabular data. However, when combined with the right tools, decision trees also become powerful predictive tools for unstructured data, such as text or images, and even time series data.

This article demonstrates how to build decision trees for text data. Specifically, we will incorporate text representation techniques like TF-IDF and embeddings in decision trees trained for spam email classification, evaluating their performance and comparing the results with another text classification model — all with the aid of Python’s Scikit-learn library.

Read on for the demos and to see how three different approaches work.

Comments closed

Feature Importance in XGBoost

Ivan Palomares Carrascosa takes a look at one of my favorite plots in XGBoost:

One of the most widespread machine learning techniques is XGBoost (Extreme Gradient Boosting). An XGBoost model — or an ensemble that combines multiple models into a single predictive task, to be more precise — builds several decision trees and sequentially combines them, so that the overall prediction is progressively improved by correcting the errors made by previous trees in the pipeline.

Just like standalone decision trees, XGBoost can accommodate both regression and classification tasks. While the combination of many trees into a single composite model may obscure its interpretability at first, there are still mechanisms to help you interpret an XGBoost model. In other words, you can understand why predictions are made and how input features contributed to them.

This article takes a practical dive into XGBoost model interpretability, with a particular focus on feature importance.

Read on to learn more about how feature importance works, as well as the three different views of the data you can get.

Comments closed

Portfolio Theory and Risk Reduction

John Mount continues a series on risk optimization:

I want to discuss how fragile optimization solutions to real world problems can be. And how to solve that.

Small changes in modeling strategy, assumptions, data, estimates, constraints, or objective can lead to unstable and degenerate solutions. To warm up let’s discuss one of the most famous optimization examples: Stigler’s minimal subsistence diet problem.

There are some neat stories in the post as you walk through problems of linear programming.

Also, Nina Zumel has a post on overestimation bias:

Revenue optimization projects can be particularly valuable and exciting. They involve:

  • Estimating demand as a function of offered features, price, and match to market.
  • Picking a set of offerings and prices optimizing the above inferred demand.

The great opportunity of these projects is that one can derive value from improving the inference of the demand estimate function, improving the optimization, and even improving the synergy between these two steps.

However, there is a common situation that can lose client trust and sink revenue optimization projects.

Read on for that article.

Comments closed

Using Python Code in SSIS

Tim Mitchell shoe-horns a language in:

SQL Server Integration Services (SSIS) is a mature, proven tool for ETL orchestration and data movement. In recent years, Python has exploded in popularity as a data movement and analysis tool. Surprisingly, though, there are no native hooks for Python in SSIS. In my experience using each of these tools independently, I’d love to see an extension of SSIS to naturally host Python integrations.

Fortunately, with a bit of creativity, it is possible to invoke Python logic in SSIS packages. In this post, I’ll walk you through the tasks to merge Python and SSIS together. If you want to follow along on your own, you can clone the repo I created for this project.

Honestly, it’s not that surprising. The last time there was significant development on Integration Services was roughly 2012 (unless you include the well-intentioned but barely-functional Hadoop support they added in around 2016). At that point, in the Windows world, Python was not at all a dominant programming language.

Comments closed

Time Series Feature Engineering in Pandas

Matthew Mayo knows that time is a flat circle:

Feature engineering is one of the most important steps when it comes to building effective machine learning models, and this is no less important when dealing with time-series data. By being able to create meaningful features from temporal data, you can unlock predictive power that is unavailable when applied to raw timestamps alone.

Fortunately for us all, Pandas offers a powerful and flexible set of operations for manipulating and creating time-series features.

Click through for seven things you can do in Pandas to extend or work with time series data.

Comments closed

Working with Microsoft’s First-Party Python Driver

Sebastiao Pereira takes a look at mssql-python:

Python can connect to SQL Server using drivers like pyodbc and pymssql. However, Microsoft recently released a new Python driver called Python Driver for SQL Server or mssql-python. Currently in preview, Microsoft describes it as “the only first-party driver.” So, what’s this new driver all about, and how do you use it? Learn how to configure Python to connect to SQL Server with this new driver.

My standard caveat applies: this looks pretty neat, assuming that Microsoft actually continues to support it. Sebastiao mentions that it requires Python 3.13, but the docs say 3.10 or later. If the former is true, it might be a while before a lot of shops actually use it. But if the latter is true, most Python installations should support the driver out of the box.

Comments closed

Getting beyond Pandas

Shittu Olumide recommends a few other packages:

If you’ve worked with data in Python, chances are you’ve used Pandas many times. And for good reason; it’s intuitive, flexible, and great for day-to-day analysis. But as your datasets start to grow, Pandas starts to show its limits. Maybe it’s memory issues, sluggish performance, or the fact that your machine sounds like it’s about to lift off when you try to group by a few million rows.

That’s the point where a lot of data analysts and scientists start asking the same question: what else is out there?

Read on for seven options, including six libraries and one built-in programming technique.

Comments closed