Press "Enter" to skip to content

Category: Python

Time Series Helpers in NumPy

Bala Priya C shares some one-liners:

NumPy’s array operations can help simplify most common time series operations. Instead of thinking step-by-step through data transformations, you can apply vectorized operations that process entire datasets at once.

This article covers 10 NumPy one-liners that can be used for time series analysis tasks you’ll come across often. Let’s get started!

Click through to see the ten in action.
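
For a taste of what vectorized time series work looks like, here is a small sketch (not necessarily the one-liners from the article; the data and window size are made up):

```python
import numpy as np

# Hypothetical daily series; any 1-D float array works here.
rng = np.random.default_rng(42)
prices = 100 + np.cumsum(rng.normal(0, 1, size=365))

# Period-over-period change in one vectorized step (no Python loop).
daily_change = np.diff(prices)

# Percentage change relative to the previous observation.
pct_change = np.diff(prices) / prices[:-1]

# 7-observation rolling mean via a sliding-window view.
rolling_mean_7 = np.lib.stride_tricks.sliding_window_view(prices, 7).mean(axis=1)

# Cumulative maximum, handy for drawdown-style calculations.
running_max = np.maximum.accumulate(prices)
```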

Tips for Working with Pandas

Matthew Mayo has a few tips when working with Pandas for data preparation:

If you’re reading this, it’s likely that you are already aware that the performance of a machine learning model is not just a function of the chosen algorithm. It is also highly influenced by the quality and representation of the data that said model has been trained on.

Data preprocessing and feature engineering are some of the most important steps in your machine learning workflow. In the Python ecosystem, Pandas is the go-to library for these types of data manipulation tasks, something you also likely know. Mastering a few select Pandas data transformation techniques can significantly streamline your workflow, make your code cleaner and more efficient, and ultimately lead to better performing models.

This tutorial will walk you through seven practical Pandas scenarios and the tricks that can enhance your data preparation and feature engineering process, setting you up for success in your next machine learning project.

Click through for those tips and tricks.
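
For a sense of the kind of transformations involved, here is a minimal sketch of a few common Pandas preparation steps, using an invented dataset rather than anything from the article:

```python
import numpy as np
import pandas as pd

# Small made-up dataset standing in for real training data.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 43],
    "income": [40_000, 65_000, 52_000, np.nan, 88_000],
    "city": ["Austin", "Boston", "Austin", "Chicago", "Boston"],
})

# Impute numeric gaps with the column median.
df[["age", "income"]] = df[["age", "income"]].fillna(df[["age", "income"]].median())

# One-hot encode a categorical column for model consumption.
df = pd.get_dummies(df, columns=["city"], prefix="city")

# Bin a continuous feature into ordered categories.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 45, 120], labels=["young", "mid", "senior"])
```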

Using the Tabular Object Model via Semantic Link Labs

Gilbert Quevauvilliers does a bit of connecting:

In this blog post I am going to show you how to use the powerful Semantic Link Labs library for Tabular Object Model (TOM) for semantic model manipulation.

The goal of this blog post is to give you an understanding of how to connect using TOM, then based on the documentation use one of the functions.

Don’t get me wrong, the documentation is great, but when implementing it, things work a little differently, and I want others to know how to use it so it can automate and simplify some repetitive tasks.

Read on for the instructions and some of the things you can do with the Semantic Link Labs library in Microsoft Fabric.
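
As a rough idea of the connection pattern, here is a sketch based on the library's documented connect_semantic_model helper. The dataset and workspace names are placeholders, and the exact signature may vary by version of semantic-link-labs:

```python
# In a Microsoft Fabric notebook with semantic-link-labs installed:
# %pip install semantic-link-labs
from sempy_labs.tom import connect_semantic_model

# connect_semantic_model is a context manager that wraps the Tabular
# Object Model; dataset and workspace names here are placeholders.
with connect_semantic_model(dataset="Sales Model", workspace="My Workspace", readonly=True) as tom:
    # Enumerate tables and measures through the TOM wrapper.
    for table in tom.model.Tables:
        print(table.Name)
        for measure in table.Measures:
            print("  ", measure.Name)
```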

Visualizing ML Model Outcomes with Matplotlib

Matthew Mayo shares a few tips:

Visualizing model performance is an essential piece of the machine learning workflow puzzle. While many practitioners can create basic plots, elevating these from simple charts to insightful visualizations that can help easily tell the story of your machine learning model’s interpretations and predictions is a skill that sets great professionals apart. The Matplotlib library, the foundational plotting tool in the scientific and computational Python ecosystem, is packed with features that can help you achieve this.

This tutorial provides 7 practical Matplotlib tricks that will help you better understand, evaluate, and present your machine learning models. We’ll move beyond the default settings to create visualizations that are not only aesthetically pleasing but also rich in information. These techniques are designed to integrate smoothly into your workflow with libraries like NumPy and Scikit-learn.

Click through for those tips.
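
As one example of moving past the defaults, here is a sketch of an annotated confusion matrix heatmap. It is not necessarily one of the seven tricks in the article, and the data is synthetic:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data so the example runs standalone.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))

# Annotated heatmap: a step up from the bare default plot.
fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(cm, cmap="Blues")
for (i, j), count in np.ndenumerate(cm):
    ax.text(j, i, str(count), ha="center", va="center")
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
ax.set_title("Confusion matrix")
fig.colorbar(im, ax=ax)
plt.show()
```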

Text Classification with Decision Trees

Ivan Palomares Carrascosa takes us through a simple natural language processing problem and solution:

It’s no secret that decision tree-based models excel at a wide range of classification and regression tasks, often based on structured, tabular data. However, when combined with the right tools, decision trees also become powerful predictive tools for unstructured data, such as text or images, and even time series data.

This article demonstrates how to build decision trees for text data. Specifically, we will incorporate text representation techniques like TF-IDF and embeddings in decision trees trained for spam email classification, evaluating their performance and comparing the results with another text classification model — all with the aid of Python’s Scikit-learn library.

Read on for the demos and to see how three different approaches work.
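
The core idea is easy to sketch with Scikit-learn: feed a TF-IDF representation into a decision tree via a pipeline. The corpus below is invented and far smaller than a real spam dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up corpus standing in for a real spam dataset.
texts = [
    "Win a free prize now, click here",
    "Meeting rescheduled to Thursday at 3pm",
    "Cheap meds, limited time offer",
    "Can you review the attached report?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# TF-IDF turns raw text into a numeric matrix the tree can split on.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
])
clf.fit(texts, labels)

print(clf.predict(["Claim your free offer today"]))  # likely [1]
```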

Portfolio Theory and Risk Reduction

John Mount continues a series on risk optimization:

I want to discuss how fragile optimization solutions to real world problems can be. And how to solve that.

Small changes in modeling strategy, assumptions, data, estimates, constraints, or objective can lead to unstable and degenerate solutions. To warm up let’s discuss one of the most famous optimization examples: Stigler’s minimal subsistence diet problem.

There are some neat stories in the post as you walk through problems of linear programming.
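
If you want to poke at a diet-style problem yourself, here is a toy linear program in SciPy. The costs and nutrient values are invented and much simpler than Stigler's actual data:

```python
import numpy as np
from scipy.optimize import linprog

# Toy Stigler-style diet: minimize cost subject to nutrient minimums.
# Columns: foods; rows: nutrients (calories, protein). All numbers made up.
cost = np.array([0.50, 0.30, 0.20])          # cost per unit of each food
nutrients = np.array([
    [400, 200, 150],   # calories per unit
    [10,   5,   2],    # protein per unit
])
minimums = np.array([2000, 55])               # daily requirements

# linprog minimizes c @ x subject to A_ub @ x <= b_ub, so flip the
# "at least" constraints by negating both sides.
res = linprog(c=cost, A_ub=-nutrients, b_ub=-minimums, bounds=(0, None))
print(res.x, res.fun)
```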

Also, Nina Zumel has a post on overestimation bias:

Revenue optimization projects can be particularly valuable and exciting. They involve:

  • Estimating demand as a function of offered features, price, and match to market.
  • Picking a set of offerings and prices optimizing the above inferred demand.

The great opportunity of these projects is that one can derive value from improving the inference of the demand estimate function, improving the optimization, and even improving the synergy between these two steps.

However, there is a common situation that can lose client trust and sink revenue optimization projects.

Read on for that article.
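
To make the two-step setup concrete, here is a toy sketch: fit a demand curve from noisy observations, then pick the price that maximizes the inferred revenue. Every number is made up, and the last lines hint at where an over-optimistic projection can sneak in:

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: estimate demand as a function of price from (noisy) observations.
prices = np.linspace(5, 25, 40)
true_demand = 1000 - 30 * prices                      # unknown in practice
observed = true_demand + rng.normal(0, 50, prices.size)
slope, intercept = np.polyfit(prices, observed, 1)

# Step 2: pick the price that maximizes inferred revenue = price * demand_hat.
candidate_prices = np.linspace(5, 25, 200)
demand_hat = intercept + slope * candidate_prices
best_price = candidate_prices[np.argmax(candidate_prices * demand_hat)]

# The revenue you *project* at that price uses the same noisy estimate you
# optimized over, which is where an over-optimistic forecast can creep in.
projected = best_price * (intercept + slope * best_price)
actual = best_price * (1000 - 30 * best_price)
print(best_price, projected, actual)
```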

Feature Importance in XGBoost

Ivan Palomares Carrascosa takes a look at one of my favorite plots in XGBoost:

One of the most widespread machine learning techniques is XGBoost (Extreme Gradient Boosting). An XGBoost model — or an ensemble that combines multiple models into a single predictive task, to be more precise — builds several decision trees and sequentially combines them, so that the overall prediction is progressively improved by correcting the errors made by previous trees in the pipeline.

Just like standalone decision trees, XGBoost can accommodate both regression and classification tasks. While the combination of many trees into a single composite model may obscure its interpretability at first, there are still mechanisms to help you interpret an XGBoost model. In other words, you can understand why predictions are made and how input features contributed to them.

This article takes a practical dive into XGBoost model interpretability, with a particular focus on feature importance.

Read on to learn more about how feature importance works, as well as the three different views of the data you can get.
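
The three views are presumably the importance types XGBoost exposes (weight, gain, and cover); here is a quick sketch of pulling all three from a model trained on synthetic data:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic data so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3, random_state=0)
model.fit(X, y)

# Three views of importance: how often a feature is used to split
# ('weight'), how much it improves the loss ('gain'), and how many
# rows its splits touch ('cover').
booster = model.get_booster()
for kind in ("weight", "gain", "cover"):
    print(kind, booster.get_score(importance_type=kind))

# Or plot one view directly (my favorite part).
xgb.plot_importance(model, importance_type="gain")
```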

Using Python Code in SSIS

Tim Mitchell shoe-horns a language in:

SQL Server Integration Services (SSIS) is a mature, proven tool for ETL orchestration and data movement. In recent years, Python has exploded in popularity as a data movement and analysis tool. Surprisingly, though, there are no native hooks for Python in SSIS. In my experience using each of these tools independently, I’d love to see an extension of SSIS to naturally host Python integrations.

Fortunately, with a bit of creativity, it is possible to invoke Python logic in SSIS packages. In this post, I’ll walk you through the tasks to merge Python and SSIS together. If you want to follow along on your own, you can clone the repo I created for this project.

Honestly, it’s not that surprising. The last time there was significant development on Integration Services was roughly 2012 (unless you include the well-intentioned but barely-functional Hadoop support they added around 2016). At that point, in the Windows world, Python was not at all a dominant programming language.
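
Tim's exact technique is behind the link, but one common pattern is to have an SSIS Execute Process Task call a standalone Python script. The script below is a hypothetical illustration of that pattern, not necessarily the approach in the post:

```python
# etl_transform.py -- a standalone script an SSIS Execute Process Task
# could invoke, e.g.: python.exe etl_transform.py <input_csv> <output_csv>
# This calling pattern is an assumption for illustration only.
import csv
import sys


def main(input_path: str, output_path: str) -> None:
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Example transformation: trim whitespace from every field.
            writer.writerow({k: (v or "").strip() for k, v in row.items()})


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```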

Time Series Feature Engineering in Pandas

Matthew Mayo knows that time is a flat circle:

Feature engineering is one of the most important steps when it comes to building effective machine learning models, and this is no less important when dealing with time-series data. By creating meaningful features from temporal data, you can unlock predictive power that raw timestamps alone cannot provide.

Fortunately for us all, Pandas offers a powerful and flexible set of operations for manipulating and creating time-series features.

Click through for seven things you can do in Pandas to extend or work with time series data.
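
As a preview of the sort of thing Pandas makes easy, here is a sketch building calendar, lag, and rolling-window features from an invented daily series:

```python
import numpy as np
import pandas as pd

# Made-up daily sales series indexed by date.
idx = pd.date_range("2024-01-01", periods=90, freq="D")
df = pd.DataFrame({"sales": np.random.default_rng(0).poisson(100, size=90)}, index=idx)

# Calendar components extracted from the timestamp index.
df["day_of_week"] = df.index.dayofweek
df["month"] = df.index.month

# Lag features: yesterday's and last week's value.
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)

# Rolling statistics over a trailing 7-day window.
df["rolling_mean_7"] = df["sales"].rolling(7).mean()
df["rolling_std_7"] = df["sales"].rolling(7).std()
```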
