Press "Enter" to skip to content

Category: Python

Multi-Column Indexes in Pandas DataFrames

Brendan Tierney has a multi-part key:

It’s a little annoying when an API changes the structure of the data it returns and you end up with your code breaking. In my case, I experienced it when a dataframe went from having a single-column index to having a multi-column index. This was a new experience for me, as I hadn’t really come across it before. The following illustrates one particular case, similar (though not identical) to one you might encounter. In this test/demo scenario I’ll be using the yfinance API to illustrate how you can remove the multi-column index and go back to having a single-column index.

Here, Brendan essentially builds a hierarchy and filters down to make a single key column (in this case, a date) relevant.
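As a rough sketch of the general idea (not Brendan’s exact code, and with made-up price data standing in for the yfinance output), here is one way to collapse a two-level column index back to a single level in pandas:

import pandas as pd

# Mimic the shape yfinance can return: columns are a two-level
# MultiIndex of (price field, ticker).
idx = pd.date_range("2024-01-01", periods=3, freq="D", name="Date")
cols = pd.MultiIndex.from_product([["Open", "Close"], ["AAPL"]],
                                  names=["Price", "Ticker"])
df = pd.DataFrame([[187.2, 188.0],
                   [186.9, 187.5],
                   [188.1, 189.3]], index=idx, columns=cols)

# Option 1: drop the redundant level when only one ticker is present.
single = df.copy()
single.columns = single.columns.droplevel("Ticker")

# Option 2: flatten both levels into plain strings when several tickers exist.
flat = df.copy()
flat.columns = ["_".join(col) for col in flat.columns]

print(single.columns.tolist())   # ['Open', 'Close']
print(flat.columns.tolist())     # ['Open_AAPL', 'Close_AAPL']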

How Data Leakage Can Hurt Model Performance

Ivan Palomares Carrascosa leaks some data:

In this article, you will learn what data leakage is, how it silently inflates model performance, and practical patterns for preventing it across common workflows.

Topics we will cover include:

  • Identifying target leakage and removing target-derived features.
  • Preventing train–test contamination by ordering preprocessing correctly.
  • Avoiding temporal leakage in time series with proper feature design and splits.

Read on to learn more.
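As a minimal illustration of the second bullet (not taken from the article), fitting preprocessing inside a scikit-learn Pipeline keeps test-fold statistics out of training:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky pattern: scaling all rows before cross-validation lets the
# test folds influence the statistics the model trains on.
# X_scaled = StandardScaler().fit_transform(X)
# cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Safer pattern: the scaler is refit on each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5).mean())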

Parallelizing Python Code

Osheen MacOscar makes a function faster:

The way it is currently written is how any normal for loop will run, where the current iteration must finish before the next one starts. With this code we shouldn’t need to wait for the previous API call; there is no dependency or anything like that. In theory we could run all of the individual player queries at once and the function would be a lot faster.

Read on to see how.
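For a sense of the general pattern (a sketch only, with a hypothetical endpoint rather than Osheen’s actual API), independent I/O-bound calls like these can be overlapped with a thread pool:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch_player(player_id: int) -> bytes:
    # Hypothetical endpoint, purely for illustration.
    url = f"https://example.com/api/players/{player_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

player_ids = range(1, 51)

# Sequential version: each call waits for the previous one to finish.
# results = [fetch_player(pid) for pid in player_ids]

# Concurrent version: the calls overlap, so total time is roughly
# that of the slowest single call.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch_player, player_ids))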

Getting ML Services Running on SQL Server 2025

Greg Low takes a look at ML Services:

This is an update of a post that I wrote for SQL Server 2022. Unfortunately, those instructions needed to be updated, not because anything notable has changed in SQL Server 2025, but because the recent distribution of Python has changed. Thanks to Peter Bishop for reporting what was now missing.

I hope that the versions Greg mentions—R 4.2 and Python 3.10—aren’t the latest that SQL Server supports, because those are both woefully out of date. Python 3.10 came out almost 4 years ago and R 4.2 is almost 3 years old at this point.

Creating Test Data in Python via Faker

Brendan Tierney generates some artificial data:

At some point everyone needs some test data for their database. There are a number of ways of doing this, and in this post I’ll walk through using the Python library Faker to create some dummy test data (that kind of looks real) in my Oracle Database. I’ll have another post using the GenAI in-database feature available in the Oracle Autonomous Database. So keep an eye out for that.

Faker is one of the available libraries in Python for creating dummy/test data that kind of looks realistic.

Brendan generates some demo customer data, including a credit rating example where each class is assigned according to a chosen probability.
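A small sketch of that idea (illustrative fields and weights, not Brendan’s actual script): Faker supplies the customer attributes, and random.choices assigns the credit rating with a chosen probability per class.

import random
from faker import Faker

fake = Faker()
ratings = ["A", "B", "C", "D"]
weights = [0.15, 0.40, 0.35, 0.10]   # made-up class probabilities

customers = [
    {
        "cust_id": i,
        "name": fake.name(),
        "email": fake.email(),
        "city": fake.city(),
        "signup_date": fake.date_between(start_date="-3y", end_date="today"),
        "credit_rating": random.choices(ratings, weights=weights, k=1)[0],
    }
    for i in range(1, 101)
]

print(customers[0])

Rows like these could then be inserted into a database table with whatever driver you already use (python-oracledb, for example).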

Pandas vs Polars for DataFrame Modification

Russ Hyde compares Pandas and Polars:

In Data Science we are often working with rectangular data structures – databases, spreadsheets, data-frames. Within Python alone, there are multiple ways to work with this type of data, and your choice is constrained by data volume, storage, fluency and so on. For datasets that could readily be held in memory on a single computer, the standard Python tool for rectangling is Pandas, which became an open-source project in 2009. Many other tools now exist though. In particular, the Polars library has become extremely popular in Python over recent years. But when Pandas works, is well-supported, and is the standard tool in your team or your domain, and if you are primarily working with in-memory datasets, is there a value in learning a new data-wrangling tool? Of course there is.

Read on for a demonstration of fairly basic data operations and how they differ in Pandas vs Polars.
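To give a flavour of the difference (a minimal sketch, not Russ’s examples), the same filter-and-aggregate looks like this in each library:

import pandas as pd
import polars as pl

data = {"team": ["a", "a", "b", "b"], "score": [10, 12, 7, 9]}

# Pandas: eager, method-chained on an in-memory DataFrame.
pandas_out = (
    pd.DataFrame(data)
    .query("score > 8")
    .groupby("team", as_index=False)["score"]
    .mean()
)

# Polars: a lazy, expression-based chain of the same logic.
polars_out = (
    pl.DataFrame(data)
    .lazy()
    .filter(pl.col("score") > 8)
    .group_by("team")
    .agg(pl.col("score").mean())
    .collect()
)

print(pandas_out)
print(polars_out)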

Running SQL against Fabric Warehouses via Python

Jared Westover builds a loop:

In a previous article, I ran a SQL script against a Fabric Warehouse 100 times without needing to click ‘Execute’ each time. A WHILE loop could work, but Query Insights treats it as a single execution. While using GO was an option, I wanted a different approach because I’m always trying to expand my skill set. I need a scalable way to run scripts for performance testing.

This is a pretty simple database connection and script execution. For the most part, it would work just fine against any other member of the SQL Server family, with a somewhat different connection string depending on the product.
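The general shape of that pattern, sketched with placeholder connection details rather than Jared’s actual setup, looks something like this with pyodbc:

import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-sql-endpoint>;"                 # Fabric Warehouse SQL endpoint or another SQL Server
    "Database=<your-database>;"
    "Authentication=ActiveDirectoryInteractive;"  # swap for whatever auth your product needs
    "Encrypt=yes;"
)
query = "SELECT COUNT(*) FROM dbo.SomeTable;"     # hypothetical test script

conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
try:
    # Each iteration is a separate execution, so tooling like Query Insights
    # sees 100 distinct runs rather than one long batch.
    for _ in range(100):
        cursor.execute(query)
        cursor.fetchall()
finally:
    conn.close()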

Reviewing Power BI Report Interactions via Semantic Link Labs

Meagan Longoria wants to know about visual interactions:

It can be tedious to check what visual interactions have been configured in a Power BI report. If you have a lot of bookmarks, this becomes even more important. If you do this manually, you have to turn on Edit Interactions and select each visual to see what interactions it is emitting to the other visuals on the page.

But there is a better way!

Click through for that better way.

Packaging and Publishing Python Packages via Poetry

Osheen MacOscar forces me into alliteration:

So far, in the previous blog post, we covered creating our package with Poetry, managing our development environment, and adding a function. In the current blog post we’ll be covering the next steps in package development, including documentation, testing, and how to publish to PyPI.

Read on for several tips on making Python code package-ready and then on how to distribute it via PyPI.
