
Category: Python

Working with Microsoft’s First-Party Python Driver

Sebastiao Pereira takes a look at mssql-python:

Python can connect to SQL Server using drivers like pyodbc and pymssql. However, Microsoft recently released a new Python driver, the Python Driver for SQL Server (mssql-python). Currently in preview, Microsoft describes it as “the only first-party driver.” So, what’s this new driver all about, and how do you use it? Learn how to configure Python to connect to SQL Server with this new driver.

My standard caveat applies: this looks pretty neat, assuming that Microsoft actually continues to support it. Sebastiao mentions that it requires Python 3.13, but the docs say 3.10 or later. If the former is true, it might be a while before a lot of shops actually use it. But if the latter is true, most Python installations should support the driver out of the box.
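For a quick taste before clicking through, here is a minimal sketch of what a connection looks like, assuming the DB-API-style interface the preview docs describe. The server, database, and query are placeholders, not from Sebastiao's article, and the API may still shift while the driver is in preview.

```python
# Minimal sketch with mssql-python (preview; install with: pip install mssql-python)
from mssql_python import connect

# ODBC-style connection string; server and database names are placeholders
conn = connect("Server=localhost;Database=TestDB;Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute("SELECT TOP (5) name FROM sys.tables ORDER BY name;")
for row in cursor.fetchall():
    print(row)
conn.close()
```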


Getting beyond Pandas

Shittu Olumide recommends a few other packages:

If you’ve worked with data in Python, chances are you’ve used Pandas many times. And for good reason; it’s intuitive, flexible, and great for day-to-day analysis. But as your datasets start to grow, Pandas starts to show its limits. Maybe it’s memory issues, sluggish performance, or the fact that your machine sounds like it’s about to lift off when you try to group by a few million rows.

That’s the point where a lot of data analysts and scientists start asking the same question: what else is out there?

Read on for seven options, including six libraries and one built-in programming technique.
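The post doesn't say up front which built-in technique made the list, but chunked reading is a common candidate: pandas itself can process a large file a slice at a time rather than loading the whole thing. A sketch on a hypothetical sales.csv:

```python
import pandas as pd

# Stream a large CSV in fixed-size chunks to keep memory usage flat.
# "sales.csv" and the "amount" column are hypothetical.
total = 0.0
for chunk in pd.read_csv("sales.csv", chunksize=100_000):
    total += chunk["amount"].sum()

print(f"Grand total: {total:,.2f}")
```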


Using Python in R in Excel

Adam Gladstone wraps up a series on an R add-in for Excel:

In the last post in this series I am going to look at calling Python from R. Even though Excel now provides a means of calling Python scripts directly, using the =PY() formula in a worksheet, there are still occasions when it is beneficial to call Python via R. For example, it turns out that importing yfinance produces a ‘module not found’ error using Excel’s function. According to the documentation, yfinance is not one of the open source libraries that the Excel Python secure distribution supports. To get around this issue, we can use the R package Reticulate. This lets us load and run Python scripts from R. As we have seen in the previous parts of this series, the ExcelRAddIn allows us to run R scripts from an Excel worksheet. And putting these two together is quite simple.

I’m glad Adam mentioned this, because my first question was going to be: why use this when Excel has Python capabilities built in? The yfinance limitation is a reasonable answer.
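For context, the Python side that reticulate loads is ordinary yfinance code, something like this sketch (the ticker and period are arbitrary):

```python
import yfinance as yf

def get_history(ticker: str, period: str = "1mo"):
    """Return recent price history for a ticker as a pandas DataFrame."""
    return yf.Ticker(ticker).history(period=period)

if __name__ == "__main__":
    print(get_history("MSFT").head())
```

On the R side, reticulate's source_python() exposes get_history() as a callable R function, which an R script run through the ExcelRAddIn can then invoke.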


Generating Synthetic Data in Python

Ivan Palomares Carrascosa makes some data:

This article introduces the Faker library for generating synthetic datasets. Through a gentle hands-on tutorial, we will explore how to generate single records or data instances, full datasets in one go, and export them into different formats. The code walkthrough adopts a twofold perspective:

  1. Learning: We will gain a basic understanding of several data types that can be generated and how to get them ready for further processing, aided by popular data-intensive libraries like Pandas.
  2. Testing: With some generated data at hand, we will provide some hints on how to test data issues in the context of a simplified ETL (Extract, Transform, Load) pipeline that ingests synthetically generated transactional data.

Click through for the article. I’m not intimately familiar with Faker, so I’m not sure how easy it is to change dataset distributions. That’s one of the challenges I tend to have with automated data generators: generating a simulated dataset is fine if you just need X number of rows, but if the distribution of the synthetic data in development is nowhere near the real data’s distribution in production, you may get a false sense of security in things like report response times.
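On the distribution point, one workaround is to let Faker handle the descriptive fields and draw the numeric columns from a distribution you control. A sketch with a made-up schema (not from the article):

```python
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
rng = np.random.default_rng(seed=42)
n = 1_000

# Faker generates plausible descriptive fields; NumPy draws the amounts
# from a skewed lognormal so the shape better resembles production data.
df = pd.DataFrame({
    "customer": [fake.name() for _ in range(n)],
    "city": [fake.city() for _ in range(n)],
    "order_date": [fake.date_between(start_date="-1y") for _ in range(n)],
    "amount": rng.lognormal(mean=3.5, sigma=1.0, size=n).round(2),
})
print(df.describe())
```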


Incremental Data Load into Parquet Files from Python

Lee Asher loads some data:

Parquet is a column-oriented open-source storage format increasingly used for “big data” analytics. Yet despite its growing popularity as a native format for data lakes and data warehouses, tools for maintaining these environments remain scarce. Getting data from a SQL environment into Parquet isn’t difficult – but how do we maintain that data over time, keeping it current? In other words, if we already have an existing Parquet file, how can we efficiently append new data to it?

In this article, we’ll introduce the Parquet format, explain some strategies for incrementally updating a Parquet repository, and, with a simple Python script, implement a nightly-feed update process.

Not listed here is one word that I expected: Delta, because that’s how we normally handle incremental data modification in Parquet data. Either that or Apache Iceberg. Lee shows us a different route that can work.
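For flavor, the append pattern without Delta or Iceberg generally works around the fact that an individual Parquet file is immutable: you write each new batch as another file in a dataset directory and read the directory back as one table. A pyarrow sketch (paths and rows are placeholders; Lee's script may differ):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Tonight's new rows (placeholder data)
new_rows = pd.DataFrame({"id": [101, 102], "amount": [19.99, 5.25]})

# Write the batch as a new file inside the dataset directory;
# existing files are left untouched.
pq.write_to_dataset(pa.Table.from_pandas(new_rows), root_path="sales_parquet")

# Reading the directory stitches all the files into one table
combined = pq.read_table("sales_parquet")
print(combined.num_rows)
```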


Data Cleansing Tips in Pandas

Jayita Gulati shares some tips:

Data preparation is one of the most time-consuming parts of any data science or analytics project, but it doesn’t have to be. With the proper techniques, Pandas can help you quickly transform messy and complex datasets into clean, ready-to-analyze formats. From handling missing data to reshaping and optimizing your DataFrames, a few tricks can save you hours of work.

In this article, you will discover seven practical Pandas tips that can speed up your data prep process and help you focus more on analysis and less on cleanup.

Two of the tips are basically “use functional programming techniques,” and I’m okay with that.
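Whether or not these are the exact tips Jayita covers, .pipe() and .assign() are the usual way to phrase cleanup as a chain of pure transformations. A sketch on invented data:

```python
import pandas as pd

raw = pd.DataFrame({
    "name": ["  Ada ", "Grace", None],
    "score": ["91", "88", "75"],
})

def drop_missing_names(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna(subset=["name"])

clean = (
    raw
    .pipe(drop_missing_names)                    # custom step as a function
    .assign(
        name=lambda d: d["name"].str.strip(),    # trim stray whitespace
        score=lambda d: d["score"].astype(int),  # fix the dtype
    )
)
print(clean)
```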


Decision Trees and Non-Tabular Data

Ivan Palomares Carrascosa explains that you can use more than standard structured data against decision trees:

Versatile, interpretable, and effective for a variety of use cases, decision trees have been among the most well-established machine learning techniques for decades. They remain widely used for classification and regression tasks, whether as standalone models or as components of more powerful ensemble methods like random forests and gradient boosting machines.

And there is one more attractive feature that pushes the boundaries of their versatility even further: they can accommodate data in diverse formats, beyond just fully structured, tabular data. This article examines this facet of decision trees from a balanced theoretical and practical approach.

Click through for an example.
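As a hedged illustration of the idea (the article's own example may differ): text isn't tabular, but once a vectorizer turns it into a numeric matrix, a plain decision tree trains on it happily.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Invented toy corpus: classify short messages as complaint or praise
docs = ["refund my order", "love this product", "item arrived broken",
        "great service", "package never came", "works perfectly"]
labels = ["complaint", "praise", "complaint",
          "praise", "complaint", "praise"]

# TF-IDF turns free text into a numeric matrix the tree can split on
model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(max_depth=3))
model.fit(docs, labels)
print(model.predict(["broken on arrival"]))
```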


From Pandas to Polars

Ivan Palomares Carrascosa provides an introduction to the polars library:

Polars is currently one of the fastest open-source libraries for data manipulation and processing on a single machine, featuring an intuitive and user-friendly API. Natively built in Rust, it is designed to optimize low memory consumption and speed while working with DataFrames.

This article takes a tour of the Polars library in Python and illustrates how it can be used, much like Pandas, to manipulate large datasets efficiently.

My experience with polars is that it’s not a 1:1 replacement for pandas, but the interfaces are similar enough that a lot of code can swap over without much effort. And yes, it’s typically faster.
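A small sketch of that resemblance: grouping and aggregation look familiar, though polars wants explicit column expressions. (Data invented; recent versions spell it group_by, older ones groupby.)

```python
import polars as pl

df = pl.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales": [100, 250, 175, 90],
})

# Expression-based aggregation, roughly pandas' groupby("region")["sales"].mean()
summary = (
    df.group_by("region")
      .agg(pl.col("sales").mean().alias("avg_sales"))
      .sort("region")
)
print(summary)
```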


Multithreading and Multiprocessing in Python

Jessica Wachtel explains how multithreading and multiprocessing work in Python:

Let’s use a simple example to understand them: a mechanics shop. Concurrency happens when one mechanic works on several cars by switching between them. For example, the mechanic changes the oil in one car while waiting for a part for another. They don’t finish one car before starting the next, but they can’t do two tasks at exactly the same time. The tasks overlap in time but don’t happen simultaneously.

Click through for the analogy, how it applies to Python, and tips and tricks around each.
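To put the usual rule of thumb in code: threads overlap I/O-bound waits (the mechanic waiting on a part), while separate processes sidestep the GIL for CPU-bound work. A sketch using concurrent.futures:

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def wait_for_part(car: str) -> str:
    """I/O-bound: this thread sleeps, so other threads run in the meantime."""
    time.sleep(1)
    return f"{car}: part arrived"

def rebuild_engine(n: int) -> int:
    """CPU-bound: pure computation, which benefits from separate processes."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ThreadPoolExecutor() as pool:      # threads: good for I/O-bound tasks
        print(list(pool.map(wait_for_part, ["car1", "car2", "car3"])))

    with ProcessPoolExecutor() as pool:     # processes: good for CPU-bound tasks
        print(list(pool.map(rebuild_engine, [1_000_000] * 3)))
```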
