Category: Python

Using Python in R in Excel

Adam Gladstone wraps up a series on an R add-in for Excel:

In the last post in this series I am going to look at calling Python from R. Even though Excel now provides a means of calling Python scripts directly, using the =PY() formula in a worksheet, there are still occasions when it is beneficial to call Python via R. For example, it turns out that importing yfinance produces a ‘module not found’ error using Excel’s function. According to the documentation, yfinance is not one of the open source libraries that the Excel Python secure distribution supports. To get around this issue, we can use the R package Reticulate. This lets us load and run Python scripts from R. As we have seen in the previous parts of this series, the ExcelRAddIn allows us to run R scripts from an Excel worksheet. And putting these two together is quite simple.

I’m glad Adam mentioned this because my first question was going to be, why use this when Excel has Python capabilities built in? And that’s a reasonable answer.

Generating Synthetic Data in Python

Ivan Palomares Carrascosa makes some data:

This article introduces the Faker library for generating synthetic datasets. Through a gentle hands-on tutorial, we will explore how to generate single records or data instances, full datasets in one go, and export them into different formats. The code walkthrough adopts a twofold perspective:

  1. Learning: We will gain a basic understanding of several data types that can be generated and how to get them ready for further processing, aided by popular data-intensive libraries like Pandas.
  2. Testing: With some generated data at hand, we will provide some hints on how to test data issues in the context of a simplified ETL (Extract, Transform, Load) pipeline that ingests synthetically generated transactional data.

Click through for the article. I’m not intimately familiar with Faker, so I’m not sure how easy it is to change dataset distributions. That’s one of the challenges I tend to have with automated data generators: generating a simulated dataset is fine if you just need X number of rows, but if the distribution of synthetic data in development is nowhere near what the real data’s distribution is in production, you may get a false sense of security in things like report response times.
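
To make the distribution concern concrete, here is a minimal sketch that uses only Python's standard library rather than Faker itself. The field names, the log-normal parameters, and the `make_transactions` and `summarize` helpers are all assumptions for illustration. The point is that two datasets with the same row count can have very different shapes.

```python
import random
import statistics


def make_transactions(n, skewed, seed=42):
    """Generate n hypothetical transaction rows (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        if skewed:
            # Log-normal: many small amounts plus a long tail of large ones.
            amount = rng.lognormvariate(3.0, 1.0)
        else:
            # Uniform: every amount in the range is equally likely.
            amount = rng.uniform(1.0, 100.0)
        rows.append({"id": i, "amount": round(amount, 2)})
    return rows


def summarize(rows):
    """Return a few summary statistics for the amount column."""
    amounts = [r["amount"] for r in rows]
    return {
        "median": statistics.median(amounts),
        "mean": statistics.mean(amounts),
        "max": max(amounts),
    }


uniform_rows = make_transactions(10_000, skewed=False)
skewed_rows = make_transactions(10_000, skewed=True)

print("uniform:", summarize(uniform_rows))
print("skewed: ", summarize(skewed_rows))
```

If production data looks like the skewed case and development data looks like the uniform case, a report that behaves well in development may still struggle with the long tail in production.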

Incremental Data Load into Parquet Files from Python

Lee Asher loads some data:

Parquet is a column-oriented open-source storage format increasingly used for “big data” analytics. Yet despite its growing popularity as a native format for data lakes and data warehouses, tools for maintaining these environments remain scarce. Getting data from a SQL environment into Parquet isn’t difficult – but how do we maintain that data over time, keeping it current? In other words, if we already have an existing Parquet file, how can we efficiently append new data to it?

In this article, we’ll introduce the Parquet format, explain some strategies for incrementally updating a Parquet repository, and, with a simple Python script, implement a nightly-feed update process.

Not mentioned here is one term I expected: Delta Lake, which is how we normally handle incremental data modification on Parquet data (either that or Apache Iceberg). Lee shows us a different route that can work.

Data Cleansing Tips in Pandas

Jayita Gulati shares some tips:

Data preparation is one of the most time-consuming parts of any data science or analytics project, but it doesn’t have to be. With the proper techniques, Pandas can help you quickly transform messy and complex datasets into clean, ready-to-analyze formats. From handling missing data to reshaping and optimizing your DataFrames, a few tricks can save you hours of work.

In this article, you will discover seven practical Pandas tips that can speed up your data prep process and help you focus more on analysis and less on cleanup.

Two of the tips are basically “use functional programming techniques,” and I’m okay with that.

Decision Trees and Non-Tabular Data

Ivan Palomares Carrascosa explains that decision trees can work with more than standard structured data:

Versatile, interpretable, and effective for a variety of use cases, decision trees have been among the most well-established machine learning techniques for decades, used for classification and regression tasks. They remain widely used today, whether as standalone models or as components of more powerful ensemble methods like random forests and gradient boosting machines.

And there is one more attractive feature that pushes the boundaries of their versatility even further: they can accommodate data in diverse formats, beyond just fully structured, tabular data. This article examines this facet of decision trees from a balanced theoretical and practical approach.

Click through for an example.

From Pandas to Polars

Ivan Palomares Carrascosa provides an introduction to the polars library:

Polars is currently one of the fastest open-source libraries for data manipulation and processing on a single machine, featuring an intuitive and user-friendly API. Natively built in Rust, it is designed to optimize low memory consumption and speed while working with DataFrames.

This article takes a tour of the Polars library in Python and illustrates how it can be used much like Pandas to efficiently manipulate large datasets.

My experience with polars is that it’s not a 1:1 replacement for pandas, but the interfaces are similar enough that a lot of code can swap over without much effort. And yes, it’s typically faster.

Multithreading and Multiprocessing in Python

Jessica Wachtel explains how multithreading and multiprocessing work in Python:

Let’s use a simple example to understand them: a mechanics shop. Concurrency happens when one mechanic works on several cars by switching between them. For example, the mechanic changes the oil in one car while waiting for a part for another. They don’t finish one car before starting the next, but they can’t do two tasks at exactly the same time. The tasks overlap in time but don’t happen simultaneously.

Click through for the analogy, how it applies to Python, and tips and tricks around each.
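
As a rough illustration of the one-mechanic, several-cars idea, the sketch below uses `ThreadPoolExecutor` from Python's standard `concurrent.futures` module. The `service_car` function and the sleep standing in for "waiting for a part" are my own assumptions, not code from the article.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def service_car(car):
    """Hypothetical job: mostly waiting (simulated with sleep), little work."""
    time.sleep(0.2)  # stand-in for waiting on a part to arrive
    return f"{car} done"


cars = ["sedan", "truck", "coupe", "van"]

# One after another: total time is roughly the sum of the waits.
start = time.perf_counter()
sequential = [service_car(c) for c in cars]
sequential_secs = time.perf_counter() - start

# Overlapping the waits: while one thread sleeps, another can proceed.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(service_car, cars))  # map preserves input order
threaded_secs = time.perf_counter() - start

print(f"sequential: {sequential_secs:.2f}s, threaded: {threaded_secs:.2f}s")
```

Threads suit this kind of waiting-heavy work. For CPU-bound work, `ProcessPoolExecutor` from the same module offers the same `map` interface while running tasks in separate processes.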

Loading Data into Snowflake via Python

Anil Kumar Moka does a bit of data loading:

In our ongoing exploration of Snowflake data loading strategies, we’ve previously examined how to use pandas with SQLAlchemy to efficiently move data into Snowflake tables. That approach leverages pandas’ intuitive DataFrame handling and works well for many common scenarios where you’re already manipulating data in Python before loading it to Snowflake.

In this article, we’re diving deeper into the Snowflake toolbox by exploring the native Snowflake Connector for Python. While pandas offers simplicity and familiarity, the native connector provides a different set of capabilities focused on precision control and Snowflake-specific optimizations. This article explains when and how to use this more direct approach for everything from small CSV files to massive datasets that would overwhelm pandas.

Click through for the full article.

Handling Imbalanced Data in Python

Ivan Palomares Carrascosa gives three ways to deal with imbalanced data:

Here’s the catch: imbalanced data usually makes analysis harder, especially for machine learning models, which can easily become biased toward the majority class when the class distribution is remarkably unequal. In the most extreme case, the model ends up as an almost “dummy classifier” that assigns the same class to virtually everything.

This article shows several strategies to navigate and handle imbalanced datasets using two of Python’s most stellar libraries for “all things data”: Pandas and Scikit-learn.

Click through for those ways, including sample code.
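
To see why a majority-class "dummy classifier" is a problem, and what one simple remedy looks like, here is a sketch in plain Python. It deliberately avoids Pandas and Scikit-learn so that it stays self-contained. The 95/5 split, the row layout, and the `oversample` helper are assumptions for illustration, and random oversampling is just one common strategy, not necessarily one of the three in the article.

```python
import random
from collections import Counter

# Hypothetical rows of (feature, label): 95 negatives and 5 positives.
rows = [(x, 0) for x in range(95)] + [(x, 1) for x in range(95, 100)]
labels = [label for _, label in rows]

# A "dummy" model that always predicts the majority class.
majority = Counter(labels).most_common(1)[0][0]
predictions = [majority] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_pos = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_pos / labels.count(1)
print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
# prints: accuracy=0.95, minority recall=0.00


def oversample(rows, seed=0):
    """Randomly duplicate minority-class rows until all classes match in size."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in rows)
    target = max(counts.values())
    balanced = list(rows)
    for cls, count in counts.items():
        members = [r for r in rows if r[1] == cls]
        # Draw with replacement; k is 0 for the majority class.
        balanced.extend(rng.choices(members, k=target - count))
    return balanced


balanced = oversample(rows)
print(Counter(label for _, label in balanced))
```

The dummy model scores 95% accuracy while never finding a single positive case, which is why accuracy alone is misleading on imbalanced data. After oversampling, both classes have 95 rows, so a model trained on the result can no longer win just by guessing the majority class.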
