Press "Enter" to skip to content

Category: Python

Parquet Files in Pandas

Chris LaGreca works with Parquet files:

Apache Parquet has become one of the defacto standards in modern data architecture. This open source, columnar data format serves as the backbone of many high-powered analytics and machine learning pipelines, supported by many of the worlds most sophisticated platforms and services. AWS, Azure, and Google Cloud all offer built-in support for Parquet while big data tools like Hadoop, Spark, Hive, and Databricks natively support Parquet, allowing seamless data processing and analytics. Parquet is also foundational in data lakehouse formats like Delta Lake, Iceberg, and Hudi, where its features are further enhanced.

Parquet is efficient and has broad industry support. In this post, I will showcase a few simple techniques to demonstrate working with Parquet and leveraging its special features using Pandas.

Pandas does make this rather easy, as Chris shows.

Comments closed

Parallel Download in Oracle Object Storage

Brendan Tierney continues a series on Oracle Object Storage:

In previous posts, I’ve given example Python code (and functions) for processing files into and out of OCI Object and Bucket Storage. One of these previous posts includes code and a demonstration of uploading files to an OCI Bucket using the multiprocessing package in Python.

Building upon these previous examples, the code below will download a Bucket using parallel processing. Like my last example, this code is based on the example code I gave in an earlier post on functions within a Jupyter Notebook.

Click through for the code.

Comments closed

Tips for Choosing a Classifier

I’ve wrapped up yet another series:

In this video, I wrap up the series on classification and provide some quick-and-dirty tips on when to use each of the classification algorithms we have discussed.

This was a series I really enjoyed. I’ve had a talk on the topic for a few years, but getting the opportunity to dig in deeper and spend a few hours on the topic was nice. It also helped me fill in some gaps in my understanding and fix a few long-standing bugs in my demo code, so it’s got that going for it as well.

Comments closed

Suspend and Resume Microsoft Fabric Capacity

Olivier Van Steenlandt saves some cash:

With only a limited budget for exploring and testing new tools, I had to figure out how to use my budget efficiently. Therefore, before making any decisions, I looked at the Microsoft Fabric pricing and possibilities.

If you want to take a look at the Microsoft Fabric pricing models, you can find an overview via the following link: Microsoft Fabric – Pricing | Microsoft Azure

To avoid any surprises and to be as cost-effective as possible, I created an easy Python script that I can use to pause and start my Microsoft Fabric capacity, or better said resume and suspend.

I highly recommend this for any organization that does not need 24/7 uptime for Fabric capacity. If you run your system 12 hours a day instead of 24, it takes your F64 capacity from $8k a month to $4k.

Comments closed

SHAP and Additive Models

Michael Mayer answers a pair of related questions:

Within only a few years, SHAP (Shapley additive explanations) has emerged as the number 1 way to investigate black-box models. The basic idea is to decompose model predictions into additive contributions of the features in a fair way. Studying decompositions of many predictions allows to derive global properties of the model.

What happens if we apply SHAP algorithms to additive models? Why would this ever make sense?

Read on for the answers to these two questions.

Comments closed

Random Date Generation in Python

Chris LaGreca spits out some dates:

I often work with time series data and find it useful to have a variety of ways to randomly generate dates. This particular example is great for evenly distributed date partitions. Running the script below with the default arguments will output a list of random dates, one for each month of the year.

It looks like this is generating based off of a uniform distribution, which probably makes the most sense for “give me a day of the month” data generation.

Comments closed

New Video: Multi-Class Classification

I have a new video:

In this video, I get past two-class classification and explain how things differ in the multi-class world.

What’s really interesting is that, in many cases, when it comes to code, the answer is “not much.” That’s because libraries like scikit-learn do a lot to smooth over differences between single-class and multi-class classification. But there are still differences that can bite you if you don’t understand how the cases differ.

Comments closed