Category: Python

Error Handling in PySpark Jobs

Ram Ghadiyaram adds some error handling logic:

In PySpark, processing massive datasets across distributed clusters is powerful but comes with challenges. A single bad record, missing file, or network glitch can crash an entire job, wasting compute resources and leaving you with long stack traces.

Spark’s lazy evaluation, where transformations don’t execute until an action is triggered, makes errors harder to catch early, and debugging them can feel very difficult.

Read on for five patterns that can help with error handling in PySpark.
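
As a rough illustration of why lazy evaluation delays errors, here is a plain-Python analogy rather than PySpark code. The built-in map() stands in for a transformation and sum() for an action. The safe_parse helper and the sample records are hypothetical, and the "keep the bad record aside" approach is a generic pattern, not necessarily one of the five in the article.

```python
def parse_amount(raw):
    """Convert a raw string to float; raises ValueError on a bad record."""
    return float(raw)

records = ["10.5", "3.2", "oops", "7.0"]  # made-up sample data

# Building the lazy pipeline succeeds even though a bad record is present:
# map() returns an iterator and does no work yet, much like a transformation.
lazy_amounts = map(parse_amount, records)

# The error surfaces only when the values are consumed, much like an action.
failure = None
try:
    total = sum(lazy_amounts)
except ValueError as exc:
    failure = str(exc)

def safe_parse(raw):
    """Return (value, None) on success or (None, raw) so bad input is kept, not fatal."""
    try:
        return float(raw), None
    except ValueError:
        return None, raw

# Route good and bad records separately instead of failing the whole run.
parsed = [safe_parse(r) for r in records]
good = [value for value, bad in parsed if bad is None]
rejected = [bad for value, bad in parsed if bad is not None]
```

The point of the second half is that a per-record try/except turns one fatal failure into a list of rejected inputs that can be inspected later.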

Choosing between Data Scalers in a Data Science Project

Bala Priya C performs a comparison:

In this article, you will learn how MinMaxScaler, StandardScaler, and RobustScaler transform skewed, outlier-heavy data, and how to pick the right one for your modeling pipeline.

Topics we will cover include:

  • How each scaler works and where it breaks on skewed or outlier-rich data
  • A realistic synthetic dataset to stress-test the scalers
  • A practical, code-ready heuristic for choosing a scaler

Read on to learn more about each of these three scaler types, the use cases that best fit each of them, and even a flow chart at the end.
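
To make the difference concrete, here is a minimal pure-Python sketch of the three scaling formulas. These hand-rolled functions are assumptions meant to mirror what the scikit-learn classes compute by default (min and max, mean and population standard deviation, median and interquartile range); they are not the article's code, and the sample data is made up.

```python
import statistics

def min_max_scale(values):
    """Map values to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Five ordinary points and one extreme outlier (hypothetical data).
data = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]

for name, scaler in [("min-max", min_max_scale),
                     ("standard", standard_scale),
                     ("robust", robust_scale)]:
    print(name, [round(v, 3) for v in scaler(data)])
```

With a single outlier, min-max scaling squeezes the five ordinary points into a sliver near zero, while the median and IQR used by the robust version are barely affected by it.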

Cross-Validation and Time Series Data

Vlad Johnson takes us through a technique to test time series results:

Time series modeling, compared to traditional nontemporal modeling, presents unique challenges in ensuring that models generalize well to future, unseen data. One key methodology to address these challenges is cross-validation.

Time series data inherently contains temporal dependencies — observations are ordered in time, and future values may depend on past trends. This structure makes it challenging to estimate how well a model will perform on new, unseen data.

Click through for an explanation of cross-validation, why this becomes challenging when you have time series data (or other serially correlated data), and tips to resolve this challenge.
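
One common way to respect temporal order is an expanding-window split, where each fold trains only on observations that come before its test block. The sketch below is a minimal illustration under that assumption; the function name and parameters are hypothetical, and the linked article may use a different scheme.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with training data always before test data."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train = list(range(0, test_start))                      # everything earlier
        test = list(range(test_start, test_start + test_size))  # the next block in time
        yield train, test

splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train, test in splits:
    print("train:", train, "test:", test)
```

Unlike shuffled k-fold, no fold here ever trains on data from after its test period, so the score is not inflated by the model seeing the future.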

Ingesting IoT Data into SQL Server via Python

Hristo Hristov builds an app:

MQTT is a lightweight Industrial IoT communications protocol allowing efficient communication to and from edge devices such as machines, sensors, and actuators. How can we get data from an on-premises or cloud MQTT broker and persist them in a SQL Server database? How can we leverage the newest features in SQL Server 2025 to make efficient query compilations and build a scalable solution for a data pipeline for permanently storing IoT data?

Read on for the code, most of which is in Python.

Comparing the ROC Curve to a Precision-Recall Curve

Ivan Palomares Carrascosa looks at two ways to plot classification model trade-offs:

When building machine learning models to classify imbalanced data (i.e., datasets where one class, such as spam email, is much less frequent than the other, such as non-spam email), certain traditional metrics like accuracy or even the ROC AUC (the area under the Receiver Operating Characteristic curve) may not reflect the model performance in realistic terms, giving overly optimistic estimates due to the dominance of the so-called negative class.

Precision-recall curves (or PR curves for short), on the other hand, are designed to focus specifically on the positive, typically rarer class, which makes them much more informative for datasets skewed by class imbalance.

Read on to see how these two curves can diverge and when you might trust one over the other. Ivan’s post does rely on the idea of the positive class being the smaller one and the dataset being markedly unbalanced.
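
A small worked example shows how the two views can disagree. The confusion-matrix counts below are invented for illustration (10 positives among 1,000 samples), and the code looks at a single operating point rather than a full curve.

```python
def rates(tp, fp, fn, tn):
    """Compute the quantities plotted on ROC and PR curves at one threshold."""
    return {
        "recall": tp / (tp + fn),               # y-axis of ROC, x-axis of PR
        "false_positive_rate": fp / (fp + tn),  # x-axis of ROC
        "precision": tp / (tp + fp),            # y-axis of PR
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical classifier on a heavily imbalanced dataset.
metrics = rates(tp=8, fp=40, fn=2, tn=950)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here recall is 0.8 and the false positive rate is about 0.04, which is a flattering point on a ROC plot, yet only one in six flagged items is truly positive. The large pool of negatives keeps the false positive rate small even when false positives outnumber true positives.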

Challenges of High-Dimensional Optimization

John Mount lays out a demonstration:

My experience is that common objective functions tend to be structured and full of coincidences and symmetries. And because they have these structures they are hard to optimize.

Let’s work up what I claim to be a fairly typical optimization problem that arises from planning or scheduling. I’ll call it the train arrival schedule problem.

Click through for the article, which includes demonstration code.

Modifying Power BI Page Visibility and Active Status via Semantic Link Labs

Meagan Longoria hides (or shows) a page:

Setting page visibility and the active page are often overlooked last steps when publishing a Power BI report. It’s easy to forget the active page since it’s just set to whatever page was open when you last saved the report. But we don’t have to settle for manually checking these things before we deploy to a new workspace (e.g., from dev to prod). If our report is in PBIR format, we can run Fabric notebooks to do this for us.

Click through for a notebook and an explanation.

An Introduction to Batch Normalization in Neural Networks

Ivan Palomares Carrascosa shows off one technique for optimizing neural networks:

Deep neural networks have drastically evolved over the years, overcoming common challenges that arise when training these complex models. This evolution has enabled them to solve increasingly difficult problems effectively.

One of the mechanisms that has proven especially influential in the advancement of neural network-based models is batch normalization. This article provides a gentle introduction to this strategy, which has become a standard in many modern architectures, helping to improve model performance by stabilizing training, speeding up convergence, and more.

Read on for a quick description of how it works and a demonstration in Keras.
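
For a sense of the core computation, here is a minimal pure-Python sketch of the training-time forward pass for one feature. It is not the Keras layer from the article: the gamma, beta, and eps defaults are assumptions, and the running averages a real layer keeps for inference are left out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # eps guards against division by zero when the batch variance is tiny.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]  # made-up activations for one feature
normalized = batch_norm(activations)
print([round(v, 3) for v in normalized])
```

In a real network, gamma and beta are learned per feature, so the layer can undo the normalization if that helps the model.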

Making XGBoost Run Faster

Ivan Palomares Carrascosa shares a few tips:

Extreme gradient boosting (XGBoost) is one of the most prominent machine learning techniques used not only for experimentation and analysis but also in deployed predictive solutions in industry. An XGBoost ensemble combines multiple models to address a predictive task like classification, regression, or forecasting. It trains a set of decision trees sequentially, gradually improving the quality of predictions by correcting the errors made by previous trees in the pipeline.

In a recent article, we explored why interpreting predictions made by XGBoost models matters and how to do it (note we use the term ‘model’ here for simplicity, even though XGBoost is an ensemble of models). This article takes another practical dive into XGBoost, this time by illustrating three strategies to speed up and improve its performance.

Read on for two tips to reduce operational load and one to offload it to faster hardware (when possible).

An Introduction to Bayesian Regression

Ivan Palomares Carrascosa covers the concept of Bayesian regression:

In this article, you will learn:

  • The fundamental difference between traditional regression, which uses single fixed values for its parameters, and Bayesian regression, which models them as probability distributions.
  • How this probabilistic approach allows the model to produce a full distribution of possible outcomes, thereby quantifying the uncertainty in its predictions.
  • How to implement a simple Bayesian regression model in Python with scikit-learn.

My understanding is that both Bayesian and traditional regression techniques get you to (roughly) the same place, but the Bayesian approach makes it harder to forget that the regression line you draw doesn’t actually exist and everything has uncertainty.
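
A tiny example illustrates the point. This is a sketch of the simplest textbook case, not the scikit-learn model from the article: a line through the origin, Gaussian noise with known variance, and a zero-mean Gaussian prior on the slope. The function name, the prior and noise settings, and the data are all assumptions.

```python
def posterior_slope(xs, ys, noise_var=1.0, prior_var=10.0):
    """Posterior mean and variance of w in y = w * x + noise, with prior w ~ N(0, prior_var)."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision
    return mean, 1.0 / precision

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # made-up observations, roughly y = 2x

few_mean, few_var = posterior_slope(xs[:2], ys[:2])  # only two observations
all_mean, all_var = posterior_slope(xs, ys)          # all four observations

# Ordinary least squares through the origin, for comparison.
ols_slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(f"2 points: slope {few_mean:.3f}, variance {few_var:.3f}")
print(f"4 points: slope {all_mean:.3f}, variance {all_var:.3f}")
print(f"OLS slope: {ols_slope:.3f}")
```

The posterior mean lands very close to the least-squares slope, but it comes with a variance that shrinks as more data arrives, which is the uncertainty a single fitted line hides.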
