Category: Data Science

Difficulties around A/B Testing

John Cook asks which is clearer, 1 or 2? 3 or 4? 4 or 6?

One problem with A/B testing is that your results may depend on the order of your tests.

Suppose you’re testing three options: X, Y, and Z. Let’s say you have three market segments, equal in size, each with the following preferences.

This is known as the Condorcet paradox of voting.
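
The preference table from Cook's post doesn't come through here, but the classic cyclic setup behind the paradox is easy to sketch; the rankings below are the textbook example, not necessarily the numbers from his post.

```python
from itertools import combinations

# Hypothetical illustration of the Condorcet paradox: three equal-sized
# segments, each ranking options X, Y, Z differently (best first).
segments = [
    ["X", "Y", "Z"],
    ["Y", "Z", "X"],
    ["Z", "X", "Y"],
]

def pairwise_winner(a, b):
    # Count how many segments rank option a above option b.
    a_wins = sum(seg.index(a) < seg.index(b) for seg in segments)
    return a if a_wins > len(segments) / 2 else b

for a, b in combinations("XYZ", 2):
    print(f"{a} vs {b}: {pairwise_winner(a, b)} wins")
# X beats Y, Y beats Z, yet Z beats X -- a cycle, so whichever option
# you happen to test last tends to come out on top.
```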

John also introduces the problem of interaction effects:

Suppose you’re debating between putting a photo of a car or a truck on your web site, and you’re debating between whether the vehicle should be red or blue. You decide to use A/B testing, so you test whether customers prefer a red truck or a blue truck. They prefer the blue truck. Then you test whether customers prefer a blue truck or a blue car. They prefer the blue truck.

Maybe customers would prefer a red car best of all, but you didn’t test that option. By testing vehicle type and color separately, you didn’t learn about the interaction of vehicle type and color. 
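
To make the interaction effect concrete, here is a small hypothetical scoring of all four combinations (my numbers, purely illustrative): two sequential pairwise tests settle on the blue truck even though the untested red car scores best.

```python
# Hypothetical preference scores for each (color, vehicle) combination.
# Sequential A/B tests only ever compare two of these at a time.
preference = {
    ("red", "truck"): 40,
    ("blue", "truck"): 55,
    ("blue", "car"): 50,
    ("red", "car"): 70,  # best overall, but never tested below
}

# Test 1: red truck vs. blue truck -> blue truck wins.
test1 = max([("red", "truck"), ("blue", "truck")], key=preference.get)
# Test 2: the winner vs. blue car -> blue truck wins again.
test2 = max([test1, ("blue", "car")], key=preference.get)

print("Sequential A/B winner:", test2)                                # ('blue', 'truck')
print("True best option:     ", max(preference, key=preference.get))  # ('red', 'car')
```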

Click through for both posts as well as some good insights.

Appending Rows to a Pandas DataFrame

Matt Eland acquires some rows that fell off a truck:

Recently I was working on comparing the performance of different machine learning models and I wanted to add entries to a Pandas DataFrame as I evaluated each model. What I found was that adding new rows to a Pandas DataFrame was a little harder than I suspected and required some mild searching, so I wanted to preserve the two solutions I found here in case it helps someone else.

Read on for those two solutions, though as Matt points out, only one of them is a good solution.
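
For reference, here is a general sketch of the two usual approaches (not necessarily Matt's exact code): DataFrame.append was deprecated and then removed in pandas 2.0, so you either concatenate a one-row frame or, usually better, collect rows in a list and build the DataFrame once.

```python
import pandas as pd

# Hypothetical model-evaluation results collected one model at a time.
results = pd.DataFrame([{"model": "baseline", "accuracy": 0.80}])

# Option 1: concatenate a one-row DataFrame (DataFrame.append is gone in pandas 2.x).
new_row = pd.DataFrame([{"model": "random_forest", "accuracy": 0.91}])
results = pd.concat([results, new_row], ignore_index=True)

# Option 2 (usually better): accumulate plain dicts, then build a frame once
# and concatenate a single time instead of growing the DataFrame row by row.
rows = [{"model": name, "accuracy": score}
        for name, score in [("logistic_regression", 0.87), ("xgboost", 0.93)]]
results = pd.concat([results, pd.DataFrame(rows)], ignore_index=True)

print(results)
```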

Bitemporal Modeling and Running Totals

John Mount solves a running total problem in Python:

An example of this is wanting to know how many reservations for a San Francisco Symphony concert scheduled for December 4th 2022 are known to have been made by October 22nd 2022. This could be used as part of an attendance demand model that is evaluated on October 22nd 2022. The “fifty-cent word” for this is “bitemporal” modeling or data.

As I read through the solution, my initial thought is that, if the data is in a relational database, a running total operation SUM(reservation_count) OVER (PARTITION BY target_date ORDER BY action_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) would form the basis of a solution. Still, this is an interesting exercise in translating a SQL operation into equivalent Python, and a reminder of just how much we get to take for granted.
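
As a rough pandas equivalent of that window function (my sketch, not John's solution), you can sort by the ordering column and take a cumulative sum within each partition:

```python
import pandas as pd

# Hypothetical reservation counts: action_date is when reservations were made,
# target_date is the concert date they are for.
df = pd.DataFrame({
    "target_date": ["2022-12-04"] * 3 + ["2022-12-11"] * 2,
    "action_date": ["2022-10-01", "2022-10-15", "2022-10-22",
                    "2022-10-10", "2022-10-20"],
    "reservation_count": [10, 25, 5, 8, 12],
})

# Equivalent of SUM(...) OVER (PARTITION BY target_date ORDER BY action_date
# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
df = df.sort_values(["target_date", "action_date"])
df["running_total"] = df.groupby("target_date")["reservation_count"].cumsum()

print(df)
```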

Kernel SHAP in R and Python

Michael Mayer and Christian Lorentzen team up:

SHAP is one of the most used model interpretation techniques in Machine Learning. It decomposes predictions into additive contributions of the features in a fair way. For tree-based methods, the fast TreeSHAP algorithm exists. For general models, one has to resort to computationally expensive Monte-Carlo sampling or the faster Kernel SHAP algorithm. Kernel SHAP uses a regression trick to get the SHAP values of an observation with a comparably small number of calls to the predict function of the model. Still, it is much slower than TreeSHAP.

Read on to see how to do this in both R and Python. With libraries the way they are, the code is very similar and the results are basically the same.
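
As a hedged sketch of the Python side (using the shap package's KernelExplainer; the code in the post may differ), the key inputs are the model's predict function and a small background sample:

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Fit any black-box model; Kernel SHAP only needs its predict function.
X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Kernel SHAP integrates out "missing" features over a background sample;
# keep it small, since the explainer calls predict for many feature coalitions.
background = X[:50]
explainer = shap.KernelExplainer(model.predict, background)

# Explain a few observations; nsamples trades accuracy against predict calls.
shap_values = explainer.shap_values(X[:5], nsamples=200)
print(shap_values.shape)  # (5, n_features): one additive contribution per feature
```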

Finding Near-Duplicates in a Corpus

Estelle Wang de-dupes text data:

Building a large high-quality corpus for Natural Language Processing (NLP) is not for the faint of heart. Text data can be large, cumbersome, and unwieldy, and unlike clean numbers or categorical data in rows and columns, discerning differences between documents can be challenging. In organizations where documents are shared, modified, and shared again before being saved in an archive, the problem of duplication can become overwhelming.

To find exact duplicates, matching all string pairs is the simplest approach, but it is not a very efficient or sufficient technique. Using the MD5 or SHA-1 hash algorithms can get us a correct outcome faster, yet near-duplicates would still not be on the radar. Text similarity is useful for finding files that look alike. There are various approaches to this, and each of them has its own way to define documents that are considered duplicates. Furthermore, the definition of duplicate documents has implications for the type of processing and the results produced. Below are some of the options.

Click through for solutions in SAS.
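
Estelle's solutions are in SAS; purely as a rough Python sketch of the two ideas above (hashing for exact duplicates, similarity for near-duplicates), with my own toy documents:

```python
import hashlib

docs = {
    "a.txt": "The quarterly report was finalized on Monday.",
    "b.txt": "The quarterly report was finalized on monday.",  # near-duplicate
    "c.txt": "The quarterly report was finalized on Monday.",  # exact duplicate of a.txt
}

# Exact duplicates: identical MD5 digests mean identical byte content.
digests = {name: hashlib.md5(text.encode()).hexdigest() for name, text in docs.items()}

def shingles(text, k=5):
    # Character k-grams of the lowercased text.
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    # Similarity of two shingle sets: |intersection| / |union|.
    return len(a & b) / len(a | b)

names = list(docs)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        x, y = names[i], names[j]
        sim = jaccard(shingles(docs[x]), shingles(docs[y]))
        dup = "exact" if digests[x] == digests[y] else ("near" if sim > 0.8 else "distinct")
        print(f"{x} vs {y}: jaccard={sim:.2f} ({dup})")
```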

The Basics of Automating Data Cleaning

Vincent Granville provides some guidance:

To the junior data scientist, it looks like each new dataset comes with a new set of challenges. It seems that you cannot automate data cleaning. To decision makers and stakeholders, this problem is so remote that they don’t even know the amount of resources wasted on it. To them, it seems obvious that automation is the way to go, but they may underestimate the challenges. It is usually not a high priority in many organizations, despite how much money it costs.

Yet, there are at most a few dozen issues that come with data cleaning. Not a few thousand, not a few hundred. You can catalog them and address all of them at once with a piece of code, one that you can reuse each time you face a new data set. I describe here the main issues and how to address them. Automating the data cleaning step can save you a lot of time and eliminate boring, repetitive tasks, making your data scientists happier.

Click through for Vincent’s thoughts and recommendations.
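
Vincent's catalog of issues is in the post itself; just to illustrate the "one reusable piece of code" idea, here is a generic pandas cleaning pass with checks of my own choosing, not his list:

```python
import numpy as np
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """A generic first-pass cleaner: trim strings, unify missing values,
    coerce numeric-looking text columns, and drop exact duplicates."""
    out = df.copy()

    # Normalize string columns: strip whitespace, treat placeholders as missing.
    for col in out.select_dtypes(include="object"):
        out[col] = out[col].str.strip()
        out[col] = out[col].replace({"": np.nan, "N/A": np.nan, "null": np.nan})

    # Coerce columns whose non-missing values are all numeric-looking text.
    for col in out.select_dtypes(include="object"):
        converted = pd.to_numeric(out[col], errors="coerce")
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted

    return out.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({"amount": [" 10", "20", "N/A"], "city": [" Austin ", "Austin", ""]})
print(basic_clean(raw))
```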

Survival Analysis Model Explanations with survex

Mikolaj Spytek promotes an R package:

You can learn about it in this blog, but long story short, survival models (most often) predict a survival function. It tells us the probability of an event not happening up to a given time t. The output can also be a single value (e.g., a risk score), but these scores are always some aggregate of the survival function, and this naturally leads to a loss of the information included in the prediction.

The complexity of the output of survival models means that standard explanation methods cannot be applied directly.

Because of this, we (I and the team: Mateusz Krzyziński, Hubert Baniecki, and Przemyslaw Biecek) developed an R package — survex, which provides explanations for survival models. We hope this tool allows for more widespread usage of complex machine learning survival analysis models. Until now, simpler statistical models such as Cox Proportional Hazards were preferred due to their interpretability — vital in areas such as medicine, even though they were frequently outperformed by complex machine learning models.

Read on to dive into the topic. H/T R-Bloggers.
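
survex itself is an R package; only to illustrate what "predicting a survival function" looks like in code (this sketch uses Python's lifelines, not survex), a Cox model returns a curve of survival probabilities per observation, while a risk score collapses that curve to one number:

```python
# Illustration only: a predicted survival function is the probability of the
# event not having happened up to each time t, one curve per observation.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()  # recidivism data: 'week' = duration, 'arrest' = event indicator
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")

# Rows are time points, columns are the explained subjects.
surv = cph.predict_survival_function(df.iloc[:3])
print(surv.head())

# A single risk score aggregates that whole curve into one number, which is
# where the loss of information mentioned above comes from.
print(cph.predict_partial_hazard(df.iloc[:3]))
```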

Network Analysis in R via netUtils

David Schoch has an R package for us:

During the last 5 years, I have accumulated various scripts with (personal) convenience functions for network analysis, and I also implemented new methods from time to time which I could not find in any other package in R. The package netUtils gathers all these functions and makes them available for anyone who may also need to apply “non-standard” network analytic tools. In this post, I will briefly highlight some of the most prominent functions of the package. All available functions are listed in the README on GitHub.

Click through to see what’s available in the package. H/T R-Bloggers.

What’s in a Name?

Benjamin Smith analyzes a name change:

Recently, RStudio announced its name change to Posit. For many, this name change was accepted with open arms, but for some, not so. Being the statistician that I am, I decided to post a poll on LinkedIn to see the sentiment of my network. After running the poll for a week, the results were in:

Read on for the responses as well as an analysis using RStan.

Anomaly Detection over Delta Live Tables

Avinash Sooriyarachchi and Sathish Gangichetty show off an interesting scenario:

Anomaly detection poses several challenges. The first is the data science question of what an ‘anomaly’ looks like. Fortunately, machine learning has powerful tools to learn how to distinguish usual from anomalous patterns from data. In the case of anomaly detection, it is impossible to know what all anomalies look like, so it’s impossible to label a data set for training a machine learning model, even if resources for doing so are available. Thus, unsupervised learning has to be used to detect anomalies, where patterns are learned from unlabelled data.

Even with the perfect unsupervised machine learning model for anomaly detection figured out, in many ways, the real problems have only begun. What is the best way to put this model into production such that each observation is ingested, transformed and finally scored with the model, as soon as the data arrives from the source system? That too, in a near real-time manner or at short intervals, e.g. every 5-10 minutes? This involves building a sophisticated extract, load, and transform (ELT) pipeline and integrating it with an unsupervised machine learning model that can correctly identify anomalous records. Also, this end-to-end pipeline has to be production-grade, always running while ensuring data quality from ingestion to model inference, and the underlying infrastructure has to be maintained.

Click through to see their solution using Databricks and Delta Lake.
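
Their pipeline runs on Databricks Delta Live Tables; as a stripped-down sketch of just the unsupervised scoring step (a generic example with scikit-learn, not their code), an isolation forest fit on unlabeled history can score each incoming micro-batch:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Unlabeled historical data: mostly "normal" observations around one cluster.
history = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))

# Fit an unsupervised model; contamination is a guess at the anomaly rate.
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
model.fit(history)

# A new micro-batch arrives (e.g., every few minutes); score it with the model.
batch = np.vstack([rng.normal(size=(5, 3)), [[8.0, -7.5, 9.0]]])  # last row is far off
scores = model.decision_function(batch)   # lower = more anomalous
labels = model.predict(batch)             # -1 = anomaly, 1 = normal

for row, score, label in zip(batch, scores, labels):
    flag = "ANOMALY" if label == -1 else "ok"
    print(f"{np.round(row, 2)} score={score:+.3f} {flag}")
```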
