Category: Data Science

Bitemporal Modeling and Running Totals

Published 2022-10-24 by Kevin Feasel

John Mount solves a running total problem in Python:

An example of this is wanting to know any many reservations for a San Francisco Symphony concert scheduled for December 4th 2022 are known to have been made by October 22nd 2022. This could be used as part of an attendance demand model that is evaluated on October 22nd 2022. The “fifty-cent word” for this is “bitemporal” modeling or data.

As I read through the solution, my initial thought is that, if the data is in a relational database, a running total operation SUM(reservation_count) OVER (PARTITION BY target_date ORDER BY action_date ROWS BETWEEN UNBOUNDED PRECEDING TO CURRENT ROW) would form the basis of a solution. Still, this is an interesting exercise in translating a SQL operation into equivalent Python and just how much we get to take for granted.

Comments closed

Kernel SHAP in R and Python

Published 2022-10-06 by Kevin Feasel

Michael Mayer and Christian Lorentzen team up:

SHAP is one of the most used model interpretation technique in Machine Learning. It decomposes predictions into additive contributions of the features in a fair way. For tree-based methods, the fast TreeSHAP algorithm exists. For general models, one has to resort to computationally expensive Monte-Carlo sampling or the faster Kernel SHAP algorithm. Kernel SHAP uses a regression trick to get the SHAP values of an observation with a comparably small number of calls to the predict function of the model. Still, it is much slower than TreeSHAP.

Read on to see how to do this in both R and Python. With libraries the way they are, the code is very similar and the results are basically the same.

Comments closed

Finding Near-Duplicates in a Corpus

Published 2022-10-04 by Kevin Feasel

Estelle Wang de-dupes text data:

Building a large high-quality corpus for Natural Language Processing (NLP) is not for the faint of heart. Text data can be large, cumbersome, and unwieldy and unlike clean numbers or categorical data in rows and columns, discerning differences between documents can be challenging. In organizations where documents are shared, modified, and shared again before being saved in an archive, the problem of duplication can become overwhelming.
To find exact duplicates, matching all string pairs is the simplest approach, but it is not a very efficient or sufficient technique. Using the MD5 or SHA-1 hash algorithms can get us a correct outcome with a faster speed, yet near-duplicates would still not be on the radar. Text similarity is useful for finding files that look alike. There are various approaches to this and each of them has its own way to define documents that are considered duplicates. Furthermore, the definition of duplicate documents has implications for the type of processing and the results produced. Below are some of the options.

Click through for solutions in SAS.

Comments closed

The Basics of Automating Data Cleaning

Published 2022-10-03 by Kevin Feasel

Vincent Granville provides some guidance:

To the junior data scientist, it looks like each new dataset comes with a new set of challenges. It seems that you can not automate data cleaning. To decisions makers and stakeholders, this problem is so remote to them that they don’t even know the amount of resources wasted on this. To them, it seems obvious that automation is the way to go, but they may underestimate the challenges. It is usually not a high priority in many organizations, despite how much money it costs.
Yet, there are at most a few dozens of issues that come with data cleaning. Not a few thousands, not a few hundreds. You can catalog them and address all of them at once with a piece of code. One that you can reuse each time when you face a new data set. I describe here the main issues and how to address them. Automating the data cleaning step can save you a lot of time, and eliminate boring, repetitive tasks to make your data scientists happier.

Click through for Vincent’s thoughts and recommendations.

Comments closed

Survival Analysis Model Explanations with survex

Published 2022-09-22 by Kevin Feasel

Mikolaj Spytek promotes an R package:

You can learn about it in this blog, but long story short, survival models (most often) predict a survival function. It tells us what is the probability of an event not happening until a given time t. The output can also be a single value (e.g., risk score) but these scores are always some aggregates of the survival function and this naturally leads to a loss of information included in the prediction.
The complexity of the output of survival models means that standard explanation methods cannot be applied directly.
Because of this, we (I and the team: Mateusz Krzyziński, Hubert Baniecki, and Przemyslaw Biecek) developed an R package — survex, which provides explanations for survival models. We hope this tool allows for more widespread usage of complex machine learning survival analysis models. Until now, simpler statistical models such as Cox Proportional Hazards were preferred due to their interpretability — vital in areas such as medicine, even though they were frequently outperformed by complex machine learning models.

Read on to dive into the topic. H/T R-Bloggers.

Comments closed

Network Analysis in R via netUtils

Published 2022-08-31 by Kevin Feasel

David Schoch has an R package for us:

During the last 5 years, I have accumulated various scripts with (personal) convenience functions for network analysis and I also implemented new methods from time to time which I could not find in any other package in R. The package netUtils gathers all these functions and makes them available for anyone who may also needs to apply “non-standard” network analytic tools. In this post, I will briefly highlight some of the most prominent functions of the package. All available functions are listed in the README on github.

Click through to see what’s available in the package. H/T R-Bloggers.

Comments closed

What’s in a Name?

Published 2022-08-11 by Kevin Feasel

Benjamin Smith analyzes a name change:

Recently, RStudio announced its name change to Posit. For many this name change was accepted with open arms, but for some-not so. Being the statistician that I am I decided to post a poll on LinkedIn to see the sentiment of my network. After running the poll for a week the results were in:

Read on for the responses as well as an analysis using RSTAN.

Comments closed

Anomaly Detection over Delta Live Tables

Published 2022-08-10 by Kevin Feasel

Avinash Sooriyarachchi and Sathish Gangichetty show off an interesting scenario:

Anomaly detection poses several challenges. The first is the data science question of what an ‘anomaly’ looks like. Fortunately, machine learning has powerful tools to learn how to distinguish usual from anomalous patterns from data. In the case of anomaly detection, it is impossible to know what all anomalies look like, so it’s impossible to label a data set for training a machine learning model, even if resources for doing so are available. Thus, unsupervised learning has to be used to detect anomalies, where patterns are learned from unlabelled data.
Even with the perfect unsupervised machine learning model for anomaly detection figured out, in many ways, the real problems have only begun. What is the best way to put this model into production such that each observation is ingested, transformed and finally scored with the model, as soon as the data arrives from the source system? That too, in a near real-time manner or at short intervals, e.g. every 5-10 minutes? This involves building a sophisticated extract, load, and transform (ELT) pipeline and integrating it with an unsupervised machine learning model that can correctly identify anomalous records. Also, this end-to-end pipeline has to be production-grade, always running while ensuring data quality from ingestion to model inference, and the underlying infrastructure has to be maintained.

Click through to see their solution using Databricks and delta lake.

Comments closed

Understanding Decision Trees

Published 2022-08-09 by Kevin Feasel

Durgesh Gupta provides a primer on the humble decision tree:

A decision tree is a graphical representation of all possible solutions to a decision.
The objective of using a Decision Tree is to create a training model that can use to predict the class or value of the target variable by learning simple decision rules inferred from training data.
It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

The way I like to describe decision trees, especially to developers, is that a tree is a set of if-else statements which leads to a conclusion. The nice part about decision trees is that once you understand how they work, you’re halfway there to gradient boosting (e.g., XGBoost) and random forests.

Comments closed

Worrying over Columns or Rows

Published 2022-08-03 by Kevin Feasel

John Mount explains an attitude difference:

I say: if you are a data scientist or working on an analytics project, worry over columns not rows.
In analytics “rows” are instances, and “columns” are possible measurements. For example: each click on a website might generate a row recording the visit, and this row would be populated with columns describing what was clicked on (and if you are lucky there are more records recording what else was presented and not clicked on).

Read the whole thing. This is also why formats like Parquet and ORC are so popular for data analysis. Same goes for business intelligence people, who reason mostly over columns, leading to columnstore indexes being so useful.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31