Category: Data Science

2023 Data Professional Survey Results

Brent Ozar busts out the briefcase full of Benjamins:

Are your peers being paid more this year? Are they switching job roles? Are they planning on leaving their companies? To find out, I run a salary survey every year for folks in the database industry. Download the raw data here and slice & dice ’em to see what’s important to you.

As a quick note, however, remember that US inflation went up considerably. Inflation wasn’t something we had to factor in from 2017 through 2020, as it ran 1.5-2%. In 2021, it increased to more than 4%, and in 2022 it was closer to 8-9%, so converting these numbers from nominal (pre-inflation) to real (post-inflation) dollars will help tell the full story.
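
As a rough sketch of that conversion in Python (the CPI index values below are approximate annual CPI-U averages, used purely for illustration; check the BLS for exact figures):

# Convert nominal salaries to real (inflation-adjusted) salaries.
# CPI values are approximate annual CPI-U averages, for illustration only.
cpi = {2017: 245.1, 2021: 271.0, 2022: 292.7}

def to_real(nominal_salary, year, base_year=2017):
    # Express a salary in base_year dollars by deflating with the CPI ratio.
    return nominal_salary * cpi[base_year] / cpi[year]

# A $100,000 salary in 2022 buys roughly what $83,700 did in 2017.
print(round(to_real(100_000, 2022)))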

Interpreting Linear Models with SHAP

Michael Mayer answers a question:

XGBoost models are often interpreted with SHAP (Shapley Additive eXplanations): Each of e.g. 1000 randomly selected predictions is fairly decomposed into contributions of the features using the extremely fast TreeSHAP algorithm, providing a rich interpretation of the model as a whole. TreeSHAP was introduced in the Nature publication by Lundberg and Lee (2020).

Can we do the same for non-tree-based models like a complex GLM or a neural network? Yes, but we have to resort to slower model-agnostic SHAP algorithms:

Read on for examples of those algorithms and an example of interpretation and analysis.
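
As a minimal sketch of the model-agnostic route in Python, using the shap package on a linear model (the post covers the details, plus the R side):

import numpy as np
import shap
from sklearn.linear_model import LinearRegression

# Toy data and a non-tree model.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)
model = LinearRegression().fit(X, y)

# Model-agnostic Kernel SHAP: decompose predictions against a background sample.
background = shap.sample(X, 100)
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:10])  # explain the first ten predictions

Because the model is linear and the features are independent, each SHAP value should closely match the coefficient times the centered feature value, which makes a linear model a nice sanity check for the algorithm.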

Multivariate Anomaly Detection with ADX

Adi Eldar shows off multivariate anomaly detection in Azure Data Explorer:

Azure Data Explorer (ADX) is commonly used for monitoring the performance and health of cloud resources and IoT devices. This is done by continuous collection of multiple metrics emitted by these sources, and ongoing analysis of the collected data to detect anomalies. The analysis is applied over time series of the relevant metrics in order to locate significant deviations of the metric values relative to their typical normal baseline pattern.

Click through for a nice overview of the topic, including two different scenarios: one which emphasizes time series data and one which does not.
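
ADX exposes this through its own functions, but the underlying idea — flag timestamps whose joint behavior across metrics deviates from the learned baseline — can be sketched in Python with scikit-learn (an illustration of the technique, not the ADX implementation):

import numpy as np
from sklearn.covariance import EllipticEnvelope

# Rows are timestamps, columns are metrics (e.g. CPU, memory, latency).
rng = np.random.default_rng(1)
metrics = rng.normal(size=(1000, 3))
metrics[500] = [8.0, -7.0, 9.0]  # inject one jointly anomalous observation

# Fit a baseline of normal joint behavior and flag large deviations from it.
detector = EllipticEnvelope(contamination=0.01).fit(metrics)
labels = detector.predict(metrics)  # -1 marks anomalies
print(np.where(labels == -1)[0])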

Fun with Decision Trees

Holger von Jouanne-Diedrich explains the value of decision trees, using predictive maintenance as an example:

Predictive Maintenance is one of the big revolutions happening across all major industries right now. Instead of changing parts regularly, or even only after they have failed, it uses Machine Learning methods to predict when a part is going to fail.

If you want to get an introduction to this fascinating developing area, read on!

Click through for an example of how it works.
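
For a flavor of the technique, here is a minimal scikit-learn sketch on made-up sensor readings (Holger’s own example is in R; the data and thresholds below are invented):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Fake sensor data: parts tend to fail when temperature and vibration both run high.
rng = np.random.default_rng(42)
temperature = rng.uniform(40, 100, 500)
vibration = rng.uniform(0, 10, 500)
failed = ((temperature > 85) & (vibration > 6)).astype(int)

X = np.column_stack([temperature, vibration])
tree = DecisionTreeClassifier(max_depth=2).fit(X, failed)
print(export_text(tree, feature_names=["temperature", "vibration"]))

The printed tree reads as human-checkable maintenance rules, which is a large part of the appeal of decision trees in this setting.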

Difficulties around A/B Testing

John Cook asks which is clearer: 1 or 2? 3 or 4? 4 or 6?

One problem with A/B testing is that your results may depend on the order of your tests.

Suppose you’re testing three options: X, Y, and Z. Let’s say you have three market segments, equal in size, each with the following preferences.

This is known as the Condorcet paradox of voting.
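
The preference table itself is in John’s post, but the classic profile that produces the cycle is easy to check in Python (the three segments below are an illustration, not necessarily John’s exact table):

from itertools import combinations

# Three equal-sized segments with cyclic preferences (most-preferred first).
segments = [["X", "Y", "Z"], ["Y", "Z", "X"], ["Z", "X", "Y"]]

for a, b in combinations("XYZ", 2):
    wins_a = sum(s.index(a) < s.index(b) for s in segments)
    winner = a if wins_a > len(segments) - wins_a else b
    print(f"{a} vs {b}: majority prefers {winner}")

# X beats Y and Y beats Z, yet Z beats X -- so the order in which you run
# the pairwise tests determines which option ends up the "winner".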

John also introduces the problem of interaction effects:

Suppose you’re debating between putting a photo of a car or a truck on your web site, and you’re debating between whether the vehicle should be red or blue. You decide to use A/B testing, so you test whether customers prefer a red truck or a blue truck. They prefer the blue truck. Then you test whether customers prefer a blue truck or a blue car. They prefer the blue truck.

Maybe customers would prefer a red car best of all, but you didn’t test that option. By testing vehicle type and color separately, you didn’t learn about the interaction of vehicle type and color. 
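
A tiny numeric illustration of what the one-factor-at-a-time approach misses (the scores are hypothetical):

# Hypothetical preference scores for each (vehicle, color) combination.
scores = {("truck", "red"): 5, ("truck", "blue"): 7,
          ("car", "red"): 9, ("car", "blue"): 4}

# Sequential A/B: red truck vs blue truck -> blue truck wins (7 > 5).
# Then blue truck vs blue car -> blue truck wins again (7 > 4).
# The full factorial reveals the never-tested red car is actually best:
best = max(scores, key=scores.get)
print(best, scores[best])  # ('car', 'red') 9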

Click through for both posts as well as some good insights.

Appending Rows to a Pandas DataFrame

Matt Eland acquires some rows that fell off a truck:

Recently I was working on comparing the performance of different machine learning models and I wanted to add entries to a Pandas DataFrame as I evaluated each model. What I found was that adding new rows to a Pandas DataFrame was a little harder than I suspected and required some mild searching, so I wanted to preserve the two solutions I found here in case it helps someone else.

Read on for those two solutions, though as Matt points out, only one of them is a good solution.
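
Without spoiling which is which: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the pattern that holds up is pd.concat. A minimal sketch:

import pandas as pd

results = pd.DataFrame(columns=["model", "accuracy"])

# Build a one-row DataFrame per entry and concatenate; append() no longer exists.
for model, acc in [("tree", 0.81), ("boosted", 0.88)]:
    row = pd.DataFrame([{"model": model, "accuracy": acc}])
    results = pd.concat([results, row], ignore_index=True)

print(results)

If there are many rows, collecting dicts in a list and calling pd.DataFrame once at the end is cheaper than concatenating inside the loop.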

Bitemporal Modeling and Running Totals

John Mount solves a running total problem in Python:

An example of this is wanting to know how many reservations for a San Francisco Symphony concert scheduled for December 4th, 2022 are known to have been made by October 22nd, 2022. This could be used as part of an attendance demand model that is evaluated on October 22nd, 2022. The “fifty-cent word” for this is “bitemporal” modeling or data.

As I read through the solution, my initial thought is that, if the data is in a relational database, a running total operation SUM(reservation_count) OVER (PARTITION BY target_date ORDER BY action_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) would form the basis of a solution. Still, this is an interesting exercise in translating a SQL operation into equivalent Python, and a reminder of just how much we get to take for granted.
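
For comparison, the pandas equivalent of that window function is a grouped cumulative sum (column names assumed to mirror the SQL):

import pandas as pd

# Assumed columns mirroring the SQL: target_date, action_date, reservation_count.
df = pd.DataFrame({
    "target_date": ["2022-12-04"] * 3,
    "action_date": ["2022-10-20", "2022-10-21", "2022-10-22"],
    "reservation_count": [10, 5, 8],
})

# Equivalent of SUM(...) OVER (PARTITION BY target_date ORDER BY action_date ...):
df = df.sort_values(["target_date", "action_date"])
df["running_total"] = df.groupby("target_date")["reservation_count"].cumsum()
print(df)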

Kernel SHAP in R and Python

Michael Mayer and Christian Lorentzen team up:

SHAP is one of the most used model interpretation techniques in Machine Learning. It decomposes predictions into additive contributions of the features in a fair way. For tree-based methods, the fast TreeSHAP algorithm exists. For general models, one has to resort to computationally expensive Monte Carlo sampling or the faster Kernel SHAP algorithm. Kernel SHAP uses a regression trick to get the SHAP values of an observation with a comparatively small number of calls to the predict function of the model. Still, it is much slower than TreeSHAP.

Read on to see how to do this in both R and Python. With libraries the way they are, the code is very similar and the results are basically the same.

Finding Near-Duplicates in a Corpus

Estelle Wang de-dupes text data:

Building a large high-quality corpus for Natural Language Processing (NLP) is not for the faint of heart. Text data can be large, cumbersome, and unwieldy, and unlike clean numbers or categorical data in rows and columns, discerning differences between documents can be challenging. In organizations where documents are shared, modified, and shared again before being saved in an archive, the problem of duplication can become overwhelming.

To find exact duplicates, matching all string pairs is the simplest approach, but it is not a very efficient or sufficient technique. Using the MD5 or SHA-1 hash algorithms can get us a correct outcome at a faster speed, yet near-duplicates would still not be on the radar. Text similarity is useful for finding files that look alike. There are various approaches to this and each of them has its own way to define documents that are considered duplicates. Furthermore, the definition of duplicate documents has implications for the type of processing and the results produced. Below are some of the options.

Click through for solutions in SAS.
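
The post’s solutions are in SAS, but the shape of a near-duplicate check translates anywhere. A minimal Python sketch using character shingles and Jaccard similarity:

def shingles(text, k=5):
    # Set of overlapping k-character substrings, with whitespace normalized.
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

doc1 = "The quarterly report was filed on Monday."
doc2 = "The quarterly report was filed on Tuesday."
print(jaccard(shingles(doc1), shingles(doc2)))  # close to 1.0 -> near-duplicate

At corpus scale, comparing all pairs is quadratic, which is why techniques like MinHash and locality-sensitive hashing exist to narrow the candidate pairs first.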

The Basics of Automating Data Cleaning

Vincent Granville provides some guidance:

To the junior data scientist, it looks like each new dataset comes with a new set of challenges. It seems that you cannot automate data cleaning. To decision makers and stakeholders, this problem is so remote that they don’t even know the amount of resources wasted on it. To them, it seems obvious that automation is the way to go, but they may underestimate the challenges. It is usually not a high priority in many organizations, despite how much money it costs.

Yet, there are at most a few dozen issues that come with data cleaning. Not a few thousand, not a few hundred. You can catalog them and address all of them at once with a piece of code, one that you can reuse each time you face a new dataset. I describe here the main issues and how to address them. Automating the data cleaning step can save you a lot of time and eliminate boring, repetitive tasks, making your data scientists happier.

Click through for Vincent’s thoughts and recommendations.
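
As a sketch of what cataloging the issues and addressing them in reusable code can look like in Python (the specific checks below are illustrative, not Vincent’s list):

import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Apply a reusable catalog of common fixes to any incoming DataFrame.
    df = df.copy()
    df.columns = df.columns.str.strip().str.lower()            # inconsistent headers
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()                          # stray whitespace
        df[col] = df[col].replace({"": None, "N/A": None})     # disguised missing values
    return df.drop_duplicates()                                # duplicate rows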
