Data Science – Page 24

Model Diagnostics in Python

Published 2023-07-18 by Kevin Feasel

Christian Lorentzen has released a new package:

Version 1.0.0 of the new Python package for model-diagnostics was just released on PyPI. If you use (machine learning or statistical or other) models to predict a mean, median, quantile or expectile, this library offers tools to assess the calibration of your models and to compare and decompose predictive model performance scores.

This looks like a really useful package, so check it out.

Comments closed

Detecting AI-Generated Profile Photos

Published 2023-06-27 by Kevin Feasel

Shivansh Mundra, et al, report on some research:

With the rise of AI-generated synthetic media and text-to-image generated media, fake profiles have grown more sophisticated. And we’ve found that most members are generally unable to visually distinguish real from synthetically-generated faces, and future iterations of synthetic media are likely to contain fewer obvious artifacts, which might show up as slightly distorted facial features. To protect members from inauthentic interactions online, it is important that the forensic community develop reliable techniques to distinguish real from synthetic faces that can operate on large networks with hundreds of millions of daily users, like LinkedIn.

There are some interesting findings here.

Comments closed

Using SHAP to Gauge Geographic Effects in R or Python

Published 2023-06-08 by Kevin Feasel

Michael Mayer runs an analysis:

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

This post is heavily based on the new {shapviz} vignette.

I appreciate the effort to include both R and Python code in this analysis, and recommend you peruse both sets of code listings.

Comments closed

Poisson Hidden Markov Models in SAS

Published 2023-05-15 by Kevin Feasel

Ji Shen shows off how to perform discrete time series in SAS:

The HMM procedure in SAS Viya supports hidden Markov models (HMMs) and other models embedded with HMM. PROC HMM supports finite HMM, Poisson HMM, Gaussian HMM, Gaussian mixture HMM, the regime-switching regression model, and the regime-switching autoregression model. This post introduces Poisson HMM, the latest addition to PROC HMM in the SAS Viya 2023.03 release.

Count time series is ill-suited for most traditional time series analysis techniques, which assume that the time series values are continuously distributed. This can present unique challenges for organizations that need to model and forecast them. As a popular discrete probability distribution to handle the count time series, the Poisson distribution or the mixed Poisson distribution might not always be suitable. This is because both assume that the events occur independently of each other and at a constant rate. In time series data, however, the occurrence of an event at one point in time might be related to the occurrence of an event at another point in time, and the rates at which events occur might vary over time.

HMM is a valuable tool that can handle overdispersion and serial dependence in the data. This makes it an effective solution for modeling and forecasting count time series. We will explain how the Poisson HMM can handle count time series by modeling different states by using distinct Poisson distributions while considering the probability of transitioning between them.

Read on for an overview of Hidden Markov Models (in general and the Poisson variation in particular) and some of the challenges you can run into when performing this test.

Comments closed

Paper Review: Moving Fast with Broken Data

Published 2023-05-02 by Kevin Feasel

Adnan Masood reviews a paper:

I recently came across an insightful research paper titled “Moving Fast With Broken Data” by Shreya Shankar, Labib Fawaz, Karl Gyllstrom, and Aditya G. Parameswaran from UC Berkeley and Meta. The paper addresses the significant issue of data corruption in machine learning (ML) pipelines, which often leads to decreased model accuracy. The authors present an automatic data validation system implemented at Meta that aims to solve this problem.

Sounds like I have some beach reading.

Ed. Note: He’s kidding, right?

Ed. 2 Note: About going to the beach maybe.

Ed. & Ed. 2 Note: HAHAHAHAHAH.

Yeah, I hired Statler and Waldorf as my editors. ~~Worst~~ Best decision of my life.

Comments closed

Extending a tinyAML and shiny App

Published 2023-05-02 by Kevin Feasel

Steven Sanderson wraps up a series on shiny and tinyAML. Part 3 extends options for regression:

As data science continues to be a sought-after field, creating a reliable and accurate model is essential. While there are various machine learning algorithms available, the process of selecting the correct algorithm can be complex. The {tidyAML} package, part of the tidymodels suite, offers an easy-to-use, consistent interface for building machine learning models. In this post, we will explore a Shiny application that utilizes tidyAML to build a machine learning model.

Today I have updated the tidyAML shiny app to include the ability to set the parameter of the fast_regression() function .parsnip_fns and this is things like linear_reg.

And part 4 includes classification:

This is a Shiny app for building models using the {tidyAML} which is based on the tidymodels package in R. The app allows you to upload your own data or choose from one of two built-in datasets (mtcars or iris) and select the type of model you want to build (regression or classification).

Let’s take a closer look at the code.

This was an interesting series, for sure.

Comments closed

Hybrid ML and Rules-Based Fraud Detection

Published 2023-04-28 by Kevin Feasel

Ayodeji Ogunlami mixes approaches:

In developing this hybrid system, sets of rules are required as well as a machine learning model. I would be making use of a vehicle insurance dataset from Kaggle in this demonstration.

The dataset can be downloaded from this link: https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection

The ML model would be built using a random forest classifier on Azure Databricks using Pyspark.

This seems to be the most sensible approach, especially given how rare actual fraud incidents are and what that imbalance does to classification algorithms.

Comments closed

Passing the Buck: Hyperparameters Edition

Published 2023-04-25 by Kevin Feasel

John Mount is not a fan of hyperparamters:

In my opinion one can see this scam of hiding some debt in with an asset spreading.

Earliest modeling systems, such as linear regression, had no hyper-parameters. An under specified algorithm was not considered a fully specified method.

Click through for John’s thoughts on the matter. I’m sympathetic to this argument and want to bring in an extra point John didn’t make. With hyperparameter tuning, you also introduce the risk of spurious correlation between the label and input features. This is particularly relevant if changing the seed or making hyperparameter tweaks results in a major change in model effectiveness.

Comments closed

Estimating Simulation Variance when Running Stan Models in R

Published 2023-03-23 by Kevin Feasel

Sebastian Sauer takes a look at an interesting question:

stan_glm() allows for setting a seed value thereby eliminating the variance induced by random numbers. However, in case a seed is not used, how much variance is to be expected? This is the research question of this analysis.

Let’s choose n=100 repetitions in our simulation.

Click through for the demonstration, including a summary table and notes on installed packages for the sake of reproducibility.

Comments closed

Estimating Dataset Requirements for Data Science Projects

Published 2023-03-22 by Kevin Feasel

John Mount answers a common question:

We will phrase the problem in terms of a measurement of error. For example, let’s try to achieve a given square error in a regression problem. That is, how hard it it to estimate a number given a sample from a larger population?

From there, John covers several scenarios and provides us guidance. The answer isn’t “30 observations.” Usually.

Comments closed

Category: Data Science