Press "Enter" to skip to content

Category: Data Science

Paper Review: Moving Fast with Broken Data

Adnan Masood reviews a paper:

I recently came across an insightful research paper titled “Moving Fast With Broken Data” by Shreya Shankar, Labib Fawaz, Karl Gyllstrom, and Aditya G. Parameswaran from UC Berkeley and Meta. The paper addresses the significant issue of data corruption in machine learning (ML) pipelines, which often leads to decreased model accuracy. The authors present an automatic data validation system implemented at Meta that aims to solve this problem.
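The paper describes Meta's system in detail; as a loose illustration of the kind of check such a system automates, here is a minimal R sketch of a pre-training null-rate gate. The function name, columns, and threshold are hypothetical and not taken from the paper.

```r
# Minimal sketch of a pre-training data validation gate (hypothetical
# function name and threshold; this is not the system from the paper).
validate_features <- function(df, max_null_rate = 0.05) {
  null_rates <- sapply(df, function(col) mean(is.na(col)))
  failed <- names(null_rates)[null_rates > max_null_rate]
  if (length(failed) > 0) {
    stop("Null-rate check failed for: ", paste(failed, collapse = ", "))
  }
  invisible(TRUE)
}

# e.g., abort a training run if any feature is more than 5% missing:
# validate_features(training_data)
```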

Sounds like I have some beach reading.

Ed. Note: He’s kidding, right?

Ed. 2 Note: About going to the beach maybe.

Ed. & Ed. 2 Note: HAHAHAHAHAH.

Yeah, I hired Statler and Waldorf as my editors. Worst Best decision of my life.


Extending a tidyAML and Shiny App

Steven Sanderson wraps up a series on Shiny and tidyAML. Part 3 extends the options for regression:

As data science continues to be a sought-after field, creating a reliable and accurate model is essential. While there are various machine learning algorithms available, the process of selecting the correct algorithm can be complex. The {tidyAML} package, part of the tidymodels suite, offers an easy-to-use, consistent interface for building machine learning models. In this post, we will explore a Shiny application that utilizes tidyAML to build a machine learning model.

Today I have updated the tidyAML Shiny app to include the ability to set the .parsnip_fns parameter of the fast_regression() function, which takes values like linear_reg.
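If I'm reading the package correctly, that parameter narrows the set of parsnip model functions the app will try. Here's a hedged sketch of the call; the argument names come from my reading of the tidyAML docs, so check the package documentation for the exact signature.

```r
library(tidyAML)
library(recipes)

# Hedged sketch: restrict fast_regression() to one parsnip function via
# .parsnip_fns instead of the default "all" (verify argument names against
# the current tidyAML documentation).
rec <- recipe(mpg ~ ., data = mtcars)

models <- fast_regression(
  .data        = mtcars,
  .rec_obj     = rec,
  .parsnip_fns = "linear_reg",
  .parsnip_eng = c("lm", "glm")
)
```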

And part 4 includes classification:

This is a Shiny app for building models using the {tidyAML} package, which is based on the tidymodels framework in R. The app allows you to upload your own data or choose from one of two built-in datasets (mtcars or iris) and select the type of model you want to build (regression or classification).

Let’s take a closer look at the code.
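Before clicking through, here is a rough sketch of the input controls the post describes, not the app's actual code: choose a built-in dataset or upload your own, then pick the model type.

```r
library(shiny)

# Rough sketch of the described inputs (not the actual app code).
ui <- fluidPage(
  selectInput("dataset", "Dataset",
              choices = c("mtcars", "iris", "Upload my own")),
  fileInput("upload", "Upload a CSV", accept = ".csv"),
  radioButtons("model_type", "Model type",
               choices = c("Regression", "Classification"))
)

server <- function(input, output, session) {
  dat <- reactive({
    if (input$dataset == "Upload my own") {
      req(input$upload)
      read.csv(input$upload$datapath)
    } else {
      get(input$dataset)
    }
  })
}

shinyApp(ui, server)
```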

This was an interesting series, for sure.


Hybrid ML and Rules-Based Fraud Detection

Ayodeji Ogunlami mixes approaches:

In developing this hybrid system, sets of rules are required as well as a machine learning model. I would be making use of a vehicle insurance dataset from Kaggle in this demonstration.

The dataset can be downloaded from this link: https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection

The ML model would be built with a random forest classifier on Azure Databricks using PySpark.

This seems to be the most sensible approach, especially given how rare actual fraud incidents are and what that imbalance does to classification algorithms.
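For a flavor of what the hybrid looks like in practice, here is a rough R sketch of the scoring logic: hard rules first, model second. The article builds its model in PySpark on Databricks; the column names, thresholds, and model object below are invented.

```r
library(randomForest)

# Rough sketch of a hybrid rules-plus-model scorer (invented column names
# and thresholds; the article itself builds the model in PySpark).
# `rf_model` is assumed to be a randomForest classifier trained on labeled
# claims with a factor outcome that has a "fraud" level.
score_claim <- function(claim, rf_model, threshold = 0.5) {
  # Hard business rules fire first and short-circuit the model.
  if (claim$claim_amount > 100000 && claim$policy_age_days < 30) {
    return("flagged by rule")
  }
  # Otherwise fall back to the model's fraud probability.
  prob <- predict(rf_model, newdata = claim, type = "prob")[, "fraud"]
  if (prob > threshold) "flagged by model" else "pass"
}
```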


Passing the Buck: Hyperparameters Edition

John Mount is not a fan of hyperparameters:

In my opinion, one can see this scam, of hiding some debt in with an asset, spreading.

The earliest modeling systems, such as linear regression, had no hyper-parameters. An under-specified algorithm was not considered a fully specified method.

Click through for John’s thoughts on the matter. I’m sympathetic to this argument and want to bring in an extra point John didn’t make. With hyperparameter tuning, you also introduce the risk of spurious correlation between the label and input features. This is particularly relevant if changing the seed or making hyperparameter tweaks results in a major change in model effectiveness.
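One quick way to check for that failure mode, sketched below: refit the same model under a handful of seeds and look at the spread of holdout accuracy. This is a generic illustration of the check, not anything from John's post.

```r
library(randomForest)

# Sketch of a seed-sensitivity check: if accuracy swings a lot across
# seeds, apparent gains from hyperparameter tweaks may just be noise.
set.seed(2023)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

acc <- sapply(1:20, function(s) {
  set.seed(s)
  fit <- randomForest(Species ~ ., data = train, ntree = 50)
  mean(predict(fit, test) == test$Species)
})
round(c(mean = mean(acc), sd = sd(acc), spread = diff(range(acc))), 4)
```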


Estimating Simulation Variance when Running Stan Models in R

Sebastian Sauer takes a look at an interesting question:

stan_glm() allows for setting a seed value, thereby eliminating the variance induced by random numbers. However, if a seed is not used, how much variance is to be expected? This is the research question of this analysis.

Let’s choose n=100 repetitions in our simulation.
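The shape of the experiment is simple, something like the hedged sketch below (rstanarm's stan_glm() on a toy formula; running 100 fits will take a while):

```r
library(rstanarm)

# Hedged sketch of the experiment: fit the same model 100 times without a
# seed and see how much the point estimate wanders between runs.
runs <- replicate(100, {
  fit <- stan_glm(mpg ~ hp, data = mtcars, refresh = 0)
  coef(fit)[["hp"]]
})
sd(runs)   # between-run (simulation) standard deviation
```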

Click through for the demonstration, including a summary table and notes on installed packages for the sake of reproducibility.


Applying Quality Assurance Practices to Data Science

Devin Partida bridges the gap:

The world runs on data. Data scientists organize and make sense of a barrage of information, synthesizing and translating it so people can understand it. They drive the innovation and decision-making process for many organizations. But the quality of the data they use can greatly influence the accuracy of their findings, which directly impacts business outcomes and operations. That’s why data scientists must follow strong quality assurance practices.

Read on for seven practices which can help data scientists achieve better outcomes.


Thoughts on Linear Regression

John Mount shares some thoughts:

I want to spend some time thinking out loud about linear regression.

As a data science consultant and teacher I spend a lot of time using linear regression and teaching linear regression. I have found each of these pursuits can degenerate into mere doctrine or instructions: “do this,” “expect this,” “don’t do that,” “you should know,” and so on. What I want to do here is take a step back and think out loud about linear regression from first principles. To attempt this I am going to start with the problem linear regression solves, and try to delay getting to the things so important that “everybody should know them without question.” So let’s think about a few things in a particular order.

For thinking out loud, this is laid out rather well, so give it a read.


The Story behind Benford’s Law

John Cook gives us a dose of history and math:

In 1881, astronomer Simon Newcomb noticed something curious. The first pages in books of logarithms were dirty on the edge, while later pages were progressively cleaner. He inferred from this that people more often looked up the logarithms of numbers with small leading digits than with large leading digits.

Why might this be? One might reasonably expect the numbers that came up in work to be uniformly distributed. But as is often the case, it helps to ask “Uniform on what scale?”
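A quick way to see the answer to that question: numbers drawn uniformly on a log scale reproduce Benford's leading-digit frequencies, P(d) = log10(1 + 1/d). A small R sketch of my own, not from the post:

```r
# Numbers uniform on a log scale follow Benford's law for leading digits.
set.seed(1)
x <- 10^runif(1e5, min = 0, max = 6)        # uniform on the log scale
lead <- floor(x / 10^floor(log10(x)))       # leading digit of each number
observed <- as.numeric(table(lead)) / length(lead)
benford  <- log10(1 + 1 / (1:9))
round(rbind(observed, benford), 3)
```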

Read on for a bit more of the story behind Newcomb’s Benford’s law and a just-so story about differing bases.


Generating Nested Time Series Models

Steven Sanderson can’t stop at just one time series:

There are many approaches to modeling time series data in R. One of the types of data that we might come across is a nested time series. This means the data is simply grouped by one or more keys. There are many methods with which to accomplish this task. This will be a quick post, but if you want a longer, more detailed, and quite frankly better written one, then this is a really good article.

The quick post doesn’t include a lot of commentary but does show the code you’d use for the operation.
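If you just want the general shape of the pattern, here is a minimal tidyr/purrr sketch; this is my own illustration rather than the code from either linked post: nest the data by key, then fit one model per group.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Minimal nested-modeling sketch: toy data with one series per region,
# nested by key, with a simple trend model fit to each group.
sales <- expand.grid(region = c("north", "south"), month = 1:24) |>
  mutate(value = 100 + 2 * month + rnorm(n(), sd = 5))

fits <- sales |>
  group_by(region) |>
  nest() |>
  mutate(model = map(data, ~ lm(value ~ month, data = .x)))

fits
```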
