Press "Enter" to skip to content

Category: Data Science

Hybrid ML and Rules-Based Fraud Detection

Ayodeji Ogunlami mixes approaches:

In developing this hybrid system, a set of rules is required as well as a machine learning model. I will be making use of a vehicle insurance dataset from Kaggle in this demonstration.

The dataset can be downloaded from this link: https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection

The ML model will be built using a random forest classifier on Azure Databricks with PySpark.

This seems to be the most sensible approach, especially given how rare actual fraud incidents are and what that imbalance does to classification algorithms.
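
To give a rough sense of what that hybrid shape looks like in practice, here is a minimal PySpark sketch, not Ayodeji's actual code: the rule conditions, column names, and file name are my assumptions about the Kaggle dataset.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("hybrid-fraud").getOrCreate()
claims = spark.read.csv("fraud_oracle.csv", header=True, inferSchema=True)

# Rules layer: flag claims that trip simple, auditable conditions.
claims = claims.withColumn(
    "rule_flag",
    F.when(
        (F.col("Days_Policy_Claim") == "none")
        | (F.col("PastNumberOfClaims") == "more than 4"),
        1,
    ).otherwise(0),
).withColumn("label", F.col("FraudFound_P").cast("double"))

# ML layer: a random forest over a few numeric features (categorical
# encoding omitted here for brevity).
assembler = VectorAssembler(
    inputCols=["Age", "RepNumber", "Deductible", "DriverRating"],
    outputCol="features",
)
train, test = assembler.transform(claims).randomSplit([0.8, 0.2], seed=42)

# Given how rare fraud labels are, a weightCol or resampling step would
# likely be needed in a real pipeline to handle the class imbalance.
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=200)
model = rf.fit(train)

# Hybrid decision: investigate when either the rules or the model fire.
scored = model.transform(test).withColumn(
    "investigate",
    F.when((F.col("rule_flag") == 1) | (F.col("prediction") == 1.0), 1).otherwise(0),
)
```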


Passing the Buck: Hyperparameters Edition

John Mount is not a fan of hyperparameters:

In my opinion, one can see this scam of hiding some debt in with an asset spreading.

The earliest modeling systems, such as linear regression, had no hyperparameters. An under-specified algorithm was not considered a fully specified method.

Click through for John’s thoughts on the matter. I’m sympathetic to this argument and want to bring in an extra point John didn’t make. With hyperparameter tuning, you also introduce the risk of spurious correlation between the label and input features. This is particularly relevant if changing the seed or making hyperparameter tweaks results in a major change in model effectiveness.
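
That seed-sensitivity point is easy to demonstrate. Here is a toy example of my own, not John's, showing how much a cross-validated score can move when nothing changes but the random seed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Re-run the same model with ten different seeds and watch the spread.
scores = [
    cross_val_score(GradientBoostingClassifier(random_state=seed), X, y, cv=5).mean()
    for seed in range(10)
]
print(f"mean accuracy: {np.mean(scores):.3f}, spread: {np.ptp(scores):.3f}")
```

If that spread rivals the gains from tuning, the "best" hyperparameters are probably fitting noise.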


Estimating Simulation Variance when Running Stan Models in R

Sebastian Sauer takes a look at an interesting question:

stan_glm() allows for setting a seed value, thereby eliminating the variance induced by random numbers. However, if a seed is not used, how much variance is to be expected? This is the research question of this analysis.

Let’s choose n=100 repetitions in our simulation.

Click through for the demonstration, including a summary table and notes on installed packages for the sake of reproducibility.
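
Sebastian's code is R with stan_glm(), so treat this as a loose Python analogue of the idea rather than his method: refit the same model n times without fixing a seed, then summarize how much the estimates move.

```python
import numpy as np

rng = np.random.default_rng()  # deliberately unseeded, like omitting seed in stan_glm()

# Fixed dataset; only the fitting procedure is stochastic.
x = np.linspace(0, 1, 50)
y = 2 * x + np.random.default_rng(0).normal(scale=0.5, size=x.size)

def fit_once() -> float:
    """One stand-in 'fit': a bootstrap-perturbed OLS slope estimate."""
    idx = rng.integers(0, x.size, x.size)  # resampling stands in for sampler noise
    slope, _intercept = np.polyfit(x[idx], y[idx], 1)
    return slope

estimates = np.array([fit_once() for _ in range(100)])  # n = 100 repetitions
print(f"mean = {estimates.mean():.3f}, sd = {estimates.std(ddof=1):.3f}")
```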


Applying Quality Assurance Practices to Data Science

Devin Partida bridges the gap:

The world runs on data. Data scientists organize and make sense of a barrage of information, synthesizing and translating it so people can understand it. They drive the innovation and decision-making process for many organizations. But the quality of the data they use can greatly influence the accuracy of their findings, which directly impacts business outcomes and operations. That’s why data scientists must follow strong quality assurance practices.

Read on for seven practices which can help data scientists achieve better outcomes.


Thoughts on Linear Regression

John Mount shares some thoughts:

I want to spend some time thinking out loud about linear regression.

As a data science consultant and teacher I spend a lot of time using linear regression and teaching linear regression. I have found each of these pursuits can degenerate into mere doctrine or instructions: “do this,” “expect this,” “don’t do that,” “you should know,” and so on. What I want to do here is take a step back and think out loud about linear regression from first principles. To attempt this I am going to start with the problem linear regression solves, and try to delay getting to the things so important that “everybody should know them without question.” So let’s think about a few things in a particular order.

For thinking out loud, this is laid out rather well, so give it a read.
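
If you like having the starting point in front of you, the problem linear regression solves is choosing coefficients beta to minimize ||y - X beta||^2. A from-first-principles sketch (mine, not John's):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + one feature
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.2, size=100)

# Least squares: solve min over beta of ||y - X beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.0, 3.0]
```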


The Story behind Benford’s Law

John Cook gives us a dose of history and math:

In 1881, astronomer Simon Newcomb noticed something curious. The first pages in books of logarithms were dirty on the edge, while the pages became progressively cleaner in later pages. He inferred from this that people more often looked up the logarithms of numbers with small leading digits than with large leading digits.

Why might this be? One might reasonably expect the numbers that came up in work to be uniformly distributed. But as is often the case, it helps to ask “Uniform on what scale?”

Read on for a bit more of the story behind Newcomb’s discovery of Benford’s law and a just-so story about differing bases.
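
To see the “uniform on what scale?” answer play out, here is a quick simulation of my own (not from the post): numbers uniform on a log scale have leading digits following Benford’s law, P(d) = log10(1 + 1/d).

```python
import numpy as np

rng = np.random.default_rng(1)
# Sample uniformly on a log scale spanning six orders of magnitude.
values = 10 ** rng.uniform(0, 6, size=100_000)
leading_digits = np.array([int(str(v)[0]) for v in values])

for d in range(1, 10):
    observed = np.mean(leading_digits == d)
    expected = np.log10(1 + 1 / d)
    print(f"{d}: observed {observed:.3f}, Benford {expected:.3f}")
```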


Generating Nested Time Series Models

Steven Sanderson can’t stop at just one time series:

There are many approaches to modeling time series data in R. One type of data we might come across is a nested time series, meaning the data is grouped by one or more keys. There are many methods with which to accomplish this task. This will be a quick post, but if you want a longer, more detailed, and quite frankly better-written one, then this is a really good article.

The quick post doesn’t include a lot of commentary but does show the code you’d use for the operation.
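
Steven's code is in R; for a language-neutral flavor of the nested idea, here is a minimal Python sketch of mine that fits one model per group key:

```python
import numpy as np
import pandas as pd

# Toy nested data: two series distinguished by the "id" key (hypothetical structure).
df = pd.DataFrame({
    "id": ["a"] * 24 + ["b"] * 24,
    "t": list(range(24)) * 2,
    "value": np.concatenate([np.arange(24) * 2.0, np.arange(24) * -1.0]),
})

def fit_trend(group: pd.DataFrame) -> pd.Series:
    # Stand-in for a real forecasting model: a per-group linear trend.
    slope, intercept = np.polyfit(group["t"], group["value"], 1)
    return pd.Series({"slope": slope, "intercept": intercept})

# One fitted model per key.
models = df.groupby("id")[["t", "value"]].apply(fit_trend)
print(models)
```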


Calculating Log Likelihood Ratios with jeva

Peter M.B. Cahusac takes us through a jamovi package:

Ever wanted to try doing an evidential analysis? You may have found it difficult to find a statistical platform to do it. Now there is the jamovi module jeva which can provide log likelihood ratios for a range of common statistical tests.

Imagine for a moment that we wish to carry out a statistical test on our sample of data. We do not want to know whether the procedure we routinely use gives us the correct answer with a specified error rate (such as the Type I error) – the frequentist approach. Nor do we want to concern ourselves with possible a priori probabilities of hypotheses being true – the Bayesian approach. We need to know whether a statistic from this particular set of data is consistent with one or more hypothetical values. Also, let’s say that we weren’t happy with how much data we had collected (a familiar problem?), and just added more when convenient. Welcome to the likelihood (or evidential) approach!

Read on for an explanation and how to try jeva out.
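
jeva lives inside jamovi, but the core quantity is easy to compute anywhere. A hypothetical binomial example of mine (not the module's output): the log likelihood ratio measuring support for one proportion over another, given the same data.

```python
from scipy.stats import binom

k, n = 60, 100     # observed successes out of n trials
p0, p1 = 0.5, 0.6  # two hypothetical values to compare

# Log likelihood ratio: positive values favor p1 over p0.
log_lr = binom.logpmf(k, n, p1) - binom.logpmf(k, n, p0)
print(f"support for p = 0.6 over p = 0.5: {log_lr:.2f}")
```

Note that, unlike a p-value, this comparison does not depend on a stopping rule, which is why adding data when convenient is fair game in the evidential approach Peter describes.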


Estimating Quantiles in Python

Christian Lorentzen digs into quantile calculation:

Applied statistics is dominated by the ubiquitous mean. For a change, this post is dedicated to quantiles. I will do my best to provide a good mix of theory and practical examples.

While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They appear in box plots, in children’s weight-for-age curves, in salary survey results, in risk measures like the value-at-risk in the EU-wide Solvency II framework for insurance companies, in quality control, and in many more fields.

There are easy functions to calculate quantiles in R and Python; this post serves as a way of understanding the variety of quantile functions available and how they can affect results with small sample sizes.
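
As a small taste of that small-sample sensitivity, here is a quick example of my own (not Christian's) showing how NumPy's interpolation method changes a single quantile estimate:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # tiny sample with one large value
for method in ["linear", "lower", "higher", "nearest", "median_unbiased"]:
    print(method, np.quantile(x, 0.9, method=method))
```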
