Press "Enter" to skip to content

Category: Data Science

Calculating Log Likelihood Ratios with jeva

Peter M.B. Cahusac takes us through a jamovi package:

Ever wanted to try doing an evidential analysis? You may have found it difficult to find a statistical platform to do it. Now there is the jamovi module jeva, which can provide log likelihood ratios for a range of common statistical tests.

Imagine for a moment that we wish to carry out a statistical test on our sample of data. We do not want to know whether the procedure we routinely use gives us the correct answer with a specified error rate (such as the Type I error) – the frequentist approach. Nor do we want to concern ourselves with possible a priori probabilities of hypotheses being true – the Bayesian approach. We need to know whether a statistic from this particular set of data is consistent with one or more hypothetical values. Also, let’s say that we weren’t happy with how much data we had collected (a familiar problem?), and just added more when convenient. Welcome to the likelihood (or evidential) approach!

Read on for an explanation and how to try jeva out.
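To make the idea concrete, here is a minimal Python sketch (not the jeva module itself, and using made-up data) of a log likelihood ratio comparing two hypothesised values of a binomial proportion:

    from scipy.stats import binom

    # Made-up data: 14 successes out of 20 trials
    successes, n = 14, 20

    # Two hypothesised values of the proportion to compare
    p1, p2 = 0.5, 0.7

    # Log likelihood ratio: support for p2 relative to p1, given these data
    log_lr = binom.logpmf(successes, n, p2) - binom.logpmf(successes, n, p1)
    print(f"log LR for p={p2} vs p={p1}: {log_lr:.2f}")

Positive values indicate support for p2 over p1, and because the evidence depends only on the data actually observed, collecting more data when convenient does not invalidate the analysis.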


Estimating Quantiles in Python

Christian Lorentzen digs into quantile calculation:

Applied statistics is dominated by the ubiquitous mean. For a change, this post is dedicated to quantiles. I will do my best to provide a good mix of theory and practical examples.

While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They appear in box-plots, in children's weight-for-age curves, in salary survey results, in risk measures like the value-at-risk in the EU-wide Solvency II framework for insurance companies, in quality control and in many more fields.

There are easy functions to calculate quantiles in R and Python; this post serves as a way of understanding the variety of quantile functions available and how they can affect results with small sample sizes.
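As a quick illustration of that variety (using NumPy here rather than the post's code), numpy.quantile exposes several estimators through its method argument, and they can disagree noticeably on small samples:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])   # tiny made-up sample

    # A few of the quantile definitions NumPy (>= 1.22) supports
    for method in ["linear", "lower", "higher", "nearest", "median_unbiased"]:
        print(f"{method:>16}: {np.quantile(x, 0.8, method=method)}")

With only five observations, the estimates of the 80th percentile range from 4.0 to 10.0 depending on the definition chosen.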


A Primer on Stan

Jack Kennedy explains the concepts of Stan and JAGS:

You may have used a probabilistic programming language (PPL) in the past, such as BUGS, to perform Bayesian inference. You've heard about Stan and want to learn a little more. Or maybe you're about to step into the Bayesian paradigm and don't know where to start. You want to know whether you should make the switch from JAGS to Stan, or you've used neither JAGS nor Stan and want to know which will suit you best. This post will focus solely on the differences between JAGS and Stan as I have experience with both of them, but there are many more PPLs out there. For example, I have never used Bean Machine, but of all the PPLs, it certainly takes the crown for best name.

Stan has been on my to-learn list for a while and I did successfully get one of my employees (a rassa-frassin’ frequentist) to use and enjoy the power of Bayesian analysis. One of these days, I’ll have to get back to it.


Approximation with the Mediant

John Cook didn’t make a typo:

Suppose you are trying to approximate some number x and you’ve got it sandwiched between two rational numbers:

a/b < x < c/d.

Now you’d like a better approximation. What would you do?

The obvious approach would be to take the average of a/b and c/d. That’s fine, except it could be a fair amount of work if you’re doing this in your head.

Read on for a separate approach taking the mediant (not median) of the two fractions.
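The mediant of a/b and c/d is (a + c)/(b + d), and for positive denominators it always lands strictly between the two fractions, so it tightens the sandwich with almost no arithmetic. A small Python sketch, squeezing bounds around pi:

    from fractions import Fraction
    import math

    def mediant(f1, f2):
        # Add numerators and denominators separately -- no common denominator
        # needed, which is what makes this easy to do in your head
        return Fraction(f1.numerator + f2.numerator,
                        f1.denominator + f2.denominator)

    lo, hi = Fraction(3, 1), Fraction(4, 1)   # pi is sandwiched between 3 and 4
    for _ in range(6):
        m = mediant(lo, hi)
        if m < math.pi:
            lo = m
        else:
            hi = m
        print(m, float(m))

After six steps this walk arrives at the familiar approximation 22/7.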


Q&A on Data Engineering

Dustin Vannoy talks to the mirror:

An aspiring data engineer recently reached out to me for some guidance on pivoting into the field from a software development background. The questions they asked are similar to what others have asked me in the past, so I decided to capture my responses here. I link to prior posts and other resources when possible to try and keep the responses brief. These are informal thoughts of mine, not something I have sat down to rethink and research for new ideas beyond what is already in my head.

Dustin is one of the best people to talk to about data engineering. Click through for his advice.


K-Fold Cross-Validation in Python

Shanthababu Pandian gives us a primer on k-fold cross-validation:

In each set (fold), training and testing are performed exactly once during the entire process. This helps us avoid overfitting: as we know, a model trained on all of the data in a single shot gives its best performance accuracy on that same data. Resisting this, k-fold cross-validation helps us build the model as a generalized one.

To achieve this k-fold cross-validation, we have to split the data set into three sets (Training, Testing, and Validation), with the challenge of the volume of the data.

Read on for the explanation and an example.
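A minimal scikit-learn sketch of the idea (illustrative only, not the code from the linked post):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # Five folds: each observation is used for testing exactly once
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf)
    print(scores, scores.mean())

The mean of the fold scores is a less optimistic (and more honest) estimate of performance than accuracy measured on the training data itself.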


2023 Data Professional Survey Results

Brent Ozar busts out the briefcase full of Benjamins:

Are your peers being paid more this year? Are they switching job roles? Are they planning on leaving their companies? To find out, I run a salary survey every year for folks in the database industry. Download the raw data here and slice & dice ’em to see what’s important to you.

As a quick note, however, remember that inflation in the US went up considerably. Inflation wasn't something we had to factor in from 2017 through 2020, as it was around 1.5-2%. In 2021, it increased to more than 4%, and in 2022 it was closer to 8-9%, so converting these figures from nominal (pre-inflation) to real (post-inflation) will help tell the full story.
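As a back-of-the-envelope example with made-up salary figures (the inflation rate here is a rough assumption, not an official CPI number):

    # Hypothetical salaries and an assumed ~8% inflation rate for 2022
    nominal = {2021: 100_000, 2022: 105_000}
    inflation_2022 = 0.08

    # Express the 2022 salary in 2021 dollars
    real_2022 = nominal[2022] / (1 + inflation_2022)
    print(f"2022 salary in 2021 dollars: ${real_2022:,.0f}")

A 5% nominal raise against roughly 8% inflation works out to a real pay cut of nearly 3%.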


Interpreting Linear Models with SHAP

Michael Mayer answers a question:

XGBoost models are often interpreted with SHAP (Shapley Additive eXplanations): Each of e.g. 1000 randomly selected predictions is fairly decomposed into contributions of the features using the extremely fast TreeSHAP algorithm, providing a rich interpretation of the model as a whole. TreeSHAP was introduced in the Nature publication by Lundberg and Lee (2020).

Can we do the same for non-tree-based models like a complex GLM or a neural network? Yes, but we have to resort to slower model-agnostic SHAP algorithms:

Read on for examples of those algorithms and an example of interpretation and analysis.
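As a rough sketch of what that looks like in Python (the dataset and model here are stand-ins, not Michael's example), the shap package's model-agnostic explainer needs only a prediction function and a background sample:

    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = Ridge().fit(X, y)

    # Model-agnostic SHAP: works for GLMs, neural networks, or anything else
    # that exposes a predict function, at the cost of more computation
    explainer = shap.Explainer(model.predict, X.sample(100, random_state=0))
    shap_values = explainer(X.head(200))
    shap.plots.beeswarm(shap_values)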


Multivariate Anomaly Detection with ADX

Adi Eldar shows off multivariate anomaly detection in Azure Data Explorer:

Azure Data Explorer (ADX) is commonly used for monitoring the performance and health of cloud resources and IoT devices. This is done by continuous collection of multiple metrics emitted by these sources, and ongoing analysis of the collected data to detect anomalies. The analysis is applied over time series of the relevant metrics in order to locate significant deviations of the metric values relative to their typical normal baseline pattern.

Click through for a nice overview of the topic, including two different scenarios: one which emphasizes time series data and one which does not.
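In ADX itself this runs in KQL; as a purely conceptual Python stand-in (not ADX code or Adi's example), the point of multivariate detection is to flag combinations of metric values that are individually unremarkable but jointly unusual:

    import numpy as np
    from sklearn.covariance import EllipticEnvelope

    rng = np.random.default_rng(0)
    cpu = rng.normal(40, 5, 500)                    # synthetic metric 1
    memory = 1.5 * cpu + rng.normal(0, 3, 500)      # correlated metric 2
    metrics = np.column_stack([cpu, memory])
    metrics[100] = [45, 48]                         # plausible alone, odd together

    # Fit a covariance-based envelope and flag the most unusual 1% of points
    labels = EllipticEnvelope(contamination=0.01, random_state=0).fit_predict(metrics)
    print(np.where(labels == -1)[0])

Point 100 sits comfortably inside each metric's individual range but far from their joint baseline, which is exactly the kind of deviation a per-metric detector misses.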


Fun with Decision Trees

Holger von Jouanne-Diedrich explains the value of decision trees, using predictive maintenance as an example:

Predictive Maintenance is one of the big revolutions happening across all major industries right now. Instead of changing parts regularly, or even only after they have failed, it uses Machine Learning methods to predict when a part is going to fail.

If you want to get an introduction to this fascinating developing area, read on!

Click through for an example of how it works.
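As a toy Python sketch of the idea (fabricated sensor data, not the example from the post):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(1)
    temperature = rng.normal(70, 10, 1000)
    vibration = rng.normal(3, 1, 1000)
    # Fabricated failure rule: hot or strongly vibrating parts tend to fail
    failed = ((temperature > 85) | (vibration > 5)).astype(int)

    X = np.column_stack([temperature, vibration])
    X_train, X_test, y_train, y_test = train_test_split(X, failed, random_state=0)

    tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print(export_text(tree, feature_names=["temperature", "vibration"]))
    print("held-out accuracy:", tree.score(X_test, y_test))

The printed rules read like a maintenance checklist, which is much of the appeal of decision trees over black-box models in this setting.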
