Data Science – Page 23

The Story behind Benford’s Law

Published 2023-03-13 by Kevin Feasel

John Cook gives us a dose of history and math:

In 1881, astronomer Simon Newcomb noticed something curious. The first pages in books of logarithms were dirty on the edge, while the pages became progressively cleaner in later pages. He inferred from this that people more often looked up the logarithms of numbers with small leading digits than with large leading digits.

Why might this be? One might reasonably expect the numbers that came up in work to be uniformly distributed. But as often the case, it helps to ask “Uniform on what scale?”

Read on for a bit more of the story behind ~~Newcomb’s~~ Benford’s law and a just-so story about differing bases.

Comments closed

Generating Nested Time Series Models

Published 2023-03-01 by Kevin Feasel

Steven Sanderson can’t stop at just one time series:

There are many approaches to modeling time series data in R. One of the types of data that we might come across is a nested time series. This means the data is grouped simply by one or more keys. There are many methods in which to accomplish this task. This will be a quick post, but if you want a longer more detailed and quite frankly well written out one, then this is a really good article

The quick post doesn’t include a lot of commentary but does show the code you’d use for the operation.

Comments closed

Calculating Log Likelihood Ratios with jeva

Published 2023-02-23 by Kevin Feasel

Peter M.B. Cahusac takes us through a jamovi package:

Ever wanted to try doing an evidential analysis? You may have found it difficult to find a statistical platform to do it. Now there is the jamovi module jeva which can provide log likelihood ratios for a range of common statistical tests.

Imagine for a moment that we wish to carry out a statistical test on our sample of data. We do not want to know whether the procedure we routinely use gives us the correct answer with a specified error rate (such as the Type I error) – the frequentist approach. Nor do we want to concern ourselves with possible a priori probabilities of hypotheses being true – the Bayesian approach. We need to know whether a statistic from this particular set of data is consistent with one or more hypothetical values. Also, let’s say that we weren’t happy with how much data we had collected (a familiar problem?), and just added more when convenient. Welcome to the likelihood (or evidential) approach!

Read on for an explanation and how to try jeva out.

Comments closed

Estimating Quantiles in Python

Published 2023-02-16 by Kevin Feasel

Christian Lorentzen digs into quantile calculation:

Applied statistics is dominated by the ubiquitous mean. For a change, this post is dedicated to quantiles. I will give my best to provide a good mix of theory and practical examples.

While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They appear in box-plots, in childrens’ weight-for-age curves, in salary survey results, in risk measures like the value-at-risk in the EU-wide solvency II framework for insurance companies, in quality control and in many more fields.

There are easy functions to calculate quantiles in R and Python; this post serves as a way of understanding the variety of quantile functions available and how they can affect results with small sample sizes.

Comments closed

A Primer on Stan

Published 2023-02-10 by Kevin Feasel

Jack Kennedy explains the concepts of Stan and JAGS:

You may have used a probabilistic programming language (PPL) in the past, such as BUGS, to perform Bayesian inference. You’ve heard about Stan and want to learn a little more. Or maybe you’re about to step into the Bayesian paradigm and don’t know where to start. You want to know whether you should make the switch from JAGS to Stan, or you’ve used neither of JAGS or Stan and want to know which will suit you best. This post will focus solely on the differences between JAGS and Stan as I have experience with both of them, but there are many more PPLs out there. For example, I have never used Bean Machine, but of all the PPLs, it certainly takes the crown for best name.

Stan has been on my to-learn list for a while and I did successfully get one of my employees (a rassa-frassin’ frequentist) to use and enjoy the power of Bayesian analysis. One of these days, I’ll have to get back to it.

Comments closed

Approximation with the Mediant

Published 2023-02-08 by Kevin Feasel

John Cook didn’t make a typo:

Suppose you are trying to approximate some number x and you’ve got it sandwiched between two rational numbers:

a/b < x < c/d.

Now you’d like a better approximation. What would you do?

The obvious approach would be to take the average of a/b and c/d. That’s fine, except it could be a fair amount of work if you’re doing this in your head.

Read on for a separate approach taking the mediant (not median) of the two fractions.

Comments closed

Q&A on Data Engineering

Published 2023-01-23 by Kevin Feasel

Dustin Vannoy talks to the mirror:

An aspiring data engineer recently reached out to me for some guidance on pivoting into the field from a software development background. The questions they asked are similar to what others have asked me in the past, so I decided to capture my responses here. I link to prior posts and other resources when possible to try and keep the responses brief. These are informal thoughts of mine, not something I have sat down to rethink and research for new ideas beyond what is already in my head.

Dustin is one of the best people to talk to about data engineering. Click through for his advice.

Comments closed

K-Fold Cross-Validation in Python

Published 2023-01-05 by Kevin Feasel

Shanthababu Pandian gives us a primer on k-fold cross-validation:

In each set (fold) training and the test would be performed precisely once during this entire process. It helps us to avoid overfitting. As we know when a model is trained using all of the data in a single shot and gives the best performance accuracy. Resisting this k-fold cross-validation helps us to build the model as a generalized one.

To achieve this K-Fold Cross Validation, we have to split the data set into three sets, Training, Testing, and Validation, with the challenge of the volume of the data.

Read on for the explanation and an example.

Comments closed

2023 Data Professional Survey Results

Published 2023-01-04 by Kevin Feasel

Brent Ozar busts out the briefcase full of Benjamins:

Are your peers being paid more this year? Are they switching job roles? Are they planning on leaving their companies? To find out, I run a salary survey every year for folks in the database industry. Download the raw data here and slice & dice ’em to see what’s important to you.

As a quick note, however, remember that inflation in the US went up considerably. Inflation wasn’t something we had to factor in from 2017-2021, as it was 1.5-2%. In 2021, it increased to more than 4% and in 2022 was closer to 8-9%, so converting these from nominal (pre-inflation) to real (post-inflation) will help tell the full story.

Comments closed

Interpreting Linear Models with SHAP

Published 2022-12-27 by Kevin Feasel

Michael Mayer answers a question:

XGBoost models are often interpreted with SHAP (Shapley Additive eXplanations): Each of e.g. 1000 randomly selected predictions is fairly decomposed into contributions of the features using the extremely fast TreeSHAP algorithm, providing a rich interpretation of the model as a whole. TreeSHAP was introduced in the Nature publication by Lundberg and Lee (2020).

Can we do the same for non-tree-based models like a complex GLM or a neural network? Yes, but we have to resort to slower model-agnostic SHAP algorithms:

Read on for examples of those algorithms and an example of interpretation and analysis.

Comments closed

Category: Data Science