Category: Data Science

Applying Quality Assurance Practices to Data Science

Devin Partida bridges the gap:

The world runs on data. Data scientists organize and make sense of a barrage of information, synthesizing and translating it so people can understand it. They drive the innovation and decision-making process for many organizations. But the quality of the data they use can greatly influence the accuracy of their findings, which directly impacts business outcomes and operations. That’s why data scientists must follow strong quality assurance practices.

Read on for seven practices that can help data scientists achieve better outcomes.

Thoughts on Linear Regression

John Mount shares some thoughts:

I want to spend some time thinking out loud about linear regression.

As a data science consultant and teacher, I spend a lot of time using linear regression and teaching linear regression. I have found each of these pursuits can degenerate into mere doctrine or instructions: “do this,” “expect this,” “don’t do that,” “you should know,” and so on. What I want to do here is take a step back and think out loud about linear regression from first principles. To attempt this, I am going to start with the problem linear regression solves, and try to delay getting to the things so important that “everybody should know them without question.” So let’s think about a few things in a particular order.

For thinking out loud, this is laid out rather well, so give it a read.
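
To see the underlying problem in its most basic form, here is a quick sketch of my own (not Mount's code): ordinary least squares with NumPy, finding the coefficients that minimize the sum of squared errors.

```python
import numpy as np

# Toy data: y is roughly 2*x + 1 plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=50)

# The problem linear regression solves: find beta minimizing
# ||X @ beta - y||^2, where X holds an intercept column and the predictor.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print("intercept, slope:", beta)  # should be near (1, 2)
print("residual sum of squares:", np.sum((X @ beta - y) ** 2))
```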

The Story behind Benford’s Law

John Cook gives us a dose of history and math:

In 1881, astronomer Simon Newcomb noticed something curious. The first pages in books of logarithms were dirty on the edge, while later pages became progressively cleaner. He inferred from this that people more often looked up the logarithms of numbers with small leading digits than with large leading digits.

Why might this be? One might reasonably expect the numbers that came up in work to be uniformly distributed. But as is often the case, it helps to ask “Uniform on what scale?”

Read on for a bit more of the story behind the law Newcomb discovered decades before Benford, plus a just-so story about differing bases.
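
To make the “uniform on what scale?” point concrete, here is a small simulation of my own: numbers drawn uniformly on a log scale have leading digits that follow Benford’s distribution, P(d) = log10(1 + 1/d).

```python
import math
import random

random.seed(1)

def leading_digit(x: float) -> int:
    """First significant digit of a positive number."""
    return int(x / 10 ** math.floor(math.log10(x)))

# Uniform on a log scale: 10**u for u uniform on [0, 6).
n = 100_000
counts = {d: 0 for d in range(1, 10)}
for _ in range(n):
    counts[leading_digit(10 ** random.uniform(0, 6))] += 1

for d in range(1, 10):
    benford = math.log10(1 + 1 / d)  # Benford's predicted frequency
    print(f"{d}: observed {counts[d] / n:.3f}  Benford {benford:.3f}")
```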

Generating Nested Time Series Models

Steven Sanderson can’t stop at just one time series:

There are many approaches to modeling time series data in R. One of the types of data we might come across is a nested time series, meaning the data is grouped by one or more keys. There are many methods for accomplishing this task. This will be a quick post, but if you want a longer, more detailed, and frankly better-written one, then this is a really good article.

The quick post doesn’t include a lot of commentary but does show the code you’d use for the operation.
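
The article is R-centric, but the core idea carries over to any language: group by key and fit one model per group. Here is a hypothetical pandas sketch (the store keys and the linear-trend “model” are mine, purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical nested data: two years of monthly values per store key.
dates = pd.date_range("2022-01-01", periods=24, freq="MS")
df = pd.concat(
    pd.DataFrame({
        "store": key,
        "date": dates,
        "value": 100 + 3 * np.arange(24) + rng.normal(scale=5, size=24),
    })
    for key in ["A", "B", "C"]
)

# Nesting in miniature: fit one model per key. A plain linear trend
# stands in for whatever forecasting model you would actually use.
def fit_trend(group: pd.DataFrame) -> pd.Series:
    t = np.arange(len(group))
    slope, intercept = np.polyfit(t, group["value"], deg=1)
    return pd.Series({"intercept": intercept, "slope": slope})

models = pd.DataFrame({key: fit_trend(g) for key, g in df.groupby("store")}).T
print(models)
```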

Calculating Log Likelihood Ratios with jeva

Peter M.B. Cahusac takes us through a jamovi package:

Ever wanted to try doing an evidential analysis? You may have found it difficult to find a statistical platform to do it. Now there is the jamovi module jeva, which can provide log likelihood ratios for a range of common statistical tests.

Imagine for a moment that we wish to carry out a statistical test on our sample of data. We do not want to know whether the procedure we routinely use gives us the correct answer with a specified error rate (such as the Type I error) – the frequentist approach. Nor do we want to concern ourselves with possible a priori probabilities of hypotheses being true – the Bayesian approach. We need to know whether a statistic from this particular set of data is consistent with one or more hypothetical values. Also, let’s say that we weren’t happy with how much data we had collected (a familiar problem?), and just added more when convenient. Welcome to the likelihood (or evidential) approach!

Read on for an explanation and how to try jeva out.
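
If you want a feel for the arithmetic before opening jamovi, here is a small sketch of my own (not the jeva module): under a normal model, compare how well two hypothesized means explain the same sample.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
data = rng.normal(loc=101.5, scale=4.0, size=30)  # hypothetical sample

# Log likelihood of the sample for a hypothesized mean mu,
# treating the standard deviation as known.
def log_lik(mu: float, sd: float = 4.0) -> float:
    return norm.logpdf(data, loc=mu, scale=sd).sum()

# Support for H1 (mu = 102) over H0 (mu = 100) is the log of the
# likelihood ratio; positive values favor H1.
support = log_lik(102.0) - log_lik(100.0)
print(f"log likelihood ratio, H1 vs H0: {support:.2f}")
```

Note how this fits the quote’s point about adding data when convenient: the support is simply recomputed from whatever data you have, with no penalty schedule for peeking.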

Estimating Quantiles in Python

Christian Lorentzen digs into quantile calculation:

Applied statistics is dominated by the ubiquitous mean. For a change, this post is dedicated to quantiles. I will do my best to provide a good mix of theory and practical examples.

While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They appear in box plots, in children’s weight-for-age curves, in salary survey results, in risk measures like the value-at-risk in the EU-wide Solvency II framework for insurance companies, in quality control, and in many more fields.

There are easy functions to calculate quantiles in R and Python; this post serves as a way of understanding the variety of quantile functions available and how they can affect results with small sample sizes.
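
As a taste of that variety, NumPy alone ships several estimation methods (the method argument requires NumPy 1.22 or later; older versions call it interpolation), and on a small sample they disagree noticeably:

```python
import numpy as np

# A deliberately small sample, where the estimator choice matters most.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

for method in ["linear", "lower", "higher", "median_unbiased"]:
    q = np.quantile(x, 0.9, method=method)
    print(f"90th percentile, method={method!r}: {q}")
```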

A Primer on Stan

Jack Kennedy explains the concepts of Stan and JAGS:

You may have used a probabilistic programming language (PPL) in the past, such as BUGS, to perform Bayesian inference. You’ve heard about Stan and want to learn a little more. Or maybe you’re about to step into the Bayesian paradigm and don’t know where to start. You want to know whether you should make the switch from JAGS to Stan, or you’ve used neither JAGS nor Stan and want to know which will suit you best. This post will focus solely on the differences between JAGS and Stan, as I have experience with both of them, but there are many more PPLs out there. For example, I have never used Bean Machine, but of all the PPLs, it certainly takes the crown for best name.

Stan has been on my to-learn list for a while and I did successfully get one of my employees (a rassa-frassin’ frequentist) to use and enjoy the power of Bayesian analysis. One of these days, I’ll have to get back to it.
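
For the curious, here is roughly what a “hello world” looks like when driving Stan from Python with cmdstanpy, assuming CmdStan is installed; the coin-flip model and data are mine, not Kennedy's.

```python
import pathlib
import tempfile

from cmdstanpy import CmdStanModel  # assumes CmdStan is installed

# A minimal Bernoulli model, just to show Stan's flavor.
stan_code = """
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(1, 1);    // flat prior on the success probability
  y ~ bernoulli(theta);  // likelihood
}
"""

stan_file = pathlib.Path(tempfile.mkdtemp()) / "bernoulli.stan"
stan_file.write_text(stan_code)

model = CmdStanModel(stan_file=str(stan_file))
fit = model.sample(data={"N": 10, "y": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]})
print(fit.summary().loc["theta"])
```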

Approximation with the Mediant

John Cook didn’t make a typo:

Suppose you are trying to approximate some number x and you’ve got it sandwiched between two rational numbers:

a/b < x < c/d.

Now you’d like a better approximation. What would you do?

The obvious approach would be to take the average of a/b and c/d. That’s fine, except it could be a fair amount of work if you’re doing this in your head.

Read on for an alternative approach: taking the mediant (not the median) of the two fractions.
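
The mediant of a/b and c/d is (a + c)/(b + d): add the numerators, add the denominators. For positive denominators it always lands strictly between the two fractions, so it can tighten a sandwich without any common-denominator arithmetic. A quick sketch of my own, squeezing rational bounds around sqrt(2):

```python
import math
from fractions import Fraction

target = math.sqrt(2)

# Start with a sandwich: 1/1 < sqrt(2) < 2/1.
lo, hi = Fraction(1, 1), Fraction(2, 1)

for _ in range(10):
    # The mediant lies strictly between lo and hi, so it replaces
    # whichever bound it improves.
    med = Fraction(lo.numerator + hi.numerator,
                   lo.denominator + hi.denominator)
    if med < target:
        lo = med
    else:
        hi = med

print(f"{lo} < sqrt(2) < {hi}")
```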

Q&A on Data Engineering

Dustin Vannoy talks to the mirror:

An aspiring data engineer recently reached out to me for some guidance on pivoting into the field from a software development background. The questions they asked are similar to what others have asked me in the past, so I decided to capture my responses here. I link to prior posts and other resources when possible to try to keep the responses brief. These are informal thoughts of mine, not something I have sat down to rethink and research for new ideas beyond what is already in my head.

Dustin is one of the best people to talk to about data engineering. Click through for his advice.

K-Fold Cross-Validation in Python

Shanthababu Pandian gives us a primer on k-fold cross-validation:

In each fold, training and testing are performed exactly once over the entire process, which helps us avoid overfitting. As we know, a model trained on all of the data in a single shot can show the best apparent accuracy; resisting that temptation is how k-fold cross-validation helps us build a more generalized model.

To achieve k-fold cross-validation, we have to split the data set into training, testing, and validation sets, which is a challenge when the volume of data is limited.

Read on for the explanation and an example.
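
For a minimal runnable version of the idea (my example, not from the article), scikit-learn's KFold and cross_val_score handle the bookkeeping so that each observation lands in the test fold exactly once:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Five folds: each observation is used for testing exactly once
# and for training in the other four folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:", round(float(scores.mean()), 3))
```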
