Data Science – Page 8

Inflation in Medieval China

Published 2025-02-10 by Kevin Feasel

In this post, I would like to draw attention to a very interesting data set collected by Guan, Palma and Wu as part of the replication package for their paper The rise and fall of paper money in Yuan China, 1260-1368. The paper describes inflation, money and prices during the Yuan Dynasty era in China.

First, a little historical background.

Read on for the analysis. H/T R-Bloggers.

Comments closed

Running a Regression Analysis in T-SQL

Published 2025-02-10 by Kevin Feasel

Sebastiao Pereira performs an analysis:

Regression analysis is a statistical technique to identify relationships between a dependent variable (outcome) and one or more independent variables (features). Is it possible to do this in SQL Server without using any external tools?

Read on for the answer. I’d file this under “neat but probably not something I’d want to rely upon.”

Comments closed

An Overview of HyperLogLog

Published 2025-01-27 by Kevin Feasel

Bhala Ranganathan talks about a powerful algorithm:

Cardinality is the number of distinct items in a dataset. Whether it’s counting the number of unique users on a website or estimating the number of distinct search queries, estimating cardinality becomes challenging when dealing with massive datasets. That’s where the HyperLogLog algorithm comes into the picture. In this article, we will explore the key concepts behind HyperLogLog and its applications.

HyperLogLog is the algorithm that SQL Server users in the APPROX_COUNT_DISTINCT() function to make it so much faster than a regular COUNT(DISTINCT) while still providing correctness guarantees within a fixed percentage error: they guarantee a 2% or lower error rate with a 97% probability.

Comments closed

The Distribution of P-Values under the Null Hypothesis

Published 2025-01-24 by Kevin Feasel

David Lindelöf asks a question:

I sometimes use this fun interview question for aspiring data scientists:

How are p-values distributed assuming the null hypothesis is true?

Read on for four incorrect answers, followed by the actual answer. I actually got the answer right, but in the sloppiest and laziest way. But hey, it was still the right answer. H/T R-Bloggers.

Comments closed

Prevalence Adjustment in Binary Classifiers

Published 2025-01-09 by Kevin Feasel

David Lindelöf deal with an issue in analyzing classification models:

When you run a binary classifier over a population you get an estimate of the proportion of true positives in that population. This is known as the prevalence.

But that estimate is biased, because no classifier is perfect.

Read on to learn what this means for precision, as well as one technique for tracking prevalence changes over itme.

Comments closed

The Power of One Data Point

Published 2025-01-08 by Kevin Feasel

I have a new video:

In this video, I demonstrate how much information we can gain from one sample of a distribution.

Some aspect of this is “that’s a neat parlor trick” but it does speak to the marginal information gain of a small amount of data.

Comments closed

Calculating Inter-Quartile Range and Z Score in T-SQL

Published 2025-01-07 by Kevin Feasel

Sebastiao Pereira hunts for outliers:

Outliers can significantly distort statistical analysis and lead to incorrect conclusions when interpreting data. In this article, we will look at how to find outliers in SQL Server using various T-SQL queries. Understanding how to find outliers in SQL is crucial for accurate data analysis.

Sebastiao uses PERCENTILE_CONT() in this demonstration. That works fine for relatively small tables, though it does not scale well at all. Once you’re in the millions of records, it gets slow. From there, my joke is that, if you have 100 million or more records, you can start a query with PERCENTILE_CONT() on one instance. Meanwhile, on a separate instance, as soon as you kick off that query, go install SQL Server ML Services, configure it, check out a tutorial on R or Python, figure out how you can calculate the inter-quartile range in that language, learn how ML Services works, and you’ll still get the answer before your first query finishes.

If you’re using SQL Server 2022, there is a new APPROX_PERCENTILE_CONT() that is orders of magnitude faster as you get increasingly large datasets. It’s also accurate to within 1.33% (on each side of the correct answer) within a 99% confidence. The way the query works is a bit different, though, because the approximation is a nested set function using a WITHIN GROUP() clause, whereas PERCENTILE_CONT() is a window function that uses an OVER() clause. That means it’s not quite as easy as slapping “APPROX_” to the start of the query, but because Sebastiao uses WITHIN GROUP in the T-SQL, it’s pretty close: PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY [ObsValue]) OVER() AS Q1 becomes APPROX_PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY [ObsValue]) AS Q1 or something like that–I’m compiling in the browser here.

Comments closed

Using complete.cases in R

Published 2024-12-20 by Kevin Feasel

Steven Sanderson has no time for missing data:

Data analysis in R often involves dealing with missing values, which can significantly impact the quality of your results. The complete.cases function in R is an essential tool for handling missing data effectively. This comprehensive guide will walk you through everything you need to know about using complete.cases in R, from basic concepts to advanced applications.

Using complete.cases to find observations with missing values is great. Using it to eliminate observations with missing values can sometimes be helpful, depending on just how many missing values you have.

Comments closed

Cosine Similarity in Power Query

Published 2024-12-19 by Kevin Feasel

John Kerski searches for similar sets:

I’ll admit upfront—I am not a data scientist by trade. Instead, I’ve picked up my data science skills over time, learning through a combination of osmosis from talented colleagues and tackling real-world data challenges. It’s been a journey of trial, error, and refinement, as I’ve worked to bridge gaps between complex data science techniques and tools available to me.

Recently, my skills were put to the test when I needed to compare hundreds of Active Directory and SharePoint Groups to find similarities in their memberships. With only Power Query available in the production environment, no Python or R to ease the process, I faced the task of finding a method to finding similarities from scratch in Power Query. In this guide, I’ll walk you through the solution I developed, highlighting the steps that made it possible.

John came up with a very clever solution. By the way, the way I like to explain cosine similarity (as a concept, not the algorithm itself) is as follows.

Back in high school physics, you probably drew vectors and learned that vectors have a direction and a magnitude (length). We drew vectors in two-dimensional space because that’s easy: it’s a line on a sheet of paper and there’s an arrow at the end to denote the direction of that vector. Conceptually, vectors with more than two dimensions behave exactly the same; the difference is that we cannot simply draw them, especially once we get past three-dimensional space (a vector with three elements). But the concept is still there: every vector has a direction and a magnitude.

We use cosine similarity to compare two vectors and see how close those two vectors are in terms of angle (direction), with the idea being that magnitude isn’t as important as angle for determining vector similarity. This is in contrast to another technique like Euclidean distance, which focuses more on the magnitude of the vectors versus angle.

Comments closed

Building and Deploying a Streamlit Data App

Published 2024-12-17 by Kevin Feasel

Ivan Palomares Carrascosa deploys an app:

This article will navigate you through the deployment of a simple machine learning (ML) for regression using Streamlit. This novel platform streamlines and simplifies deploying artifacts like ML systems as Web services.

I’ll leave aside my aside that linear regression isn’t machine learning. Click through to see how you can build a simple application in approximately 60 lines of code. This example shows off some of the simplicity in Streamlit’s design.

Comments closed

Category: Data Science