Press "Enter" to skip to content

Category: Data Science

Solving Linear Equations in SQL Server

Sebastiao Pereira implements a function:

Solving linear equations is essential for solving real-world problems in Science, Engineering, Data Analysis, Machine Learning, Economics, Finance, and other areas. Is it possible to have a tool to solve linear equations directly in SQL Server? We will look at how to create a Gauss-Seidel method function for SQL Server.

This is one way to solve a series of linear equations, and it’s a pretty neat implementation.

Leave a Comment

Bass Product Diffusion and Data Science

John Mount does a fun analysis:

This is a graph of the percentage of Stack Overflow questions tagged with data science terms such as R, Pandas, and so on. It seems to show exploding interest in R and Pandas, and maybe even Tensorflow. Pandas was likely chosen as a proxy for interest in Python for data science (versus a general interest in Python). I’d prefer view counts over question percentages as a proxy of interest, but it is what it is.

Then I thought, let’s see if they have newer data. They do, and it is horrifying (though not unexpected to those of us in the industry).

Click through for the analysis, as well as an important note in the comments.

Comments closed

Inflation in Medieval China

Richard Vale digs into a dataset:

In this post, I would like to draw attention to a very interesting data set collected by Guan, Palma and Wu as part of the replication package for their paper The rise and fall of paper money in Yuan China, 1260-1368. The paper describes inflation, money and prices during the Yuan Dynasty era in China.

First, a little historical background.

Read on for the analysis. H/T R-Bloggers.

Comments closed

An Overview of HyperLogLog

Bhala Ranganathan talks about a powerful algorithm:

Cardinality is the number of distinct items in a dataset. Whether it’s counting the number of unique users on a website or estimating the number of distinct search queries, estimating cardinality becomes challenging when dealing with massive datasets. That’s where the HyperLogLog algorithm comes into the picture. In this article, we will explore the key concepts behind HyperLogLog and its applications.

HyperLogLog is the algorithm that SQL Server users in the APPROX_COUNT_DISTINCT() function to make it so much faster than a regular COUNT(DISTINCT) while still providing correctness guarantees within a fixed percentage error: they guarantee a 2% or lower error rate with a 97% probability.

Comments closed

Prevalence Adjustment in Binary Classifiers

David Lindelöf deal with an issue in analyzing classification models:

When you run a binary classifier over a population you get an estimate of the proportion of true positives in that population. This is known as the prevalence.

But that estimate is biased, because no classifier is perfect. 

Read on to learn what this means for precision, as well as one technique for tracking prevalence changes over itme.

Comments closed

The Power of One Data Point

I have a new video:

In this video, I demonstrate how much information we can gain from one sample of a distribution.

Some aspect of this is “that’s a neat parlor trick” but it does speak to the marginal information gain of a small amount of data.

Comments closed

Calculating Inter-Quartile Range and Z Score in T-SQL

Sebastiao Pereira hunts for outliers:

Outliers can significantly distort statistical analysis and lead to incorrect conclusions when interpreting data. In this article, we will look at how to find outliers in SQL Server using various T-SQL queries. Understanding how to find outliers in SQL is crucial for accurate data analysis.

Sebastiao uses PERCENTILE_CONT() in this demonstration. That works fine for relatively small tables, though it does not scale well at all. Once you’re in the millions of records, it gets slow. From there, my joke is that, if you have 100 million or more records, you can start a query with PERCENTILE_CONT() on one instance. Meanwhile, on a separate instance, as soon as you kick off that query, go install SQL Server ML Services, configure it, check out a tutorial on R or Python, figure out how you can calculate the inter-quartile range in that language, learn how ML Services works, and you’ll still get the answer before your first query finishes.

If you’re using SQL Server 2022, there is a new APPROX_PERCENTILE_CONT() that is orders of magnitude faster as you get increasingly large datasets. It’s also accurate to within 1.33% (on each side of the correct answer) within a 99% confidence. The way the query works is a bit different, though, because the approximation is a nested set function using a WITHIN GROUP() clause, whereas PERCENTILE_CONT() is a window function that uses an OVER() clause. That means it’s not quite as easy as slapping “APPROX_” to the start of the query, but because Sebastiao uses WITHIN GROUP in the T-SQL, it’s pretty close: PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY [ObsValue]) OVER() AS Q1 becomes APPROX_PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY [ObsValue]) AS Q1 or something like that–I’m compiling in the browser here.

Comments closed