Press "Enter" to skip to content

Category: Data Science

The Monty Hall Problem

I have a new video:

In this video, I explain the classic Monty Hall problem, based on the concept of the show Let’s Make a Deal. I explain the paradox behind the problem and demonstrate that it’s better to switch doors.

I’m not joking at all when I say it took me years of listening to explanations before it actually clicked. Some of it is my innate stubbornness, but I think this is a great example of a true paradox, where the intuitive answer is wrong and first-level reasoning also leads you astray.

Leave a Comment

Data Splitting and Cross-Validation in R

Nick Han has a pair of articles. First up is on data splitting and pre-processing:

Data preprocessing is a crucial step in any machine learning workflow. It ensures that your data is clean, consistent, and ready for modeling. In this blog post, we’ll walk through the process of splitting and preprocessing data in R, using the rsample package for data splitting and saving the results for future use.

H/T R-Bloggers for that one.

The second involves using cross-validation via the caret package in R:

Cross-validation is a resampling technique used to assess the performance and generalizability of machine learning models. It helps address issues like overfitting and ensures that the model’s performance is consistent across different subsets of the data. By splitting the data into multiple folds and repeating the process, cross-validation provides a robust estimate of model performance.

H/T R-Bloggers for that as well.

Leave a Comment

The Basics of Bayes’ Theorem

I have a new video:

In this video, I provide an introduction to Bayes’ theorem, explaining the key concepts and terms, as well as solving a totally realistic problem via Bayesian analysis.

My goal in this video was to explain a counter-intuitive phenomenon: how much a positive test or piece of information moves the needle depends primarily on how frequently the event normally happens and how frequently we generate false positives in tests. The less common the scenario, the less your positive test actually moves the needle.

Leave a Comment

Converting a CSV to Parquet with DuckDB and Polars in R

Michael Mayer makes a swap:

In this recent post, we have used Polars and DuckDB to convert a large CSV file to Parquet in steaming mode – and Python.

Different people have contacted me and asked: “and in R?”

Simple answer: We have DuckDB, and we have different Polars bindings. Here, we are using {polars} which is currently being overhauled into {neopandas}.

Click through for the comparison.

Leave a Comment

Calculating a Matrix Inversion in SQL Server

Sebastiao Pereira performs matrix math in-database:

There are numerous applications to obtain a Matrix inverse for a given Matrix. Is it possible to do it using only SQL Server? Read on to learn how to build a matrix inverse calculator using a set of SQL Server custom functions.

I expect this to be extremely slow in comparison to GPU-based methods using a language like C, but this approach maximizes style points.

Leave a Comment

Solving Linear Equations in SQL Server

Sebastiao Pereira implements a function:

Solving linear equations is essential for solving real-world problems in Science, Engineering, Data Analysis, Machine Learning, Economics, Finance, and other areas. Is it possible to have a tool to solve linear equations directly in SQL Server? We will look at how to create a Gauss-Seidel method function for SQL Server.

This is one way to solve a series of linear equations, and it’s a pretty neat implementation.

Comments closed

Bass Product Diffusion and Data Science

John Mount does a fun analysis:

This is a graph of the percentage of Stack Overflow questions tagged with data science terms such as R, Pandas, and so on. It seems to show exploding interest in R and Pandas, and maybe even Tensorflow. Pandas was likely chosen as a proxy for interest in Python for data science (versus a general interest in Python). I’d prefer view counts over question percentages as a proxy of interest, but it is what it is.

Then I thought, let’s see if they have newer data. They do, and it is horrifying (though not unexpected to those of us in the industry).

Click through for the analysis, as well as an important note in the comments.

Comments closed

Inflation in Medieval China

Richard Vale digs into a dataset:

In this post, I would like to draw attention to a very interesting data set collected by Guan, Palma and Wu as part of the replication package for their paper The rise and fall of paper money in Yuan China, 1260-1368. The paper describes inflation, money and prices during the Yuan Dynasty era in China.

First, a little historical background.

Read on for the analysis. H/T R-Bloggers.

Comments closed