Press "Enter" to skip to content

Category: Data Science

Distribution Parameter Wrangling in TidyDensity

Steven Sanderson introduces a new set of functions:

Greetings, fellow data enthusiasts! Today, we’re thrilled to unveil a fresh wave of functionalities in the ever-evolving TidyDensity package. Buckle up, as we delve into the realm of distribution statistics!

This update brings a bounty of new functions that streamline the process of extracting key parameters from various probability distributions. These functions adhere to the familiar naming convention util_distribution_name_stats_tbl(), making them easily discoverable within your R workflow.

Read on for the list and an example of how to use them.

Comments closed

Book Review of Bernoulli’s Fallacy

John Mount reviews a book:

First the conclusion: this is a well researched and important book. My rating is a strong buy, and Bernoulli’s Fallacy is already influencing how I approach my work.

My initial “judge the book by its back cover” impression of Bernoulli’s Fallacy was negative. The back cover writes some very large checks that I was initially (and wrongly) doubtful that “its fists could cash.” The thesis is that frequentist statistics (the dominant statistical practice) is far worse than is publicly admitted, and that Bayesian methods are the fix. However, other reviews and the snippets by people I respect (such as Andrew Gelman and Persi Diaconis) convinced me to buy and read the book. And I am glad that I read it. The back cover was, in my revised opinion, fully justified.

Read on for John’s full review of a book that is quite critical of frequentist statistics in favor of Bayesian statistics—so that already makes the book a winner for me.

Comments closed

An Overview of Logistic Regression

I have a new video:

In this video, I provide a primer on logistic regression, including a demystification of the name. Is it regression? Is it classification? Find out!

I have a lot of fun with the question “Is logistic regression actually a regression technique, or is it secretly a classification technique?” I think this video is the single clearest explanation I’ve given on that question, which probably says something about my prior explanations.
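
To make the distinction concrete, here is a minimal Python sketch (mine, not from the video, using scikit-learn on a synthetic dataset): the model fits a regression on the log-odds scale, and the classification only appears once you threshold the resulting probabilities.

```python
# A minimal sketch (not from the video): logistic regression is a regression on the
# log-odds scale; classification only happens when you threshold the probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=42)
model = LogisticRegression().fit(X, y)

# The "regression" part: a linear model for the log-odds of the positive class.
log_odds = X @ model.coef_.ravel() + model.intercept_[0]
probs = 1 / (1 + np.exp(-log_odds))          # same values as model.predict_proba(X)[:, 1]

# The "classification" part: an after-the-fact threshold (0.5 by default),
# which is effectively what model.predict(X) returns.
labels = (probs >= 0.5).astype(int)
print(probs[:5].round(3), labels[:5])
```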

Comments closed

Fuzzy Search and Levenshtein Distance

Hoen Nguyen explains a couple of terms:

In the world of search engines and data retrieval, achieving high accuracy and relevance in the results is a constant challenge. One of the techniques used to improve search results is Fuzzy Search.

This blog post will delve into the concept of fuzzy search, its implementation using the Levenshtein Distance, and how to test its effectiveness.

Levenshtein distance is also one of the techniques spell checkers use, comparing a word that isn’t in the dictionary to other words within a certain edit distance.
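
As a rough illustration of the idea (not Hoen’s implementation), here is a short, self-contained Python sketch of Levenshtein distance via dynamic programming, plus a toy spell-checker-style fuzzy lookup; the function names and example words are just placeholders.

```python
# A small sketch of Levenshtein distance (dynamic programming, rolling rows) and a
# toy fuzzy lookup against a dictionary; names and examples are illustrative only.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_suggestions(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance edits, closest first."""
    scored = [(levenshtein(word, w), w) for w in dictionary]
    return [w for d, w in sorted(scored) if d <= max_distance]

print(levenshtein("kitten", "sitting"))   # 3
print(fuzzy_suggestions("recieve", ["receive", "recipe", "relieve"]))
```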

Comments closed

Generating Data in SQL Server Based on Distributions

Rick Dobson builds some data:

I support a data science team that often asks for datasets with different distribution values in uniform, normal, or lognormal shapes. Please present and demonstrate the T-SQL code for populating datasets with random values from each distribution type. I also seek graphical and statistical techniques for assessing how a random sample corresponds to a distribution type.

This is an interesting article, though if you want a set-based version of generating data according to a normal distribution, I have a blog post where I translated the RBAR version into something that performs a bit better. Converting to log-normal form also makes a lot of intuitive sense.
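
For a quick sense of what those three shapes look like, here is a small NumPy sketch (not the T-SQL from the article) that draws uniform, normal, and lognormal samples in a set-based way; the parameter values are made up for illustration.

```python
# Not the T-SQL from the article -- just a set-based NumPy illustration of drawing
# the three distribution shapes the data science team asked for.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

uniform_vals   = rng.uniform(low=0.0, high=100.0, size=n)
normal_vals    = rng.normal(loc=50.0, scale=10.0, size=n)
# A lognormal sample is just exp() of a normal sample, which is why converting
# from the normal case to log-normal form follows so naturally.
lognormal_vals = rng.lognormal(mean=np.log(50.0), sigma=0.25, size=n)

for name, vals in [("uniform", uniform_vals),
                   ("normal", normal_vals),
                   ("lognormal", lognormal_vals)]:
    print(f"{name:>9}: mean={vals.mean():8.2f}  sd={vals.std():7.2f}")
```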

Comments closed

Reviewing Experimental Results in the Process

John Cook talks philosophy of statistics:

Suppose you’re running an A/B test to determine whether a web page produces more sales with one graphic versus another. You plan to randomly assign image A or B to 1,000 visitors to the page, but after only randomizing 500 visitors you want to look at the data. Is this OK or not?

John also has a follow-up article:

Suppose you design an experiment, an A/B test of two page designs, randomizing visitors to Design A or Design B. You planned to run the test for 800 visitors and you calculated some confidence level α for your experiment.

You decide to take a peek at the data after only 300 randomizations, even though your statistician warned you in no uncertain terms not to do that. Something about alpha spending.

You can’t unsee what you’ve seen. Now what?

Read on for a very interesting discussion of the topic. I’m definitely in the Bayesian camp: learn quickly, update frequently, particularly early on when you have little information on the topic and the marginal value of learning one additional piece of information is so high.
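
To sketch what that Bayesian stance looks like in practice (with entirely made-up interim numbers, not anything from John’s posts), here is a Beta-Binomial update after an early peek: the posterior is simply whatever the data so far justifies, whether you report it at 300 visitors or at 800.

```python
# A minimal sketch (made-up numbers): the Beta posterior after 300 randomizations is a
# legitimate summary of what you know so far; peeking just means reporting it early.
import numpy as np
from scipy import stats

# Hypothetical interim data after 300 visitors: conversions / assignments per design.
conv_a, n_a = 21, 150
conv_b, n_b = 33, 150

# Beta(1, 1) priors updated with the observed successes and failures.
post_a = stats.beta(1 + conv_a, 1 + n_a - conv_a)
post_b = stats.beta(1 + conv_b, 1 + n_b - conv_b)

# Monte Carlo estimate of P(design B beats design A) given everything seen so far.
draws = 100_000
p_b_better = np.mean(post_b.rvs(draws) > post_a.rvs(draws))
print(f"P(B converts better than A | data so far) ~= {p_b_better:.2f}")
```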

Comments closed

Handling Imbalanced Data in Classification Algorithms

Matthew Mayo shares a few tips:

Imperfect data is the norm rather than the exception in machine learning. Comparably common is binary class imbalance, where the classes in the training data split into a dominant majority class and a small minority class, or are otherwise moderately skewed. Imbalanced data can undermine a machine learning model by producing model selection biases. Therefore, in the interest of model performance and equitable representation, solving the problem of imbalanced data during training and evaluation is paramount.

This article will define imbalanced data, resampling strategies as a solution, appropriate evaluation metrics, kinds of algorithmic approaches, and the utility of synthetic data and data augmentation to address this imbalance.

Read on for five recommendations, starting with what you should know and then offering up four options for what you can do.
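
As a quick illustration of two of those options (an imbalance-aware algorithm setting and an imbalance-aware metric), here is a scikit-learn sketch on a synthetic 95/5 dataset; it is not code from the article, and the specific numbers are arbitrary.

```python
# A sketch of two levers for imbalanced classification: class weights and a metric
# that doesn't hide the minority class, on a synthetic 95/5 dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05],
                           flip_y=0.02, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for label, model in [
    ("unweighted", LogisticRegression(max_iter=1_000)),
    ("class_weight='balanced'", LogisticRegression(max_iter=1_000, class_weight="balanced")),
]:
    preds = model.fit(X_tr, y_tr).predict(X_te)
    # Accuracy looks great either way on a 95/5 split; minority-class F1 tells the real story.
    print(f"{label:>24}: accuracy={accuracy_score(y_te, preds):.3f}  "
          f"minority F1={f1_score(y_te, preds):.3f}")
```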

Comments closed

Cross-Correlation of Time Series to Identify Time Lags in SAS

Kevin Scott and David Frede notice the pattern:

Batch manufacturing involves producing goods in batches rather than in a continuous stream. This approach is common in industries such as pharmaceuticals, chemicals, and materials processing, where precise control over the production process is essential to ensure product quality and consistency. One critical aspect of batch manufacturing is the need to manage and understand inherent time delays that occur at various stages of the process.

In the glass manufacturing industry, which operates under the principles of batch manufacturing, precisely controlling the furnace temperature is essential for producing high-quality glass. The process involves melting raw materials like silica sand, soda ash, and limestone at high temperatures, where maintaining the correct temperature is crucial.

Read on to see an example of how you can automate the identification of a time lag using cross-correlation techniques.
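
If you want the same idea outside of SAS, here is a small NumPy sketch (mine, with a synthetic signal and an arbitrary 12-step lag) showing how the peak of the cross-correlation recovers a known time lag.

```python
# Not the SAS from the article -- a NumPy sketch of the same idea: create a downstream
# series that lags an upstream one by a known amount, then recover that lag from the
# peak of the cross-correlation.
import numpy as np

rng = np.random.default_rng(0)
n, true_lag = 500, 12

upstream = rng.normal(size=n)                                   # synthetic upstream signal
downstream = np.empty(n)
downstream[true_lag:] = upstream[:-true_lag] + rng.normal(scale=0.3, size=n - true_lag)
downstream[:true_lag] = rng.normal(scale=0.3, size=true_lag)    # nothing to inherit yet

# Cross-correlate the de-meaned series; index n - 1 of the "full" output is lag 0.
u = upstream - upstream.mean()
d = downstream - downstream.mean()
xcorr = np.correlate(d, u, mode="full")
lags = np.arange(-(n - 1), n)
estimated_lag = lags[np.argmax(xcorr)]
print(f"true lag = {true_lag}, estimated lag = {estimated_lag}")
```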

Comments closed