Press "Enter" to skip to content

Category: Data Science

Bootstrapping in TidyDensity

Steven Sanderson pulls us up by the bootstraps:

Imagine this: You have a dataset, say, car mileage (MPG) from the classic mtcars dataset. You want to understand the average MPG, but what if that average is just a mirage? What if it’s skewed by a few outliers or doesn’t capture the full story?

Enter bootstrapping, a statistical technique that’s like taking your data on a wild ride. It creates multiple copies of your data, each with a slight twist, and then calculates the statistic you’re interested in (e.g., average MPG) for each copy. This gives you a distribution of possible averages, revealing the variability and potential biases lurking beneath the surface.

Read on to learn more about bootstrapping in general and how to use the bootstrap_stat_plot() function in TidyDensity.
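
If you want to see the core idea with nothing hidden, here is a minimal sketch of bootstrapping the mean MPG from mtcars in base R (this shows the general technique, not the TidyDensity interface, so check the post for bootstrap_stat_plot() itself):

# Bootstrap the mean MPG from mtcars: resample the data with replacement
# many times and compute the mean of each resample
set.seed(123)
boot_means <- replicate(2000, mean(sample(mtcars$mpg, replace = TRUE)))

# The spread of the bootstrap means shows how uncertain the sample mean is
mean(boot_means)
quantile(boot_means, c(0.025, 0.975))  # a simple percentile interval
hist(boot_means, main = "Bootstrap distribution of mean MPG")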

tidyAML Updates

Steven Sanderson has been busy. First up, a post on tidyAML updates:

One of the standout features in this release is the addition of extract_regression_residuals(). This function empowers users to delve deeper into regression models, providing a valuable tool for analyzing and understanding residuals. Whether you’re fine-tuning your models or gaining insights into data patterns, this enhancement adds a crucial layer to your analytical arsenal.

Then, Steven goes into detail on .drop_na:

In the newest release of tidyAML there has been an addition of a new parameter to the functions fast_classification() and fast_regression(). The parameter is .drop_na and it is a logical value that defaults to TRUE. This parameter is used to determine if the function should drop rows with missing values from the output if a model cannot be built for some reason. Let’s take a look at the function and its arguments.
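
Roughly, the new parameter slots in like this (a sketch; .drop_na comes straight from the post, but the other argument names here are my assumptions about fast_regression()'s interface, so verify them against the tidyAML documentation):

library(tidyAML)
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars)

# Setting .drop_na = FALSE keeps the rows for models that could not be
# built; the default TRUE drops them from the output
frt <- fast_regression(
  .data = mtcars,
  .rec_obj = rec,
  .parsnip_eng = c("lm", "glm"),
  .drop_na = FALSE
)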

After that, we get to see an updated function:

In response to user feedback, we’ve enhanced the internal_make_wflw_predictions() function to provide a comprehensive set of predictions. Now, when you make a call to this function, it includes:

  1. The Actual Data: This is the real-world data that your model aims to predict. Having access to this information helps you assess how well your model is performing on unseen instances.
  2. Training Predictions: Predictions made on the training dataset. This is essential for understanding how well your model generalizes to the data it was trained on.
  3. Testing Predictions: Predictions made on the testing dataset. This is crucial for evaluating the model’s performance on data it hasn’t seen during the training phase.

You can also check out the package’s GitHub repository and see more.

An Overview of Clustering Techniques in R

Peter Laurinec gives us an overview:

Clustering is a very popular technique in data science because of its unsupervised characteristic – we don’t need true labels of groups in data. In this blog post, I will give you a “quick” survey of various clustering methods applied to synthetic but also real datasets.

Read on for a quick description of what clustering is and a few use cases. Then, Peter dives into a variety of techniques and important things you should know about them. H/T R-Bloggers.
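
If you want a toy example to run alongside the post, a minimal k-means clustering of a synthetic two-group dataset in base R looks like this (the post covers many more methods and how to choose among them):

set.seed(42)
# Synthetic data: two Gaussian blobs in two dimensions
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2)
)

# k-means with k = 2; no true labels are needed (unsupervised)
km <- kmeans(x, centers = 2, nstart = 25)
table(km$cluster)
plot(x, col = km$cluster, pch = 19, main = "k-means with k = 2")
points(km$centers, col = 1:2, pch = 4, cex = 2, lwd = 3)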

The Triangular Distribution in TidyDensity

Steven Sanderson unleashes the power of the triangle:

Welcome back, fellow data enthusiasts! Today, we embark on an exciting journey into the world of statistical distributions with a special focus on the latest addition to the TidyDensity package – the triangular distribution. Tightly packed and versatile, this distribution brings a unique flavor to your data simulations and analyses. In this blog post, we’ll delve into the functions provided, understand their arguments, and explore the wonders of the triangular distribution.

Read on to learn what the triangular distribution is and how you can work with it in TidyDensity.
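
For a feel of the distribution itself, the triangular distribution on [a, b] with mode c can be sampled via its inverse CDF. Here is a base-R sketch (deliberately not the TidyDensity interface, whose function names and arguments the post walks through):

# Sample from a triangular distribution on [a, b] with mode c
# using the inverse CDF method
rtriangular <- function(n, a = 0, b = 1, c = 0.5) {
  u <- runif(n)
  f <- (c - a) / (b - a)                    # CDF value at the mode
  ifelse(
    u < f,
    a + sqrt(u * (b - a) * (c - a)),        # left branch of the triangle
    b - sqrt((1 - u) * (b - a) * (b - c))   # right branch of the triangle
  )
}

set.seed(123)
hist(rtriangular(10000, a = 0, b = 10, c = 3), breaks = 50,
     main = "Triangular(0, 10, mode = 3)")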

Explaining Models with Classic Methods and SHAP

Michael Mayer has some ‘splainin to do:

Let’s explain a {tidymodels} random forest by classic explainability methods (permutation importance, partial dependence plots (PDP), Friedman’s H statistics), and also fancy SHAP.

Disclaimer: {hstats}, {kernelshap} and {shapviz} are three of my own packages.

What I really appreciate here is that Michael includes the classic methods. It can be easy to say “Oh, this is old and therefore no longer relevant.” But that would be quite wrong.
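
As a reminder of how simple the classic ideas are, here is a hand-rolled permutation importance for a random forest fit with {ranger}. This is a sketch of the general technique, not Michael's code, which uses {hstats}, {kernelshap}, and {shapviz}:

library(ranger)

set.seed(1)
fit <- ranger(mpg ~ ., data = mtcars)

# Baseline error on the data; in practice use a holdout set
rmse <- function(y, pred) sqrt(mean((y - pred)^2))
base_rmse <- rmse(mtcars$mpg, predict(fit, mtcars)$predictions)

# Permutation importance: shuffle one feature at a time and measure
# how much the prediction error increases
perm_importance <- sapply(setdiff(names(mtcars), "mpg"), function(v) {
  shuffled <- mtcars
  shuffled[[v]] <- sample(shuffled[[v]])
  rmse(mtcars$mpg, predict(fit, shuffled)$predictions) - base_rmse
})
sort(perm_importance, decreasing = TRUE)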

LOWESS Smoothing in R

Steven Sanderson had me thinking of LOESS but then, bam!, snuck this in on me:

Locally Weighted Scatterplot Smoothing, or Lowess, is a powerful technique for capturing trends in noisy data. It’s particularly useful when dealing with datasets that exhibit complex patterns that might be missed by other methods. So, let’s get our hands dirty and start coding!

Read on for an example of LOWESS smoothing, which actually is a little different from LOESS. If you’re interested in learning more about the differences between LOESS and LOWESS, this Stack Exchange question and answer page is really good.
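
For reference, base R ships lowess() (and loess() for the related method), so a minimal example needs no extra packages:

# LOWESS smoothing of MPG as a function of horsepower
plot(mtcars$hp, mtcars$mpg, pch = 19,
     xlab = "Horsepower", ylab = "MPG")
lines(lowess(mtcars$hp, mtcars$mpg, f = 2/3), col = "red", lwd = 2)

# Compare with loess(), which fits local polynomials and returns a model object
fit <- loess(mpg ~ hp, data = mtcars, span = 0.75)
ord <- order(mtcars$hp)
lines(mtcars$hp[ord], predict(fit)[ord], col = "blue", lwd = 2, lty = 2)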

Quantile Regression using Random Forests

Norm Matloff answers a reader question:

In my December 22 blog, I first introduced the classic parametric quantile regression (QR) concept. I then showed how one could use the qeML package to perform quantile regression nonparametrically, using the package’s qeKNN function for a k-Nearest Neighbors approach. A reader then asked if this could be applied to random forests (RFs). The answer is yes, and this will be the topic of the current post.

Read on to learn more about how to do this, including some of the challenges you’ll face along the way. H/T R-Bloggers.
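
If you want to experiment outside qeML, {ranger} exposes quantile regression forests directly. A sketch under those assumptions (argument names are ranger's, not qeML's):

library(ranger)

set.seed(1)
# quantreg = TRUE keeps the per-node observations needed for quantile estimates
fit <- ranger(mpg ~ ., data = mtcars, quantreg = TRUE)

# Predict the 10th, 50th, and 90th percentiles of MPG for each car
pred <- predict(fit, mtcars, type = "quantiles",
                quantiles = c(0.1, 0.5, 0.9))
head(pred$predictions)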

Reversion to the Mean

Holger von Jouanne-Diedrich explains an important statistical concept we all too often forget:

In the realm of business and leadership, one statistical phenomenon often goes unrecognized yet significantly influences our understanding of performance and success. This is the concept of reversion to the mean (also called regression to the mean). This seemingly simple statistical occurrence can profoundly impact how we perceive management strategies, leadership effectiveness, and even the fate of those gracing the covers of prominent magazines. To understand what is going on, read on!

Read on for a video in German and an article in English, with some bonus R code to sell the story.
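
A quick simulation makes the effect easy to see: generate two independent noisy measurements of the same underlying skill and look at how the top performers on the first measurement fare on the second (a sketch to illustrate the concept, not Holger's code):

set.seed(42)
n <- 10000
skill  <- rnorm(n)             # true underlying ability
score1 <- skill + rnorm(n)     # first noisy measurement
score2 <- skill + rnorm(n)     # second, independent noisy measurement

# The top 5% on the first measurement look much less impressive on the second,
# even though nothing about their true skill has changed
top <- score1 > quantile(score1, 0.95)
mean(score1[top])   # far above average
mean(score2[top])   # noticeably closer to the mean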

Notes on Linear Markov Chains

John Mount has some thoughts for us:

I want to collect some “great things to know about linear Markov chains.”

For this note we are working with a Markov chain on states that are the integers 0 through k (k > 0). A Markov chain is an iterative random process with time tracked as an increasing integer t, and the next state of the chain depending only on the current (soon to be previous) state. For our linear Markov chain the only possible next states from state i are: i (called a “self loop” when present), i+1 (called up or right), and i-1 (called down or left). In no case does the chain progress below 0 or above k.

Click through for notes on two variants of this sort of linear Markov chain, as well as a pair of appendices containing derivation notes and Python code.
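
To make the setup concrete, here is a small R simulation of such a chain on states 0 through k (a sketch of the structure John describes, not his Python code; clamping at the boundaries is one simple convention for keeping the chain in [0, k]):

# Simulate a linear Markov chain on states 0..k: from state i the only
# moves are to i - 1, i (self loop), or i + 1
simulate_chain <- function(steps, k, start = 0,
                           p_up = 0.4, p_down = 0.4) {
  state <- start
  path <- numeric(steps + 1)
  path[1] <- state
  for (t in seq_len(steps)) {
    move <- sample(c(-1, 0, 1), size = 1,
                   prob = c(p_down, 1 - p_up - p_down, p_up))
    state <- min(max(state + move, 0), k)   # never leave [0, k]
    path[t + 1] <- state
  }
  path
}

set.seed(1)
plot(simulate_chain(200, k = 10), type = "s",
     xlab = "t", ylab = "state")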
