Press "Enter" to skip to content

Category: Data Science

Fuzzy Search and Levenshtein Distance

Hoen Nguyen explains a couple of terms:

In the world of search engines and data retrieval, achieving high accuracy and relevance in the results is a constant challenge. One of the techniques used to improve search results is Fuzzy Search.

This blog post will delve into the concept of fuzzy search, its implementation using the Levenshtein Distance, and how to test its effectiveness.

Levenshtein distance is also one of the techniques spell checkers use, comparing word not in its dictionary to other words within a certain distance.

Comments closed

Generating Data in SQL Server based on Distributions

Rick Dobson builds some data:

I support a data science team that often asks for datasets with different distribution values in uniform, normal, or lognormal shapes. Please present and demonstrate the T-SQL code for populating datasets with random values from each distribution type. I also seek graphical and statistical techniques for assessing how a random sample corresponds to a distribution type.

This is an interesting article, though if you want a set-based version of generating data according to a normal distribution, I have a blog post where I translated the RBAR version into something that performs a bit better. Converting to log-normal form also makes a lot of intuitive sense.

Comments closed

Reviewing Experimental Results in the Process

John Cook talks philosophy of statistics:

Suppose you’re running an A/B test to determine whether a web page produces more sales with one graphic versus another. You plan to randomly assign image A or B to 1,000 visitors to the page, but after only randomizing 500 visitors you want to look at the data. Is this OK or not?

John also has a follow-up article:

Suppose you design an experiment, an A/B test of two page designs, randomizing visitors to Design A or Design B. You planned to run the test for 800 visitors and you calculated some confidence level α for your experiment.

You decide to take a peek at the data after only 300 randomizations, even though your statistician warned you in no uncertain terms not to do that. Something about alpha spending.

You can’t unsee what you’ve seen. Now what?

Read on for a very interesting discussion of the topic. I’m definitely in the Bayesian camp: learn quickly update frequently, particularly early on when you have little information on the topic and the marginal value of learning one additional piece of information is so high.

Comments closed

Handling Imbalanced Data in Classification Algorithms

Matthew Mayo shares a few tips:

Imperfect data is the norm rather than the exception in machine learning. Comparably common is the binary class imbalance when the classes in a trained data remains majority/minority class, or is moderately skewed. Imbalanced data can undermine a machine learning model by producing model selection biases. Therefore in the interest of model performance and equitable representation, solving the problem of imbalanced data during training and evaluation is paramount.

This article will define imbalanced data, resampling strategies as solution, appropriate evaluation metrics, kinds of algorithmic approaches, and the utility of synthetic data and data augmentation to address this imbalance.

Read on for five recommendations, starting with what you should know and then offering up four options for what you can do.

Comments closed

Cross-Correlation of Time Series to Identify Time Lags in SAS

Kevin Scott and David Frede notice the pattern:

Batch manufacturing involves producing goods in batches rather than in a continuous stream. This approach is common in industries such as pharmaceuticals, chemicals, and materials processing, where precise control over the production process is essential to ensure product quality and consistency. One critical aspect of batch manufacturing is the need to manage and understand inherent time delays that occur at various stages of the process.

In the glass manufacturing industry, which operates under the principles of batch manufacturing, precisely controlling the furnace temperature is essential for producing high-quality glass. The process involves melting raw materials like silica sand, soda ash, and limestone at high temperatures, where maintaining the correct temperature is crucial.

Read on to see an example of how you can automate the identification of a time lag using cross-correlation techniques.

Comments closed

An Overview of Classification Algorithms

Matthew Mayo explains several algorithms:

Classification algorithms are at the heart of data science, helping us categorize and organize data into pre-defined classes. These algorithms are used in a wide array of applications, from spam detection and medical diagnosis to image recognition and customer profiling. It is for this reason that those new to data science must know about and understand these algorithms: they lay foundations for more advanced techniques and provide insight into how those data-driven decisions are made.

Let’s take a look at 5 essential classification algorithms, explained intuitively. We will include resources for each to learn more if interested.

Click through for five algorithms and a couple of paragraphs describing how the algorithm works. For a little bit of self-promotion on my end, I have a series on YouTube running right now on the topic of classification where I cover a variety of algorithms. As a spoiler, 4 of the 5 on Matthew’s list will have their own videos, and there are several other algorithms to boot.

Comments closed

Accuracy is Not Enough for Classification

I have a new video:

In this video, I explain why accuracy is not the be-all, end-all measure for classification. After that, I introduce the confusion matrix, a mechanism for tracking predicted versus actual values. Then, I talk about a variety of measures and how we can derive them from the confusion matrix.

The trickiest part of the confusion matrix measures is just remembering which measures comport to which combinations in the matrix. The second-trickiest part of the confusion matrix is that R and Python invert them, so reading across the top row in R is equivalent to reading down the first column in Python.

Comments closed

Gradient Boosting for Classification

I have a new video:

In this video, I take a look at an alternative to bootstrap aggregation & random forest: boosting. We cover a brief history of boosting and see how it works in action with XGBoost and LightGBM.

This is probably the video with the single largest number of links in my show notes. It’s also one of the shortest in the series; it’s funny how things work out sometimes.

Comments closed