Category: Data Science

Generating Data in SQL Server based on Distributions

Rick Dobson builds some data:

I support a data science team that often asks for datasets with different distribution values in uniform, normal, or lognormal shapes. Please present and demonstrate the T-SQL code for populating datasets with random values from each distribution type. I also seek graphical and statistical techniques for assessing how a random sample corresponds to a distribution type.

This is an interesting article, though if you want a set-based version of generating data according to a normal distribution, I have a blog post where I translated the RBAR version into something that performs a bit better. Converting to log-normal form also makes a lot of intuitive sense.
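If you just want the shape of the idea outside of SQL Server, here is a minimal Python sketch (not the article's T-SQL) of generating the three distribution types and checking fit statistically. The Box-Muller step is the same trick a set-based T-SQL version relies on: start from two uniform draws and transform them into a normal draw.

```python
# A rough NumPy/SciPy illustration, not the article's T-SQL:
# two independent uniforms run through Box-Muller give a normal draw,
# and exponentiating a normal draw gives a lognormal one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10_000

u1 = rng.uniform(size=n)          # uniform(0, 1), analogous to a T-SQL RAND()-style draw
u2 = rng.uniform(size=n)
normal = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)   # Box-Muller transform
lognormal = np.exp(normal)        # lognormal is just exp(normal)

# Statistical check: Kolmogorov-Smirnov test against the target distribution
print(stats.kstest(normal, "norm"))                    # high p-value => consistent with N(0, 1)
print(stats.kstest(lognormal, "lognorm", args=(1,)))   # shape parameter sigma = 1
```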

Reviewing Experimental Results in the Process

John Cook talks philosophy of statistics:

Suppose you’re running an A/B test to determine whether a web page produces more sales with one graphic versus another. You plan to randomly assign image A or B to 1,000 visitors to the page, but after only randomizing 500 visitors you want to look at the data. Is this OK or not?

John also has a follow-up article:

Suppose you design an experiment, an A/B test of two page designs, randomizing visitors to Design A or Design B. You planned to run the test for 800 visitors and you calculated some confidence level α for your experiment.

You decide to take a peek at the data after only 300 randomizations, even though your statistician warned you in no uncertain terms not to do that. Something about alpha spending.

You can’t unsee what you’ve seen. Now what?

Read on for a very interesting discussion of the topic. I’m definitely in the Bayesian camp: learn quickly, update frequently, particularly early on when you have little information on the topic and the marginal value of learning one additional piece of information is so high.
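To make that concrete, here is a minimal Bayesian sketch in Python with invented numbers (not taken from John's post): after peeking partway through the test, you simply compute the posterior for each variant and the probability that B beats A, and that calculation stays valid no matter when you look.

```python
# A minimal sketch of the Bayesian "update as you go" view, with made-up numbers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical peek after 300 of 800 planned visitors
a_conversions, a_visitors = 18, 150
b_conversions, b_visitors = 27, 150

# Beta(1, 1) prior (uniform) updated with observed successes and failures
post_a = stats.beta(1 + a_conversions, 1 + a_visitors - a_conversions)
post_b = stats.beta(1 + b_conversions, 1 + b_visitors - b_conversions)

# Monte Carlo estimate of P(conversion rate of B > conversion rate of A)
draws = 100_000
p_b_better = np.mean(post_b.rvs(draws, random_state=rng) > post_a.rvs(draws, random_state=rng))
print(f"P(B > A | data so far) ~ {p_b_better:.3f}")
```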

Handling Imbalanced Data in Classification Algorithms

Matthew Mayo shares a few tips:

Imperfect data is the norm rather than the exception in machine learning. Comparably common is binary class imbalance, where the training data splits into a majority class and a minority class, or is at least moderately skewed. Imbalanced data can undermine a machine learning model by producing model selection biases. Therefore, in the interest of model performance and equitable representation, solving the problem of imbalanced data during training and evaluation is paramount.

This article will define imbalanced data, cover resampling strategies as a solution, discuss appropriate evaluation metrics and algorithmic approaches, and consider the utility of synthetic data and data augmentation to address this imbalance.

Read on for five recommendations, starting with what you should know and then offering up four options for what you can do.
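As a rough illustration of a couple of those options, here is a short Python sketch (my own synthetic example, not from the article) comparing plain accuracy against balanced accuracy and F1, with and without class weighting, on a 95/5 imbalanced dataset.

```python
# A small sketch of two common remedies: class weighting, and metrics that
# aren't plain accuracy. The dataset and parameters here are invented.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

for cw in (None, "balanced"):
    model = LogisticRegression(class_weight=cw, max_iter=1000).fit(X_train, y_train)
    pred = model.predict(X_test)
    print(cw,
          f"accuracy={accuracy_score(y_test, pred):.3f}",
          f"balanced_acc={balanced_accuracy_score(y_test, pred):.3f}",
          f"f1={f1_score(y_test, pred):.3f}")
```

Plain accuracy looks great even when the minority class is handled poorly; the balanced accuracy and F1 columns tell the real story.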

Cross-Correlation of Time Series to Identify Time Lags in SAS

Kevin Scott and David Frede notice the pattern:

Batch manufacturing involves producing goods in batches rather than in a continuous stream. This approach is common in industries such as pharmaceuticals, chemicals, and materials processing, where precise control over the production process is essential to ensure product quality and consistency. One critical aspect of batch manufacturing is the need to manage and understand inherent time delays that occur at various stages of the process.

In the glass manufacturing industry, which operates under the principles of batch manufacturing, precisely controlling the furnace temperature is essential for producing high-quality glass. The process involves melting raw materials like silica sand, soda ash, and limestone at high temperatures, where maintaining the correct temperature is crucial.

Read on to see an example of how you can automate the identification of a time lag using cross-correlation techniques.
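The article works in SAS; as a rough sketch of the same idea in Python, you can cross-correlate two series and take the lag at which the correlation peaks. The data below is made up for illustration.

```python
# Cross-correlation to estimate a time lag between two series (synthetic data).
import numpy as np

rng = np.random.default_rng(7)
n, true_lag = 500, 12

# Stand-in "furnace temperature": smoothed noise, so it has some autocorrelation
temperature = np.convolve(rng.normal(size=n), np.ones(5) / 5, mode="same")
# "Quality" responds to temperature with a delay, plus measurement noise
quality = np.roll(temperature, true_lag) + rng.normal(scale=0.1, size=n)
quality[:true_lag] = quality[true_lag]   # patch the wrap-around at the start

# Demean both series, cross-correlate, and pick the lag with the highest correlation
a = temperature - temperature.mean()
b = quality - quality.mean()
xcorr = np.correlate(b, a, mode="full")
lags = np.arange(-n + 1, n)
print("estimated lag:", lags[np.argmax(xcorr)])   # should come out near 12
```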

An Overview of Classification Algorithms

Matthew Mayo explains several algorithms:

Classification algorithms are at the heart of data science, helping us categorize and organize data into pre-defined classes. These algorithms are used in a wide array of applications, from spam detection and medical diagnosis to image recognition and customer profiling. It is for this reason that those new to data science must know about and understand these algorithms: they lay the foundation for more advanced techniques and provide insight into how data-driven decisions are made.

Let’s take a look at 5 essential classification algorithms, explained intuitively. We will include resources for each to learn more if interested.

Click through for five algorithms and a couple of paragraphs describing how the algorithm works. For a little bit of self-promotion on my end, I have a series on YouTube running right now on the topic of classification where I cover a variety of algorithms. As a spoiler, 4 of the 5 on Matthew’s list will have their own videos, and there are several other algorithms to boot.

Accuracy is Not Enough for Classification

I have a new video:

In this video, I explain why accuracy is not the be-all, end-all measure for classification. After that, I introduce the confusion matrix, a mechanism for tracking predicted versus actual values. Then, I talk about a variety of measures and how we can derive them from the confusion matrix.

The trickiest part of the confusion matrix measures is just remembering which measures correspond to which combinations of cells in the matrix. The second-trickiest part is that R and Python transpose the matrix relative to one another, so reading across the top row in R is equivalent to reading down the first column in Python.
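For reference, here is a short Python sketch of pulling the individual cells out of scikit-learn's confusion matrix and deriving the common measures from them; the labels are a made-up example.

```python
# Deriving the usual measures from scikit-learn's confusion matrix.
# Note the orientation: confusion_matrix() puts actual values in rows and
# predictions in columns (R's caret::confusionMatrix does the reverse).
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)      # of the positives we predicted, how many were right
recall      = tp / (tp + fn)      # of the actual positives, how many we found
specificity = tn / (tn + fp)      # of the actual negatives, how many we found
f1          = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f} f1={f1:.2f}")
```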

Gradient Boosting for Classification

I have a new video:

In this video, I take a look at an alternative to bootstrap aggregation & random forest: boosting. We cover a brief history of boosting and see how it works in action with XGBoost and LightGBM.

This is probably the video with the single largest number of links in my show notes. It’s also one of the shortest in the series; it’s funny how things work out sometimes.
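If you want the minimal version of what boosting looks like in code, here is a short sketch using XGBoost's scikit-learn interface on a built-in dataset (assuming the xgboost package is installed); LightGBM's API looks very similar.

```python
# A minimal boosting sketch: fit an XGBoost classifier and score it.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree is fit to the gradient of the loss left over from the trees before it;
# a small learning rate shrinks each tree's contribution.
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```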

Quick Takes on Logistic Regression

John Cook talks about my favorite form of regression that serves to solve classification problems:

Logistic regression models the probability of a yes/no event occurring. It gives you more information than a model that simply tries to classify yeses and nos. I advised a client to move from an uninterpretable classification method to logistic regression and they were so excited about the result that they filed a patent on it.

It’s too late to patent logistic regression, but they filed a patent on the application of logistic regression to their domain. I don’t know whether the patent was ever granted.

Read on for a few more thoughts on and around logistic regression and logits from a mathematician.
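As a quick illustration of John's point about probabilities versus hard classifications, here is a small Python sketch on synthetic data: predict_proba gives you P(y = 1 | x), predict merely thresholds it at 0.5, and decision_function returns the underlying log-odds.

```python
# Probabilities vs. hard labels from a fitted logistic regression (synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=4, random_state=3)
model = LogisticRegression().fit(X, y)

probs  = model.predict_proba(X[:5])[:, 1]    # P(y = 1) for the first five rows
labels = model.predict(X[:5])                # the same predictions thresholded at 0.5
logits = model.decision_function(X[:5])      # the log-odds, log(p / (1 - p))

print(np.round(probs, 3), labels, np.round(logits, 3))
# The model is linear in the logit: log(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk
```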
