Press "Enter" to skip to content

Category: Data Science

Evaluating a Classification Model with a Spam Filter

John Mount shares an extract from Mount and Nina Zumel’s Practical Data Science with R, 2nd Edition:

This section reflects an important design decision in the book: teach model evaluation first, and as a step separate from model construction.

It is funny, but it takes some effort to teach in this way. New data scientists want to dive into the details of model construction first, and statisticians are used to getting model diagnostics as a side-effect of model fitting. However, to compare different modeling approaches one really needs good model evaluation that is independent of the model construction techniques.

Click through for that extract. I liked the first edition of the book, so I’m looking forward to the 2nd.
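
To make the design decision concrete, here is a minimal sketch of evaluation kept separate from construction, in Python rather than the book's R, with made-up labels and predictions: the same scoring code works no matter which model produced the predictions.

```python
# Evaluate a spam classifier without caring how it was built: any model that
# emits 0/1 predictions can be scored by the same code. (Made-up data, not
# the book's example.)
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # predictions from any model at all

print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 0.75
```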

Training, Validation, and Test Data Sets with SAS Viya

Beth Ebersole takes us through creating training, validation, and test data sets using SAS Viya:

Training data are used to fit each model. Training a model involves using an algorithm to determine model parameters (e.g., weights) or other logic to map inputs (independent variables) to a target (dependent variable). Model fitting can also include input variable (feature) selection. Models are trained by minimizing an error function.

For illustration purposes, let’s say we have a very simple ordinary least squares regression model with one input (independent variable, x) and one output (dependent variable, y). Perhaps our input variable is how many hours of training a dog or cat has received, and the output variable is the combined total of how many fingers or limbs we will lose in a single encounter with the animal.

Read on for some good notes, including the difference between mean squared error and average squared error.
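
SAS Viya handles the partitioning for you, but the three-way split itself is tool-agnostic. As a rough sketch of the same idea in Python, here is a 60/20/20 split on invented data:

```python
# A 60/20/20 train/validation/test split, sketched in Python with made-up
# data; SAS Viya does this partitioning natively, so this is just the concept.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)               # made-up input
y = 2 * X.ravel() + np.random.normal(size=100)  # made-up target

# Carve off 20% for testing, then 25% of the remainder (20% overall) for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))    # 60 20 20
```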

Multiple Hypothesis Testing with R

Roland Stevenson shows how we can perform multiple hypothesis tests on data, as well as potential issues:

Both results show that evaluating two tests on the same family of data will lead to a ~10% chance that a researcher will claim a “significant” result if they look for either test to reject the null. Any claim that there is a maximum 5% false positive rate would be mistaken. As an exercise, verify that doing the same on \(m=4\) tests will lead to an ~18% chance!

A bad testing platform would be one that claims a maximum 5% false positive rate when any one of multiple tests on the same family of data show significance at the 5% level. Clearly, if a researcher is going to claim that the FWER is no more than \(\alpha\), then they must control for the FWER and carefully consider how individual tests reject the null.
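
The exercise in the excerpt is easy to verify: under independence, the family-wise error rate for \(m\) tests at level \(\alpha\) is \(1 - (1 - \alpha)^m\). A few lines of Python confirm both numbers:

```python
# Family-wise error rate for m independent tests at significance level alpha:
# P(at least one false rejection) = 1 - (1 - alpha)^m
alpha = 0.05
for m in (1, 2, 4):
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m}: FWER = {fwer:.4f}")
# m = 2 -> 0.0975 (~10%); m = 4 -> 0.1855 (~18%), matching the excerpt.
```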

This is worth taking some time to read carefully. H/T R-Bloggers

Significance, Confidence Level, and Confidence Interval

Stephanie Glen disambiguates three commonly confused but quite different terms:

In a nutshell, here are the definitions for all three.

1. Significance level: In a hypothesis test, the significance level, alpha, is the probability of making the wrong decision when the null hypothesis is true.
2. Confidence level: The probability that if a poll/test/survey were repeated over and over again, the results obtained would be the same. A confidence level = 1 – alpha.
3. Confidence interval: A range of results from a poll, experiment, or survey that would be expected to contain the population parameter of interest. For example, an average response. Confidence intervals are constructed using significance levels / confidence levels.

Read on for several examples and more elaboration.
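
As a small illustration of how the three terms fit together, here is a sketch of my own (invented sample data, t-interval for a mean) that builds a confidence interval at confidence level \(1 - \alpha\) with significance level \(\alpha = 0.05\):

```python
# Build a confidence interval at confidence level 1 - alpha.
# (Invented data; t-interval for a mean.)
import numpy as np
from scipy import stats

sample = np.array([4.1, 5.2, 4.8, 5.0, 4.6, 5.3, 4.9, 5.1])
alpha = 0.05                             # significance level
mean = sample.mean()
sem = stats.sem(sample)                  # standard error of the mean
low, high = stats.t.interval(1 - alpha, df=len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```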

Principal Component Analysis in Python

Abhinav Choudhary shows us how to implement Principal Component Analysis in Python:

Principal Component Analysis (PCA) is an unsupervised statistical technique used to examine the interrelations among a set of variables in order to identify the underlying structure of those variables. In simple words, suppose you have 30 feature columns in a data frame; PCA helps reduce that number by creating new features, each of which combines the effects of all the features in the data frame. It is closely related to factor analysis.

PCA is quite useful in practice, though it has the unfortunate side effect of making it harder to interpret which factors are driving your solution.
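
For a quick look at what the scikit-learn version gives you, including the explained-variance ratios and component loadings that help (a little) with that interpretation problem, here is a sketch on made-up data:

```python
# A minimal PCA sketch with scikit-learn on made-up data. The explained
# variance ratios and component loadings are the usual (imperfect) aids
# to interpreting what drives each component.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                 # 200 rows, 30 feature columns
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to scale

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                         # (200, 5)
print(pca.explained_variance_ratio_)           # variance captured per component
print(pca.components_[0][:5])                  # first loadings of component 1
```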

Fun with Residual Plots

Nina Zumel explains why, when plotting residuals, you always put predictions on the X axis and residuals on the Y axis:

One reason that the proper residual graph (for a well fit model) should smooth out to the line y=0 is known as reversion to mediocrity, or regression to the mean.

Imagine that you have an ideal process that always produces a single value y. You don’t actually observe this “true value”; instead, what you observe is y plus (IID, zero mean) noise. You can build a “model” for this process that predicts the mean of the observations, in this case the value 0.1033149. Then you can calculate the residuals of your “model” in the usual way.
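
To see the recommended orientation in practice, here is a small sketch (made-up data, not Zumel's example) with predictions on the x-axis and residuals on the y-axis:

```python
# Plot residuals the recommended way: predictions on x, residuals on y,
# with the y = 0 reference line. (Made-up data, not Zumel's example.)
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3 * x + 2 + rng.normal(scale=2, size=200)

slope, intercept = np.polyfit(x, y, 1)   # simple linear fit
pred = slope * x + intercept
resid = y - pred

plt.scatter(pred, resid, s=10)
plt.axhline(0, color="red")              # a well-fit model hugs y = 0
plt.xlabel("prediction")
plt.ylabel("residual")
plt.show()
```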

This post went in a direction I wasn’t expecting, and it was all the better for it.

Topic Modeling

Federico Pascual has an article on topic modeling and topic classification:

Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents. It’s known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.

Since topic modeling doesn’t require training, it’s a quick and easy way to start analyzing your data. However, you can’t guarantee you’ll receive accurate results, which is why many businesses opt to invest time training a topic classification model.

The article is long but worth the read, with examples in Python and additional notes for R.
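
As a taste of the kind of thing the article covers, here is a tiny, self-contained topic-modeling sketch using scikit-learn's LDA on a toy corpus. This is one common approach, not necessarily the one the article uses:

```python
# A tiny topic-modeling sketch: LDA on a toy corpus. Unsupervised -- no
# pre-labeled training data required. (One common approach; not necessarily
# the article's.)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "the stock market fell sharply today",
    "investors worry about market volatility",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-3:]]  # 3 highest-weight words
    print(f"topic {i}: {top}")
```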

A Primer on Survey Analysis

Federico Pascual has a long primer on survey analysis:

When it comes to customer feedback, you’ll find that not all the information you get is useful to your company. This feedback can be categorized into non-insightful and insightful data. The former refers to data you had already spotted as problematic, while insightful information either helps you confirm your hypotheses or notice new issues or opportunities. 

Let’s imagine your company carries out a customer satisfaction survey, and 60% of the respondents claim that the pricing of your product or service is too high. You can use that valuable data to make a decision. That’s why this data is also called actionable insight: it leads to action, validation, or rethinking of a specific strategy you have already implemented.

Survey design and implementation can be pretty difficult. This article does a good job pushing you away from some of the pitfalls around it.

Linear Regression in Power BI

Joseph Yeates shows how to implement linear regression in Power BI:

The goal of a simple linear model is to fit a line onto this plot to summarize the shape of the data using the equation \(y = ax + b\).

The “a” value is the slope of the fitted line (rise over run) and the “b” value is the intercept on the y-axis (when x is equal to zero).

In the gapminder example, the life expectancy column was assigned as the “y” variable, as it is the outcome that we are interested in predicting or understanding. The year1950 column was assigned as the “x” variable, as it is what we are using to try to measure the change in life expectancy.
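
For comparison, the same fit is a couple of lines outside Power BI. Here is a sketch with invented gapminder-style numbers, assuming year1950 means years since 1950:

```python
# Ordinary least squares for y = a*x + b via numpy, outside Power BI.
# (Invented gapminder-style numbers; assumes year1950 means years since 1950.)
import numpy as np

year1950 = np.array([0, 10, 20, 30, 40, 50])               # x: years since 1950
life_exp = np.array([48.5, 52.1, 56.0, 60.2, 63.9, 67.5])  # y: made-up values

a, b = np.polyfit(year1950, life_exp, 1)   # slope a, intercept b
print(f"slope a = {a:.3f}, intercept b = {b:.2f}")
print("predicted life expectancy in 1990:", round(a * 40 + b, 1))
```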

This is a little more complicated than adding a regression line to a scatterplot (the “normal” way to do linear regression with Power BI) but this method lets you work with the outputs in a way that the normal method doesn’t.
