Press "Enter" to skip to content

Category: Data Science

Comparing Poisson Regression to Regressing Against Logs

Nina Zumel compares a pair of methods for performing regression when income is the dependent variable:

Regressing against the log of the outcome will not be calibrated; however it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be in terms of percentage of income, not in absolute dollar increments.

Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.

This is an interesting post with a great teaser for the next post in the series.

Comments closed

tidylo: Calculating Log Odds in R

Julia Silge announces a new package, tidylo:

The package contains examples in the README and vignette, but let’s walk though another, different example here. This weighted log odds approach is useful for text analysis, but not only for text analysis. In the weeks since we’ve had this package up and running, I’ve found myself reaching for it in multiple situations, both text and not, in my real-life day job. For this example, let’s look at the same data as my last post, names given to children in the US.

Which names were most common in the 1950s, 1960s, 1970s, and 1980?

This package looks like it’s worth checking out if you deal with frequency-based problems.

Comments closed

Random Forest on Small Numbers of Observations

Neil Saunders takes us through an interesting problem:

A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient?

The random forest was not working as expected: not one of the nutrients was correctly classified. I pointed out that with only one case for each outcome, this was to be expected – as the random forest algorithm samples a proportion of the rows, no correct predictions are likely in this case. As sometimes happens the question was promptly deleted, which was unfortunate as we could have further explored the problem.

Neil decided to explore the problem further regardless and came to some interesting conclusions.

Comments closed

Lasso and Ridge Regression in Python

Kristian Larsen shows off a few regression techniques using Python:

Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients variables are most strongly associated with the response variable. Therefore, when you conduct a regression model it can be helpful to do a lasso regression in order to predict how many variables your model should contain. This secures that your model is not overly complex and prevents the model from over-fitting which can result in a biased and inefficient model.

Read on for demonstrations.

Comments closed

Sales Predictions with Pandas

Megan Quinn shows how you can use Pandas and linear regression to predict sales figures:

Pandas is an open-source Python package that provides users with high-performing and flexible data structures. These structures are designed to make analyzing relational or labeled data both easy and intuitive. Pandas is one of the most popular and quintessential tools leveraged by data scientists when developing a machine learning model. The most crucial step in the machine learning process is not simply fitting a model to a given data set. Most of the model development process takes place in the pre-processing and data exploration phase. An accurate model requires good predictors and, in order to acquire them, the user must understand the raw data. Through Pandas’ numerous data wrangling and analysis tools, this important step can easily be achieved. The goal of this blog is to highlight some of the central and most commonly used tools in Pandas while illustrating their significance in model development. The data set used for this demo consists of a supermarket chain’s sales across multiple stores in a variety of cities. The sales data is broken down by items within the stores. The goal is to predict a certain item’s sale.

Click through for an example of the process, including data cleansing and feature extraction, data analysis, and modeling.

Comments closed

Linear Regression Assumptions

Stephanie Glen has a chart which explains the four key assumptions behind when Ordinary Least Squares is the Best Linear Unbiased Estimator:

If any of the main assumptions of linear regression are violated, any results or forecasts that you glean from your data will be extremely biased, inefficient or misleading. Navigating all of the different assumptions and recommendations to identify the assumption can be overwhelming (for example, normality has more than half a dozen options for testing).

Violating one of the assumptions isn’t the end of the world, though it can make understanding the model and generating accurate predictions harder.

Comments closed

Monte Carlo Simulation in Python

Kristian Larsen has a couple of posts on Monte Carlo style simulation in Python. First up is a post which covers how to generate data from different distributions:

One method that is very useful for data scientist/data analysts in order to validate methods or data is Monte Carlo simulation. In this article, you learn how to do a Monte Carlo simulation in Python. Furthermore, you learn how to make different Statistical probability distributions in Python.

You can also bootstrap your data, reusing data points when building a set of samples:

A useful method for data scientists/data analysts in order to validate methods or data is Bootstrap with Monte Carlo simulation In this article, you learn how to do a Bootstrap with Monte Carlo simulation in Python.

Both posts are worth the read.

Comments closed

An Example of p-Hacking

Vincent Granville explains why using p-values for model-worthiness can lead you to a bad outcome:

Recently, p-values have been criticized and even banned by some journals, because they are used by researchers, who cherry-pick observations and repeat experiments until they obtain a p-value worth publishing to obtain grant money, get tenure, or for political reasons.  Even the American Statistical Association wrote a long article about why to avoid p-values, and what you should do instead: see here.  For data scientists, obvious alternatives include re-sampling techniques: see here and here. One advantage is that they are model-independent, data-driven, and easy to understand. 

Here we explain how the manipulation and treachery works, using a simple simulated data set consisting of purely random, non-correlated observations. Using p-values, you can tell anything you want about the data, even the fact that the features are highly correlated, when they are not. The data set consists of 16 variables and 30 observations, generated using the RAND function in Excel. You can download the spreadsheet here.

And for a more academic treatment of the problem, I love this paper by Andrew Gelman and Eric Loken, particularly because it points out that you don’t have to have malicious intent to end up doing the wrong thing.

Comments closed

Predicting Intermittent Demand

Bruno Rodrigues shows one technique for forecasting intermittent data:

Now, it is clear that this will be tricky to forecast. There is no discernible pattern, no trend, no seasonality… nothing that would make it “easy” for a model to learn how to forecast such data.

This is typical intermittent demand data. Specific methods have been developed to forecast such data, the most well-known being Croston, as detailed in this paper. A function to estimate such models is available in the {tsintermittent} package, written by Nikolaos Kourentzes who also wrote another package, {nnfor}, which uses Neural Networks to forecast time series data. I am going to use both to try to forecast the intermittent demand for the {RDieHarder} package for the year 2019.

Read the whole thing. H/T R-Bloggers

Comments closed