Where Machine Learning And Econometrics Collide

Dave Giles shares some thoughts on how machine learning and econometrics relate:

What is Machine Learning (ML), and how does it differ from Statistics (and hence, implicitly, from Econometrics)?

Those are big questions, but I think that they’re ones that econometricians should be thinking about. And if I were starting out in Econometrics today, I’d take a long, hard look at what’s going on in ML.

Click through for some quick thoughts and several resources on the topic.

Solving Naive Bayes By Hand

I have a post that requires math and is meaner toward the Buffalo Bills than I normally am:

Trust the Process
There are three steps to the process of solving the simplest of Naive Bayes algorithms. They are:
1. Find the probability of winning a game (that is, our prior probability).
2. Find the probability of winning given each input variable: whether Josh Allen starts the game, whether the team is home or away, whether the team scores 14 points, and who the top receiver was.
3. Plug in values from our new data into the formula to obtain the posterior probability.
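The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the linked post: the `games` data, the two input variables, and the function names are all made up for the example. It uses the standard Naive Bayes form, which multiplies the prior by the probability of each input value given the outcome, and it applies no smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (home_or_away, scored_14_plus, outcome).
games = [
    ("home", "yes", "win"),
    ("home", "yes", "win"),
    ("away", "yes", "win"),
    ("home", "no", "loss"),
    ("away", "no", "loss"),
    ("away", "yes", "loss"),
    ("away", "no", "loss"),
    ("home", "no", "win"),
]

def train(rows):
    """Count outcomes, and count each input value within each outcome."""
    class_counts = Counter(row[-1] for row in rows)
    feature_counts = defaultdict(Counter)
    for *features, label in rows:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def posterior(class_counts, feature_counts, features):
    """Return the posterior probability of each outcome for one new game."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # step 1: prior probability of the outcome
        for i, value in enumerate(features):
            # step 2: probability of this input value given the outcome
            score *= feature_counts[(i, label)][value] / n
        scores[label] = score
    # step 3: normalize so the posteriors sum to 1
    # (no smoothing, so this fails if every outcome scores zero)
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

result = posterior(*train(games), ("home", "yes"))
print(result)
```

With this toy data, a home game with 14 or more points scored comes out heavily in favor of a win, which is easy to verify by hand from the counts.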

This is an algorithm you want to solve by hand first—it’s just that easy. Then, once you understand it, let a computer do the work for larger data sets. Also, Super Bowl 2020 because I’m the kind of overly optimistic fool required of Bills fans. Just gonna leave this link here.

The Basics Of Naive Bayes Classifiers

I have the first post in a series up on using the Naive Bayes class of algorithms for classifying inputs:

Why Should We Use Naive Bayes? Is It the Best Classifier Out There?
Probably not, no. In fact, it’s typically a mediocre classifier—it’s the one you strive to beat with your fancy algorithm. So why even care about this one?
Because it’s fast, easy to understand, and it works reasonably well. In other words, this is the classifier you start with to figure out if it’s worth investing your time on a problem. If you need to hit 90% category accuracy and Naive Bayes is giving you 70%, you’re probably in good shape; if it’s giving you 20% accuracy, you might need to take another look at whether you have a viable solution given your data.

Click through to learn what day it is based on what some fictional fellow wears as a head covering. Also, learn what it is I actually mean when I let “update your priors” slip.

Practical AI Workshop Notebooks

David Smith has published a set of notebooks from the Practical AI for the Working Software Engineer workshop:

Last month, I delivered the one-day workshop Practical AI for the Working Software Engineer at the Artificial Intelligence Live conference in Orlando. As the title suggests, the workshop was aimed at developers, but I didn’t assume any particular programming language background. In addition to the lecture slides, the workshop was delivered as a series of Jupyter notebooks. I ran them using Azure Notebooks (which meant the participants had nothing to install and very little to set up), but you can run them in any Jupyter environment you like, as long as it has access to R and Python. You can download the notebooks and slides from this GitHub repository (and feedback is welcome there, too).

Read on for details about those notebooks and to get your own copies.

Dynamic Programming In R With RcppDynProg

John Mount has a new package available in R:

In the above we have an input (or independent variable) x and an observed outcome (or dependent variable) y_observed (portrayed as points). y_observed is the unobserved ideal value y_ideal (portrayed by the dashed curve) plus independent noise. The modeling goal is to get close to the y_ideal curve using the y_observed observations. Obviously this can be done with a smoothing spline, but let’s use RcppDynProg to find a piecewise linear fit.
To encode this as a dynamic programming problem we need to build a cost matrix that, for every consecutive interval of x-values, holds the estimated out-of-sample quality of fit. This is supplied by the function RcppDynProg::lin_costs() (using the PRESS statistic), but let’s take a quick look at the idea.
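To make the idea concrete, here is a rough Python sketch of the same kind of dynamic program. It is not RcppDynProg's implementation: the function names are hypothetical, the interval cost is the plain in-sample squared error of a least-squares line rather than the PRESS statistic, and a fixed per-segment `penalty` (an assumption of this sketch) stands in for whatever guards against over-segmenting.

```python
def line_sse(xs, ys):
    """Sum of squared errors of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def best_segments(xs, ys, penalty, min_len=2):
    """Split the points into consecutive intervals with the lowest total cost."""
    n = len(xs)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[k]: lowest cost of covering the first k points
    best[0] = 0.0
    prev = [0] * (n + 1)    # prev[k]: where the last interval in that solution starts
    for end in range(min_len, n + 1):
        for start in range(0, end - min_len + 1):
            if best[start] == inf:
                continue
            cost = best[start] + line_sse(xs[start:end], ys[start:end]) + penalty
            if cost < best[end]:
                best[end] = cost
                prev[end] = start
    # Walk back through prev to recover the chosen intervals (inclusive indices).
    segments = []
    end = n
    while end > 0:
        start = prev[end]
        segments.append((start, end - 1))
        end = start
    return segments[::-1]

xs = list(range(10))
ys = [0, 1, 2, 3, 4, 4, 3, 2, 1, 0]  # rises, then falls
print(best_segments(xs, ys, penalty=0.5))
```

On this noiseless tent-shaped example the sketch splits the data into a rising interval and a falling one. Swapping in an out-of-sample cost such as PRESS changes only `line_sse`; the dynamic program itself stays the same.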

It’s an interesting package whose purpose is to turn an input data stream into a set of linear functions which approximate the stream. I’m not sure I’ll ever have a chance to use it, but it’s good to know that it’s there if I do ever need it.

Training A Text Classifier Against Books

Julia Silge builds a text classifier to differentiate Pride and Prejudice from War of the Worlds:

Now it’s time to train our classification model! Let’s use the glmnet package to fit a logistic regression model with LASSO regularization. It’s a great fit for text classification because the variable selection that LASSO regularization performs can tell you which words are important for your prediction problem. The glmnet package also supports parallel processing with very little hassle, so we can train on multiple cores with cross-validation on the training set using cv.glmnet().

Hot take: Jane Austen was the best English-language novelist of the 19th century. I’d say “all-time” but the world isn’t ready for a take that hot.

Load Multiple Input Data Sets For ML Services

Niels Berglund shows us a way to get more than one input data set passed into SQL Server Machine Learning Services:

This post came about due to a question on the Microsoft Machine Learning Server forum. The question was if there are any plans by Microsoft to support more than one input dataset (@input_data_1) in sp_execute_external_script. My immediate reaction was that if you want more than one dataset, you can always connect from the script back into the database, and retrieve data.
However, the poster was well aware of that, but due to certain reasons he did not want to do it that way – he wanted to push in the data, fair enough. When I read this, I seemed to remember something from a while ago, where, instead of retrieving data from inside the script, they pushed in the data, serialized it as an output parameter and then used the binary representation as an input parameter (yeah – this sounds confusing, but bear with me). I did some research (read Googling), and found this StackOverflow question, and answer. So for future questions, and for me to remember, I decided to write a blog post about it.
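The serialization round trip described above can be illustrated outside of SQL Server. The following Python sketch is only an analogy under stated assumptions: it does not call sp_execute_external_script, the function and column names are hypothetical, and `pickle` stands in for whichever serializer the external script would use. The first function plays the role of the script that returns a data set as a binary output parameter; the second plays the role of the script that receives that binary value as an extra input parameter alongside its one real input data set.

```python
import pickle

def serialize_rows(rows):
    """First call: turn a data set into bytes that can travel as a binary parameter."""
    return pickle.dumps(rows)

def score_with_extra_data(input_rows, extra_blob):
    """Second call: the main data set arrives normally, the second one as bytes."""
    # Only unpickle bytes you produced yourself; pickle is unsafe on untrusted data.
    extra_rows = pickle.loads(extra_blob)
    names_by_id = {row["customer_id"]: row["name"] for row in extra_rows}
    return [
        {**row, "name": names_by_id.get(row["customer_id"])}
        for row in input_rows
    ]

# Hypothetical data standing in for two query results.
customers = [{"customer_id": 1, "name": "Ann"}, {"customer_id": 2, "name": "Bob"}]
orders = [{"customer_id": 2, "amount": 30.0}, {"customer_id": 1, "amount": 12.5}]

blob = serialize_rows(customers)             # the "output parameter"
print(score_with_extra_data(orders, blob))   # the blob reused as an "input parameter"
```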

This has been a point of frustration for me. We can name the one input data set, so I’d really like to see true support for multiple input data sets without the need for hacks.

Forecasting Field Goal Percentages With Prophet

Marlon Ribunal uses the Prophet library in R to forecast critical information:

I’ve been looking for an easy way to get to learning predictive analysis and forecasting. Prophet provides that path. Prophet is released by Facebook’s Core Data Science Team.
“Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.”
Just to dip my toes into the waters, I tried Prophet’s Quick Start Guide in R.
Let’s forecast the Field Goal Percentage (FG%) of Kyle Kuzma of the Los Angeles Lakers for the next 6 Months.

It’d be critical and important if it were hockey data. Or football data or baseball data or maybe even cricket data (but I don’t understand cricket data and why is that guy still running didn’t he get thrown out or something I don’t get it?).

As far as Prophet goes, it’s a useful library and works well if you’re looking at seasonal time series data.

Variable Screening With vtreat

John Mount explains how you can use vtreat for determining variable importance:

Part of the vtreat philosophy is to assume after the vtreat variable processing the next step is a sophisticated supervised machine learning method. Under this assumption we assume the machine learning methodology (be it regression, tree methods, random forests, boosting, or neural nets) will handle issues of redundant variables, joint distributions of variables, overall regularization, and joint dimension reduction.
However, an important exception is: variable screening. In practice we have seen wide data-warehouses with hundreds of columns overwhelm and defeat state of the art machine learning algorithms due to over-fitting. We have some synthetic examples of this (here and here).
The upshot is: even in 2018 you can not treat every column you find in a data warehouse as a variable. You must at least perform some basic screening.

Read on to see a couple quick functions which help with this screening.

Reviewing Word Associations With R

Julia Silge does some exploratory analysis on the Small World of Words project:

The Small World of Words project focuses on word associations. You can try it out for yourself to see how it works, but the general idea is that the participant is presented with a word (from “telephone” to “journalist” to “yoga”) and is then asked to give their immediate association with that word. The project has collected more than 15 million responses to date, and is still collecting data. You can check out some pre-built visualizations the researchers have put together to explore the dataset, or you can download the data for yourself.

It’s an interesting analysis of the data set, mixed in with some good R code.

