Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, which is under 3MB zipped in 2000 reviews in total.
Unike last time, I’m going to break this out into sections with commentary in between. If you want the full script with notebook, check out the GitHub repo I put together for this talk.
Assuming I ever get a chance to do this talk again, I’m probably going to change the data sets in the example given how overplayed iris is.
Step two is, on the surface, pretty tough: how do we figure out if a set of words is a business phrase or a baseball phrase? We could try to think up a set of features. For example, how long is the phrase? How many unique words does it have? Is there a pile of sunflower seeds near the phrase? But there’s an easier way.
Remember the “naive” part of Naive Bayes: all features are independent. And in this case, we can use as features the individual words. Therefore, the probability of a word being a baseball-related word or a business-related word is what matters, and we cross-multiply those probabilities to determine if the overall phrase is a baseball phrase or a business phrase.
Click through for a sports-heavy example and a bonus Nate Barkerson reference.
What is Machine Learning (ML), and how does it differ from Statistics (and hence, implicitly, from Econometrics)?
Those are big questions, but I think that they’re ones that econometricians should be thinking about. And if I were starting out in Econometrics today, I’d take a long, hard look at what’s going on in ML.
Click through for some quick thoughts and several resources on the topic.
Trust the Process
There are three steps to the process of solving the simplest of Naive Bayes algorithms. They are:
1. Find the probability of winning a game (that is, our prior probability).
2. Find the probability of winning given each input variable: whether Josh Allen starts the game, whether the team is home or away, whether the team scores 14 points, and who the top receiver was.
3. Plug in values from our new data into the formula to obtain the posterior probability.
This is an algorithm you want to solve by hand first—it’s just that easy. Then, once you understand it, let a computer do the work for larger data sets. Also, Super Bowl 2020 because I’m the kind of overly optimistic fool required of Bills fans. Just gonna leave this link here.
Why Should We Use Naive Bayes? Is It the Best Classifier Out There?
Probably not, no. In fact, it’s typically a mediocre classifier—it’s the one you strive to beat with your fancy algorithm. So why even care about this one?
Because it’s fast, easy to understand, and it works reasonably well. In other words, this is the classifier you start with to figure out if it’s worth investing your time on a problem. If you need to hit 90% category accuracy and Naive Bayes is giving you 70%, you’re probably in good shape; if it’s giving you 20% accuracy, you might need to take another look at whether you have a viable solution given your data.
Click through to learn what day it is based on what some fictional fellow has as head covering. Also, learn what it is I actually mean when I let “update your priors” slip.
Last month, I delivered the one-day workshop Practical AI for the Working Software Engineer at the Artificial Intelligence Live conference in Orlando. As the title suggests, the workshop was aimed at developers, bu I didn’t assume any particular programming language background. In addition to the lecture slides, the workshop was delivered as a series of Jupyter notebooks. I ran them using Azure Notebooks (which meant the participants had nothing to install and very little to set up), but you can run them in any Jupyter environment you like, as long as it has access to R and Python. You can download the notebooks and slides from this Github repository (and feedback is welcome there, too).
Read on for details about those notebooks and to get your own copies.
In the above we have an input (or independent variable)
xand an observed outcome (or dependent variable)
y_observed(portrayed as points).
y_observedis the unobserved idea value
y_ideal(portrayed by the dashed curve) plus independent noise. The modeling goal is to get close the
y_idealcurve using the
y_observedobservations. Obviously this can be done with a smoothing spline, but let’s use
RcppDynProgto find a piecewise linear fit.
To encode this as a dynamic programming problem we need to build a cost matrix that for every consecutive interval of
x-values we have estimated the out-of sample quality of fit. This is supplied by the function
RcppDynProg::lin_costs()(using the PRESS statistic), but lets take a quick look at the idea.
It’s an interesting package whose purpose is to turn an input data stream into a set of linear functions which approximate the stream. I’m not sure I’ll ever have a chance to use it, but it’s good to know that it’s there if I do ever need it.
Now it’s time to train our classification model! Let’s use the glmnet package to fit a logistic regression model with LASSO regularization. It’s a great fit for text classification because the variable selection that LASSO regularization performs can tell you which words are important for your prediction problem. The glmnet package also supports parallel processing with very little hassle, so we can train on multiple cores with cross-validation on the training set using
Hot take: Jane Austen was the best English-language novelist of the 19th century. I’d say “all-time” but the world isn’t ready for a take that hot.
This post came about due to a question on the Microsoft Machine Learning Server forum. The question was if there are any plans by Microsoft to support more the one input dataset (
sp_execute_external_script. My immediate reaction was that if you want more than one dataset, you can always connect from the script back into the database, and retrieve data.
However, the poster was well aware of that, but due to certain reasons he did not want to do it that way – he wanted to push in the data, fair enough. When I read this, I seemed to remember something from a while ago, where, instead of retrieving data from inside the script, they pushed in the data, serialized it as an output parameter and then used the binary representation as in input parameter (yeah – this sounds confusing, but bear with me). I did some research (read Googling), and found this StackOverflow question, and answer. So for future questions, and for me to remember, I decided to write a blog post about it.
This has been a point of frustration for me. We can name the one input data set, so I’d really like to see true support for input multiple data sets without the need for hacks.
I’ve been looking for an easy way to get to learning predictive analysis and forecasting. Prophet provides that path. Prophet is released by Facebook’s Core Data Science Team.
“Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.”
Just to dip my toes into the waters, I tried Prophet’s Quick Start Guide in R.
Let’s forecast the Field Goal Percentage (FG%) of Kyle Kuzma of the Los Angeles Lakers for the next 6 Months.
It’d be critical and important if it were hockey data. Or football data or baseball data or maybe even cricket data (but I don’t understand cricket data and why is that guy still running didn’t he get thrown out or something I don’t get it?).
As far as Prophet goes, it’s a useful library and works well if you’re looking at seasonal time series data.