I wrap up my series on the Naive Bayes class of algorithms, finally writing some code along the way:

Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, which is under 3MB zipped in 2000 reviews in total.

Unike last time, I’m going to break this out into sections with commentary in between. If you want the full script with notebook, check out the GitHub repo I put together for this talk.

Assuming I ever get a chance to do this talk again, I’m probably going to change the data sets in the example given how overplayed iris is.

I continue my series on Naive Bayes with another hand-calculation post:

Step two is, on the surface, pretty tough: how

dowe figure out if a set of words is a business phrase or a baseball phrase? We could try to think up a set of features. For example, how long is the phrase? How many unique words does it have? Is there a pile of sunflower seeds near the phrase? But there’s an easier way.Remember the “naive” part of Naive Bayes: all features are independent. And in this case, we can use as features the individual words. Therefore, the probability of a

wordbeing a baseball-related word or a business-related word is what matters, and we cross-multiply those probabilities to determine if the overall phrase is a baseball phrase or a business phrase.

Click through for a sports-heavy example and a bonus Nate Barkerson reference.

Dave Giles shares some thoughts on how machine learning and econometrics relate:

What is Machine Learning (ML), and how does it differ from Statistics (and hence, implicitly, from Econometrics)?

Those are big questions, but I think that they’re ones that econometricians should be thinking about. And if I were starting out in Econometrics today, I’d take a long, hard look at what’s going on in ML.

Click through for some quick thoughts and several resources on the topic.

I have a post that requires math and is meaner toward the Buffalo Bills than I normally am:

Trust the Process

There are three steps to the process of solving the simplest of Naive Bayes algorithms. They are:

1. Find the probability of winning a game (that is, ourprior probability).

2. Find the probability of winning given each input variable: whether Josh Allen starts the game, whether the team is home or away, whether the team scores 14 points, and who the top receiver was.

3. Plug in values from our new data into the formula to obtain theposterior probability.

This is an algorithm you want to solve by hand first—it’s just that easy. Then, once you understand it, let a computer do the work for larger data sets. Also, Super Bowl 2020 because I’m the kind of overly optimistic fool required of Bills fans. Just gonna leave this link here.

Why Should We Use Naive Bayes? Is It the Best Classifier Out There?

Probably not, no. In fact, it’s typically a mediocre classifier—it’s the one you strive to beat with your fancy algorithm. So why even care about this one?

Because it’s fast, easy to understand, and it works reasonably well. In other words, this is the classifier you start with to figure out if it’s worth investing your time on a problem. If you need to hit 90% category accuracy and Naive Bayes is giving you 70%, you’re probably in good shape; if it’s giving you 20% accuracy, you might need to take another look at whether you have a viable solution given your data.

Click through to learn what day it is based on what some fictional fellow has as head covering. Also, learn what it is I actually mean when I let “update your priors” slip.

Last month, I delivered the one-day workshop Practical AI for the Working Software Engineer at the Artificial Intelligence Live conference in Orlando. As the title suggests, the workshop was aimed at developers, bu I didn’t assume any particular programming language background. In addition to the lecture slides, the workshop was delivered as a series of Jupyter notebooks. I ran them using Azure Notebooks (which meant the participants had nothing to install and very little to set up), but you can run them in any Jupyter environment you like, as long as it has access to R and Python. You can download the notebooks and slides from this Github repository (and feedback is welcome there, too).

Read on for details about those notebooks and to get your own copies.

John Mount has a new package available in R:

In the above we have an input (or independent variable)

`x`

and an observed outcome (or dependent variable)`y_observed`

(portrayed as points).`y_observed`

is the unobserved idea value`y_ideal`

(portrayed by the dashed curve) plus independent noise. The modeling goal is to get close the`y_ideal`

curve using the`y_observed`

observations. Obviously this can be done with a smoothing spline, but let’s use`RcppDynProg`

to find a piecewise linear fit.

To encode this as a dynamic programming problem we need to build a cost matrix that for every consecutive interval of`x`

-values we have estimated the out-of sample quality of fit. This is supplied by the function`RcppDynProg::lin_costs()`

(using the PRESS statistic), but lets take a quick look at the idea.

It’s an interesting package whose purpose is to turn an input data stream into a set of linear functions which approximate the stream. I’m not sure I’ll ever have a chance to use it, but it’s good to know that it’s there if I do ever need it.

Julia Silge builds a text classifier to differentiate Pride and Prejudice from War of the Worlds:

Now it’s time to train our classification model! Let’s use the glmnet package to fit a logistic regression model with LASSO regularization. It’s a great fit for text classification because the variable selection that LASSO regularization performs can tell you which words are important for your prediction problem. The glmnet package also supports parallel processing with very little hassle, so we can train on multiple cores with cross-validation on the training set using

`cv.glmnet()`

.

Hot take: Jane Austen was the best English-language novelist of the 19th century. I’d say “all-time” but the world isn’t ready for a take that hot.

This post came about due to a question on the Microsoft Machine Learning Server forum. The question was if there are any plans by Microsoft to support more the one input dataset (

`@input_data_1`

) in`sp_execute_external_script`

. My immediate reaction was that if you want more than one dataset, you can always connect from the script back into the database, and retrieve data.

However, the poster was well aware of that, but due to certain reasons he did not want to do it that way – he wanted to push in the data, fair enough. When I read this, I seemed to remember something from a while ago, where, instead of retrieving data from inside the script, they pushed in the data, serialized it as an output parameter and then used the binary representation as in input parameter (yeah – this sounds confusing, but bear with me). I did some research (read Googling), and found this StackOverflow question, and answer. So for future questions, and for me to remember, I decided to write a blog post about it.

This has been a point of frustration for me. We can name the one input data set, so I’d really like to see true support for input multiple data sets without the need for hacks.

Marlon Ribunal uses the Prophet library in R to forecast critical information:

I’ve been looking for an easy way to get to learning predictive analysis and forecasting. Prophet provides that path. Prophet is released by Facebook’s Core Data Science Team.

“Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.”

Just to dip my toes into the waters, I tried Prophet’s Quick Start Guide in R.

Let’s forecast the Field Goal Percentage (FG%) of Kyle Kuzma of the Los Angeles Lakers for the next 6 Months.

It’d be critical and important if it were hockey data. Or football data or baseball data or maybe even cricket data (but I don’t understand cricket data and why is that guy still running didn’t he get thrown out or something I don’t get it?).

As far as Prophet goes, it’s a useful library and works well if you’re looking at seasonal time series data.

Kevin Feasel

2019-01-18

Data Science, R