Training A Text Classifier Against Books

Julia Silge builds a text classifier to differentiate Pride and Prejudice from War of the Worlds:

Now it’s time to train our classification model! Let’s use the glmnet package to fit a logistic regression model with LASSO regularization. It’s a great fit for text classification because the variable selection that LASSO regularization performs can tell you which words are important for your prediction problem. The glmnet package also supports parallel processing with very little hassle, so we can train on multiple cores with cross-validation on the training set using cv.glmnet().

Hot take: Jane Austen was the best English-language novelist of the 19th century. I’d say “all-time” but the world isn’t ready for a take that hot.

Related Posts

Feature And Text Classification Using Naive Bayes In R

I wrap up my series on the Naive Bayes class of algorithms, finally writing some code along the way: Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, […]

Read More

Classifying Texts With Naive Bayes

I continue my series on Naive Bayes with another hand-calculation post: Step two is, on the surface, pretty tough: how do we figure out if a set of words is a business phrase or a baseball phrase? We could try to think up a set of features. For example, how long is the phrase? How many unique […]

Read More

Categories

December 2018
MTWTFSS
« Nov Jan »
 12
3456789
10111213141516
17181920212223
24252627282930
31