Gathering Punctuation With tidytext

Kevin Feasel



Julia Silge uses the tidytext package to compare works of literature in terms of punctuation usage:

Commas are the PUNCTUATION WINNER, except in Anne of Green Gables and Ulysses, where periods win out. These two novels are dramatically different from each other in other ways, though, and Ulysses is an outlier overall with almost no spoken dialogue via quotation marks and an unusual use of colons to semicolons. Exclamation marks are used relatively more in Wuthering Heights and Alice in Wonderland!

Exploring text in these kinds of ways is so fun, and tools for this type of text mining are developing so fast. You can incorporate information like this into modeling or statistical analysis; Mike Kearney has a package called textfeatures that lets you directly extract info such as the number of commas or number of exclamation marks from text. Let me know if you have any questions!

Yet more proof that Ulysses was an awful book.

Related Posts

Inline Operators In R With wrapr

John Mount shows how to use inline operators in R with the wrapr package: The above code is assuming you have the wrapr package attached via already having run library('wrapr'). Notice we picked R-related operator names. We stayed away from overloading the + operator, as the arithmetic operators are somewhat special in how they dispatch in R. The goal wasn’t […]

Read More

Feature And Text Classification Using Naive Bayes In R

I wrap up my series on the Naive Bayes class of algorithms, finally writing some code along the way: Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, […]

Read More


July 2018
« Jun Aug »