Gathering Punctuation With tidytext

Kevin Feasel

2018-07-02

R

Julia Silge uses the tidytext package to compare works of literature in terms of punctuation usage:

Commas are the PUNCTUATION WINNER, except in Anne of Green Gables and Ulysses, where periods win out. These two novels are dramatically different from each other in other ways, though, and Ulysses is an outlier overall with almost no spoken dialogue via quotation marks and an unusual use of colons to semicolons. Exclamation marks are used relatively more in Wuthering Heights and Alice in Wonderland!

Exploring text in these kinds of ways is so fun, and tools for this type of text mining are developing so fast. You can incorporate information like this into modeling or statistical analysis; Mike Kearney has a package called textfeatures that lets you directly extract info such as the number of commas or number of exclamation marks from text. Let me know if you have any questions!

Yet more proof that Ulysses was an awful book.

Related Posts

Connecting To Elasticsearch With R

Jerod Johnson has a sample of connecting to Elasticsearch with R: You will need the following information to connect to Elasticsearch as a JDBC data source: Driver Class: Set this to cdata.jdbc.elasticsearch.ElasticsearchDriver. Classpath: Set this to the location of the driver JAR. By default, this is the lib subfolder of the installation folder. The DBI functions, […]

Read More

Voice Control For Shiny Apps

Over at Jumping Rivers, an example of using a Javascript library to control a page using voice commands: I have found that performance across all devices and browsers is definitely not equal. By far the best browser I have found for viewing the apps is Google Chrome. I have also tended to find that my […]

Read More

Categories

July 2018
MTWTFSS
« Jun Aug »
 1
2345678
9101112131415
16171819202122
23242526272829
3031