Tokenizing Text With R

Rachael Tatman shows how to tokenize a set of text as the first step in a natural language processing experiment:

In this tutorial you’ll learn how to:

  • Read text into R
  • Select only certain lines
  • Tokenize text using the tidytext package
  • Calculate token frequency (how often each token shows up in the dataset)
  • Write reusable functions to do all of the above and make your work reproducible

For this tutorial we’ll be using a corpus of transcribed speech from bilingual children speaking in English.  You can find more information on this dataset and download it here.

It’s a nice tutorial, especially because the data set is a bit of a mess.

Related Posts

P-Hacking and Multiple Comparison Bias

Patrick David has a great article on hypothesis testing, p-hacking, and multiple comparison bias: The most important part of hypothesis testing is being clear what question we are trying to answer. In our case we are asking:“Could the most extreme value happen by chance?”The most extreme value we define as the greatest absolute AMVR deviation from […]

Read More

Inline Operators In R With wrapr

John Mount shows how to use inline operators in R with the wrapr package: The above code is assuming you have the wrapr package attached via already having run library('wrapr'). Notice we picked R-related operator names. We stayed away from overloading the + operator, as the arithmetic operators are somewhat special in how they dispatch in R. The goal wasn’t […]

Read More


August 2017
« Jul Sep »