Rachael Tatman shows how to tokenize a text dataset as the first step in a natural language processing experiment:
In this tutorial you’ll learn how to:
- Read text into R
- Select only certain lines
- Tokenize text using the tidytext package
- Calculate token frequency (how often each token shows up in the dataset)
- Write reusable functions to do all of the above and make your work reproducible
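The steps above can be sketched as a single reusable R function. This is a minimal illustration, not the tutorial's own code: the file name `transcript.txt` and the line filter `keep_pattern` are hypothetical placeholders for whatever the downloaded corpus actually contains.

```r
library(readr)    # read text into R
library(dplyr)
library(tidytext) # tokenization with unnest_tokens()

# Hypothetical sketch: read a text file, optionally keep only lines
# matching a pattern, tokenize, and count token frequency.
count_tokens <- function(path, keep_pattern = NULL) {
  lines <- tibble(text = read_lines(path))

  # Select only certain lines, e.g. those matching a regex
  if (!is.null(keep_pattern)) {
    lines <- filter(lines, grepl(keep_pattern, text))
  }

  lines %>%
    unnest_tokens(word, text) %>%  # one token per row, lowercased
    count(word, sort = TRUE)       # how often each token shows up
}

# Usage (file name and pattern are placeholders):
# count_tokens("transcript.txt", keep_pattern = "^\\*CHI")
```

Wrapping the pipeline in a function like this is what makes the analysis reproducible: rerunning it on a new file is a one-line call rather than a block of copy-pasted code.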
For this tutorial we’ll be using a corpus of transcribed speech from bilingual children speaking in English. You can find more information on this dataset and download it here.
It’s a nice tutorial, especially because the dataset is a bit of a mess.