Tokenizing Text With R

Rachael Tatman shows how to tokenize a body of text as the first step in a natural language processing experiment:

In this tutorial you’ll learn how to:

Read text into R

Select only certain lines

Tokenize text using the tidytext package

Calculate token frequency (how often each token shows up in the dataset)

Write reusable functions to do all of the above and make your work reproducible

For this tutorial we’ll be using a corpus of transcribed speech from bilingual children speaking in English. You can find more information on this dataset and download it here.

It’s a nice tutorial, especially because the data set is a bit of a mess.
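
To give a feel for the steps listed above, here is a minimal sketch in R. It is not the tutorial's code: the sample lines, the `*CHI:` speaker prefix, and the `count_tokens` helper are hypothetical stand-ins, and the text lives in an in-memory vector where a real run would load a file with `readLines()`. It assumes the dplyr and tidytext packages are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical transcript lines standing in for a file read with readLines().
transcript <- c(
  "@Begin",
  "*CHI: the dog saw the cat",
  "*INV: what did the dog do",
  "*CHI: the dog ran"
)

# Hypothetical reusable helper: keep only lines that start with `prefix`,
# split them into word tokens, and count how often each token appears.
count_tokens <- function(lines, prefix = "*CHI:") {
  kept <- lines[startsWith(lines, prefix)]
  # Drop the speaker prefix so it is not counted as a token.
  text <- trimws(substring(kept, nchar(prefix) + 1))
  tibble(text = text) %>%
    unnest_tokens(word, text) %>%  # one lowercased word per row
    count(word, sort = TRUE)       # token frequency, most common first
}

token_counts <- count_tokens(transcript)
print(token_counts)
```

Wrapping the steps in one function means the same filtering and counting can be rerun on another file or another speaker by changing only the arguments, which is the reproducibility point the tutorial makes.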
