Tokenizing Text With R

Rachael Tatman shows how to tokenize a set of text as the first step in a natural language processing experiment:

In this tutorial you’ll learn how to:

  • Read text into R
  • Select only certain lines
  • Tokenize text using the tidytext package
  • Calculate token frequency (how often each token shows up in the dataset)
  • Write reusable functions to do all of the above and make your work reproducible

For this tutorial we’ll be using a corpus of transcribed speech from bilingual children speaking in English.  You can find more information on this dataset and download it here.

It’s a nice tutorial, especially because the data set is a bit of a mess.

Related Posts

Probabilities And Poker

Steve Miller has a notebook on 5-card draw probabilities: The population of 5 card draw hands, consisting of 52 choose 5 or 2598960 elements, is pretty straightforward both mathematically and statistically. So of course ever the geek, I just had to attempt to show her how probability and statistics converge. In addition to explaining the […]

Read More

There Is No Easy Button With Predictive Analytics

Scott Mutchler dispels some myths: There are a couple of myths that I see more an more these days.  Like many myths they seem plausible on the surface but experienced data scientist know that the reality is more nuanced (and sadly requires more work). Myths: Deep learning (or Cognitive Analytics) is an easy button.  You […]

Read More

Categories

August 2017
MTWTFSS
« Jul Sep »
 123456
78910111213
14151617181920
21222324252627
28293031