Text Preprocessing With R

Sibanjan Das has started a new series on text mining in R:

Next, we need to preprocess the text to convert it into a format that can be processed for extracting information. It is essential to reduce the size of the feature space before analyzing the text. There are various preprocessing methods that we can use here, such as stop word removal, case folding, stemming, lemmatization, and contraction simplification. However, it is not necessary to apply all of the normalization methods to the text. It depends on the data we retrieve and the kind of analysis to be performed.

The series starts off with a quick description of some preprocessing steps and then building an LDA model to extract key terms from articles.

Related Posts

Data Science And Data Engineering In HDP 3.0

Saumitra Buragohain, et al, show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers: We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different […]

Read More

Using wrapr For A Consistent Pipe With ggplot2

John Mount shows how you can use the wrapr pipe to perform data processing and building a ggplot2 visual: Now we can run a single pipeline that combines data processing steps and ggplot plot construction. data.frame(x = 1:20) %.>% mutate(., y = cos(3*x)) %.>% ggplot(., aes(x = x, y = y)) %.>% geom_point() %.>% geom_line() %.>% ggtitle("piped ggplot2") Check […]

Read More


November 2017
« Oct Dec »