Text Preprocessing With R

Sibanjan Das has started a new series on text mining in R:

Next, we need to preprocess the text to convert it into a format that can be processed for extracting information. It is essential to reduce the size of the feature space before analyzing the text. There are various preprocessing methods that we can use here, such as stop word removal, case folding, stemming, lemmatization, and contraction simplification. However, it is not necessary to apply all of the normalization methods to the text. It depends on the data we retrieve and the kind of analysis to be performed.

The series starts off with a quick description of some preprocessing steps and then building an LDA model to extract key terms from articles.

Related Posts

Testing Spatial Equilibrium Concepts With tidycensus

Ignacio Sarmiento Barbieri walks us through the concept of spatial equilibrium and tests using data from the tidycensus package: Let’s take the model to the data and reproduce figures 2.1. and 2.2 of “Cities, Agglomeration, and Spatial Equilibrium”. The focus are two cities, Chicago and Boston. These cities are chosen because both differ in how easy […]

Read More

Interacting With SQL Server From Pandas

Tomaz Kastrun shows how to use pyodbc to interact with a SQL Server database from Pandas: In the SQL Server Management Studio (SSMS), the ease of using external procedure sp_execute_external_script has been (and still will be) discussed many times. But the reason for this short blog post is the fact that, changing Python environments using Conda package/module management within Microsoft […]

Read More


November 2017
« Oct Dec »