Calculating TF-IDF Using Apache Spark

Arseniy Tashoyan shows us how to calculate Term Frequency-Inverse Document Frequency using Apache Spark:

TF-IDF is used in a large variety of applications. Typical use cases include:

  • Document search.
  • Document tagging.
  • Text preprocessing and feature vector engineering for Machine Learning algorithms.

There is a vast number of resources on the web explaining the concept itself and the calculation algorithm. This article does not repeat the information in these other Internet resources, it just illustrates TF-IDF calculation with help of Apache Spark. Emml Asimadi, in his excellent article Understanding TF-IDF, shares an approach based on the old Spark RDD and the Python language. This article, on the other hand, uses the modern Spark SQL API and Scala language.

Although Spark MLlib has an API to calculate TF-IDF, this API is not convenient to learn the concept. MLlib tools are intended to generate feature vectors for ML algorithms. There is no way to figure out the weight for a particular term in a particular document. Well, let’s make it from scratch, this will sharpen our skills.

Read on for the solution.  It seems that there tend to be better options today than TF-IDF for natural language problems, but it’s an easy algorithm to understand, so it’s useful as a first go.

Related Posts

Hortonworks Data Platform 3.0 Released

Saumitra Buragohain, et al, announce the newest version of the Hortonworks Data Platform: Highlighted Apache Hive features include: Workload management for LLAP:  You can assign resource pools within LLAP pool and allocate resources on a per user or per group basis. This enables support for large multi-tenant deployments. ACID v2 and ACID on by default:  We are […]

Read More

Testing Spatial Equilibrium Concepts With tidycensus

Ignacio Sarmiento Barbieri walks us through the concept of spatial equilibrium and tests using data from the tidycensus package: Let’s take the model to the data and reproduce figures 2.1. and 2.2 of “Cities, Agglomeration, and Spatial Equilibrium”. The focus are two cities, Chicago and Boston. These cities are chosen because both differ in how easy […]

Read More

Categories

June 2018
MTWTFSS
« May Jul »
 123
45678910
11121314151617
18192021222324
252627282930