Calculating TF-IDF Using Apache Spark

Arseniy Tashoyan shows us how to calculate Term Frequency-Inverse Document Frequency using Apache Spark:

TF-IDF is used in a large variety of applications. Typical use cases include:

  • Document search.
  • Document tagging.
  • Text preprocessing and feature vector engineering for Machine Learning algorithms.

There is a vast number of resources on the web explaining the concept itself and the calculation algorithm. This article does not repeat the information in these other Internet resources, it just illustrates TF-IDF calculation with help of Apache Spark. Emml Asimadi, in his excellent article Understanding TF-IDF, shares an approach based on the old Spark RDD and the Python language. This article, on the other hand, uses the modern Spark SQL API and Scala language.

Although Spark MLlib has an API to calculate TF-IDF, this API is not convenient to learn the concept. MLlib tools are intended to generate feature vectors for ML algorithms. There is no way to figure out the weight for a particular term in a particular document. Well, let’s make it from scratch, this will sharpen our skills.

Read on for the solution.  It seems that there tend to be better options today than TF-IDF for natural language problems, but it’s an easy algorithm to understand, so it’s useful as a first go.

Related Posts

Mounting HDFS As A Local Filesystem

Guy Shilo looks at two techniques for mounting HDFS as a local filesystem: NFS Gateway is a HDFS component that enables the use to expose HDFS through NFS3 interface so that Linux machines can mount it and access it just as a local filesystem. The manual installation is quite cumbersome and is covered here. Cloudera manager […]

Read More

How Humio Uses Kafka

Kresten Krab describes ways that Humio uses Apache Kafka for their product: Humio is a log analytics system built to run both on-prem and as a hosted offering. It is designed for “on-prem first” because, in many logging use cases, you need the privacy and security of managing your own logging solution. And because volume […]

Read More

Categories

June 2018
MTWTFSS
« May Jul »
 123
45678910
11121314151617
18192021222324
252627282930