Defining TF-IDF

Bruno Stecanella explains the concept behind TF-IDF:

TF-IDF was invented for document search and information retrieval. It works by increasing proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don’t mean much to that document in particular.

However, if the word Bug appears many times in a document, while not appearing many times in others, it probably means that it’s very relevant. For example, if what we’re doing is trying to find out which topics some NPS responses belong to, the word Bug would probably end up being tied to the topic Reliability, since most responses containing that word would be about that topic.

This makes the technique useful for natural language processing, especially in classification problems.

Related Posts

Non-Linear Classifiers with Support Vector Machines

Rahul Khanna continues a series on support vector machines: In this blog post, we will look at a detailed explanation of how to use SVM for complex decision boundaries and build Non-Linear Classifiers using SVM. The primary method for doing this is by using Kernels. In linear SVM we find margin maximizing hyperplane with features […]

Read More

Vectors for Programmers

John Mount has a couple of videos available: We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas do programmers find interesting about vectors, what concepts do they consider safe starting points, and how to condense and present the material. Click through […]

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Categories

May 2019
MTWTFSS
« Apr  
 12345
6789101112
13141516171819
20212223242526
2728293031