Susan Li has a series on multi-class text classification in Python. First up is analysis with PySpark:
Our task is to classify San Francisco Crime Description into 33 pre-defined categories. The data can be downloaded from Kaggle.
Given a new crime description comes in, we want to assign it to one of 33 categories. The classifier makes the assumption that each new crime description is assigned to one and only one category. This is multi-class text classification problem.
-
-
* Example: “STOLEN AUTOMOBILE”
-
-
To solve this problem, we will use a variety of feature extraction technique along with different supervised machine learning algorithms in Spark. Let’s get started!
Then, she looks at multi-class text classification with scikit-learn:
The classifiers and learning algorithms can not directly process the text documents in their original form, as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.
One common approach for extracting features from the text is to use the bag of words model: a model where for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.
Specifically, for each term in our dataset, we will calculate a measure called Term Frequency, Inverse Document Frequency, abbreviated to tf-idf.
This is a nice pair of articles on the topic. Natural Language Processing (and dealing with text in general) is one place where Python is well ahead of R in terms of functionality and ease of use.