Multi-Class Text Classification In Python

Susan Li has a series on multi-class text classification in Python.  First up is analysis with PySpark:

Our task is to classify San Francisco Crime Description into 33 pre-defined categories. The data can be downloaded from Kaggle.

Given a new crime description comes in, we want to assign it to one of 33 categories. The classifier makes the assumption that each new crime description is assigned to one and only one category. This is multi-class text classification problem.

    • * Input: Descript
    • * Example: “STOLEN AUTOMOBILE”
    • * Output: Category
    • * Example: VEHICLE THEFT

To solve this problem, we will use a variety of feature extraction technique along with different supervised machine learning algorithms in Spark. Let’s get started!

Then, she looks at multi-class text classification with scikit-learn:

The classifiers and learning algorithms can not directly process the text documents in their original form, as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.

One common approach for extracting features from the text is to use the bag of words model: a model where for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.

Specifically, for each term in our dataset, we will calculate a measure called Term Frequency, Inverse Document Frequency, abbreviated to tf-idf.

This is a nice pair of articles on the topic.  Natural Language Processing (and dealing with text in general) is one place where Python is well ahead of R in terms of functionality and ease of use.

Related Posts

Bayes’ Theorem In A Picture

Stephanie Glen gives us the basics of Bayes’ Theorem in a picture: Bayes’ Theorem is a way to calculate conditional probability. The formula is very simple to calculate, but it can be challenging to fit the right pieces into the puzzle. The first challenge comes from defining your event (A) and test (B); The second […]

Read More

Basic Forensic Accounting Techniques

I continue my series on forensic accounting techniques: Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale […]

Read More

Categories

March 2018
MTWTFSS
« Feb Apr »
 1234
567891011
12131415161718
19202122232425
262728293031