Taras Matyashovskyy uses Apache Spark MLlib to categorize songs into different genres:
The roadmap for implementation was pretty straightforward:
- Collect the raw data set of the lyrics (~65k sentences in total):
  - Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
  - Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
- Create training set, i.e. label (0 for metal | 1 for pop) + features (represented as double vectors)
- Train logistic regression that is the obvious selection for the classification
This is a supervised learning problem, and is pretty fun to walk through.
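To make the roadmap concrete, here is a minimal sketch of the same three steps in plain Python. It is not the Spark MLlib pipeline from the original post: the six lyric lines are invented placeholders for the real data set, the features are simple word counts (the post only says "double vectors"), and names such as `featurize`, `train`, and `classify` are hypothetical. The logistic regression is fit with basic stochastic gradient descent rather than MLlib's optimizer.

```python
import math
from collections import Counter

METAL, POP = 0, 1  # label convention from the roadmap: 0 for metal, 1 for pop

# Step 1: raw data. These lines are made-up stand-ins, not real lyrics.
samples = [
    ("darkness falls and the iron hammer strikes", METAL),
    ("blood and fire rain on the battlefield", METAL),
    ("the beast screams in the night of doom", METAL),
    ("baby i love you dance with me tonight", POP),
    ("kiss me baby my heart is dancing", POP),
    ("oh baby baby love me one more time", POP),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


# Map every distinct word in the data set to a column index.
vocab = {
    word: index
    for index, word in enumerate(
        sorted({w for text, _ in samples for w in tokenize(text)})
    )
}


def featurize(text):
    """Step 2: turn a line into a vector of doubles (word counts)."""
    vector = [0.0] * len(vocab)
    for word, count in Counter(tokenize(text)).items():
        if word in vocab:  # words never seen in training are ignored
            vector[vocab[word]] = float(count)
    return vector


def predict_proba(weights, bias, features):
    """Return the estimated probability that the line is pop."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, epochs=500, learning_rate=0.1):
    """Step 3: fit logistic regression with stochastic gradient descent."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for features, label in rows:
            # Gradient of the log loss with respect to z is (p - label).
            error = predict_proba(weights, bias, features) - label
            bias -= learning_rate * error
            weights = [
                w - learning_rate * error * x
                for w, x in zip(weights, features)
            ]
    return weights, bias


# Training set: (double vector, double label) pairs.
training_set = [(featurize(text), float(label)) for text, label in samples]
weights, bias = train(training_set)


def classify(text):
    """Return POP if the estimated probability is at least 0.5, else METAL."""
    probability = predict_proba(weights, bias, featurize(text))
    return POP if probability >= 0.5 else METAL
```

Calling `classify("iron doom")` or `classify("baby love")` then returns one of the two labels. With only six toy lines this says nothing about accuracy on real lyrics; it only shows how labeled examples, feature vectors, and the trained model fit together.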