Using Spark MLlib For Categorization

Taras Matyashovskyy uses Apache Spark MLlib to categorize songs in different genres:

The roadmap for implementation was pretty straightforward:

  • Collect the raw data set of the lyrics (~65k sentences in total):

    • Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
    • Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
  • Create training set, i.e. label (0 for metal | 1 for pop) + features (represented as double vectors)

  • Train logistic regression that is the obvious selection for the classification

This is a supervised learning problem, and is pretty fun to walk through.

Related Posts

Patterns for ML Models in Production

Jeff Fletcher shows four patterns for productionalizing Machine Learning models, as well as some things to take care of once you’re in production: Operational DatabasesThis option is sometimes considered to be  real-time as the information is provided “as its needed,” but it is still a batch method. Using our telco example, a batch process can […]

Read More

When Not to Use Spark

Ramandeep Kaur gives us several cases when it makes sense not to use Apache Spark: There can be use cases where Spark would be the inevitable choice. Spark considered being an excellent tool for use cases like ETL of a large amount of a dataset, analyzing a large set of data files, Machine learning, and […]

Read More

Categories

November 2016
MTWTFSS
« Oct Dec »
 123456
78910111213
14151617181920
21222324252627
282930