Using Spark MLlib For Categorization

Taras Matyashovskyy uses Apache Spark MLlib to categorize songs in different genres:

The roadmap for implementation was pretty straightforward:

  • Collect the raw data set of the lyrics (~65k sentences in total):

    • Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
    • Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
  • Create training set, i.e. label (0 for metal | 1 for pop) + features (represented as double vectors)

  • Train logistic regression that is the obvious selection for the classification

This is a supervised learning problem, and is pretty fun to walk through.

Related Posts

Batch Consumption from Kafka with Spark

Swapnil Chougule shares a few tips on performing batch processing of a Kafka topic using Apache Spark: Spark as a compute engine is very widely accepted by most industries. Most of the old data platforms based on MapReduce jobs have been migrated to Spark-based jobs, and some are in the phase of migration. In short, […]

Read More

Securely Accessing External Resources From Databricks AWS

Itai Weiss shows how you can securely hit external data sources when using Databricks for AWS: For security purposes, Databricks Apache Spark clusters are deployed in an isolated VPC dedicated to Databricks within the customer’s account. In order to run their data workloads, there is a need to have secure connectivity between the Databricks Spark […]

Read More

Categories

November 2016
MTWTFSS
« Oct Dec »
 123456
78910111213
14151617181920
21222324252627
282930