MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX. It is a machine learning framework built from the ground up to be massively scalable and operate within Spark. This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data. You can read more about Spark MLlib in its official documentation.
To use Spark MLlib, we need a way to execute Spark code. In our minds, there’s no better tool for this than Azure Databricks. In the previous post, we covered the creation of an Azure Databricks environment, and we’re going to reuse that environment for this post. We’ll also use the same dataset that we’ve been using, which contains information about individual customers. This dataset was originally designed to predict Income based on a number of factors. However, we left Income out of the dataset a few posts back for reasons that were important then, so we’re going to use it to predict “Hours Per Week” instead.
Check it out. And Brad’s not joking when he says the resulting model is terrible. But that’s okay, because it was never about the model.