We’ve learned a lot by working with customers using SparkML, both internal and external to Microsoft. Customers have found Spark to be a powerful platform for building scalable ML models. However, they struggle with low-level APIs, for example to index strings, assemble feature vectors and coerce data into a layout expected by machine learning algorithms. Microsoft Machine Learning for Apache Spark (MMLSpark) simplifies many of these common tasks for building models in PySpark, making you more productive and letting you focus on the data science.
The library provides simplified consistent APIs for handling different types of data such as text or categoricals. Consider, for example, a DataFrame that contains strings and numeric values from the Adult Census Income dataset, where “income” is the prediction target.
It’s an open source project as well, so that barrier to entry is lowered significantly.