Matthieu Lamairesse shows us how we can use Dask to perform distributed ML model training:
Dask is an open-source parallel computing framework written natively in Python (initially released in 2014). It has a significant following and strong support, largely owing to its tight integration with the popular Python ML triumvirate of NumPy, Pandas, and Scikit-learn.
Why Dask over other distributed machine learning frameworks?
In the context of this article, it's Dask's tight integration with Scikit-learn's joblib parallel computing library that matters: it lets us distribute Scikit-learn code with (almost) no code changes, making Dask a very interesting framework for accelerating ML training.
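To make that "almost no code change" claim concrete, here is a minimal sketch (not taken from the article) of the standard pattern: you wrap an ordinary Scikit-learn `fit` call in joblib's Dask backend, and Scikit-learn's internal parallelism is shipped to the cluster's workers. The local `Client()`, the `SVC` grid search, and all parameter values are illustrative assumptions:

```python
# A minimal sketch of distributing Scikit-learn work over Dask via joblib.
# Assumes a local Dask cluster; in practice you would pass a scheduler
# address to Client() to connect to a real cluster.
import joblib
from dask.distributed import Client  # importing this registers the "dask" joblib backend
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # spins up a local cluster for demonstration purposes

# Toy dataset; any Scikit-learn-compatible data works here.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["rbf", "linear"]}
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)

# The only Dask-specific change: run the fit inside the Dask joblib backend
# so the cross-validation folds are evaluated on the cluster's workers.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)
```

Everything except the `with joblib.parallel_backend("dask"):` wrapper (and the `Client` setup) is plain single-machine Scikit-learn, which is the appeal being described.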
Click through for an interesting article and an example of using this on Cloudera’s ML platform.