Auto ML With SQL Server 2019 Big Data Clusters

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem:

H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, however, the latter follows the conventions of Apache Spark’s MLlib library and allows you to build machine learning pipelines that include MLlib transformers. We will focus on the latter API in this post.

H2OAutoML supports classification and regression. The ML models built and tuned by H2OAutoML include Random Forests, Gradient Boosting Machines, Deep Neural Nets, Generalized Linear Models, and Stacked Ensembles.

The post only has a few lines of code but there are a lot of working parts under the surface.

Related Posts

The Zen Of Airflow

Bas Harenslak shows how you can think of The Zen of Python as it applies to Apache Airflow: Apache Airflow is a Python framework for programmatically creating workflows in DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. This allows for concise and flexible scripts but can also be the downside of […]

Read More

Tips For Using PolyBase With Cloudera QuickStart VM

I have a post on using Cloudera’s QuickStart VM with PolyBase: Here’s something which tripped me up a little bit while connecting to Cloudera using SQL Server. The data node name, instead of being quickstart.cloudera like the host name, is actually localhost. You can change this in /etc/cloudera-scm-agent/config.ini. Because PolyBase needs to have direct access to the data nodes, […]

Read More

Categories

January 2019
MTWTFSS
« Dec Feb »
 123456
78910111213
14151617181920
21222324252627
28293031