Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit. These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info.
Databricks Runtime 5.3 and 5.3 ML and above support automatic MLflow tracking for MLlib tuning in Python.
With this feature, PySpark CrossValidator and TrainValidationSplit will automatically log to MLflow, organizing runs in a hierarchy and logging hyperparameters and the evaluation metric. For example, calling CrossValidator.fit() will log one parent run. Under this run, CrossValidator will log one child run for each hyperparameter setting, and each of those child runs will include the hyperparameter setting and the evaluation metric. Comparing these runs in the MLflow UI helps with visualizing the effect of tuning each hyperparameter.
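The run hierarchy described above can be pictured with a small, dependency-free sketch. It does not call Spark or MLflow. It only models the shape of what gets logged: one parent run, and one child run per point in the hyperparameter grid, each carrying its setting and a metric. The grid values, the `toy_metric` function, and the dictionary layout are all hypothetical and chosen for illustration; the parameter names merely resemble those of an MLlib estimator.

```python
from itertools import product

# Hypothetical grid for illustration only; nothing here calls Spark or MLflow.
param_grid = {
    "regParam": [0.01, 0.1],
    "elasticNetParam": [0.0, 0.5, 1.0],
}


def expand_grid(grid):
    """Return one dict per hyperparameter setting (the Cartesian product)."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def toy_metric(setting):
    # Stand-in for the evaluator's metric; a real tuning run would
    # train a model with this setting and score it on held-out data.
    return round(1.0 - setting["regParam"] - 0.1 * setting["elasticNetParam"], 4)


def build_run_tree(grid, evaluate):
    """Model the hierarchy: one parent run, one child run per setting."""
    children = [
        {"params": setting, "metric": evaluate(setting)}
        for setting in expand_grid(grid)
    ]
    return {"name": "CrossValidator.fit", "children": children}


run_tree = build_run_tree(param_grid, toy_metric)

# Assumes a larger metric is better, which depends on the evaluator in practice.
best = max(run_tree["children"], key=lambda child: child["metric"])
print(len(run_tree["children"]), "child runs; best setting:", best["params"])
```

With two values for one parameter and three for the other, the sketch produces six child runs under the single parent, which is the same count of child runs one would expect to see grouped under the parent in the MLflow UI for a grid of that size.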
Hyperparameter tuning is critical for some of the more complex algorithms like random forests, gradient boosting, and neural networks.