Use distributed training algorithms with Hyperopt

Note

The managed MLflow integration with Databricks on Google Cloud requires Databricks Runtime for Machine Learning 9.1 LTS or above.

In addition to single-machine training algorithms such as those from scikit-learn, you can use Hyperopt with distributed training algorithms. In this scenario, Hyperopt generates trials with different hyperparameter settings on the driver node. Each trial is launched from the driver node, so it has access to the full cluster resources. This setup works with any distributed machine learning algorithm or library, including Apache Spark MLlib and HorovodRunner.

When you use Hyperopt with distributed training algorithms, do not pass a trials argument to fmin(). In particular, do not use the SparkTrials class. SparkTrials is designed to distribute trials for algorithms that are not themselves distributed. With distributed training algorithms, use the default Trials class, which runs on the cluster driver. Because Hyperopt evaluates each trial on the driver node, the ML algorithm itself can initiate distributed training.
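As a minimal sketch of this pattern, the following calls fmin() without a trials argument. The train_and_evaluate function, the reg_param hyperparameter, and its search range are hypothetical stand-ins for a real distributed training call (for example an MLlib fit or a HorovodRunner job) and are not taken from the example notebooks.

```python
def train_and_evaluate(reg_param: float) -> float:
    """Hypothetical stand-in for a distributed training call.

    A real implementation would launch distributed training here and
    return a validation loss. This placeholder returns a simple quadratic.
    """
    return (reg_param - 0.1) ** 2


def objective(params: dict) -> float:
    """Run one trial on the driver and return its loss to Hyperopt."""
    return train_and_evaluate(params["reg_param"])


def run_search(max_evals: int = 10) -> dict:
    """Tune reg_param with Hyperopt using the default Trials class."""
    # Imported here so the objective above can be exercised without Hyperopt.
    from hyperopt import fmin, hp, tpe

    # Assumed search space: a single uniform hyperparameter.
    space = {"reg_param": hp.uniform("reg_param", 0.0, 1.0)}

    # No trials argument is passed, so fmin() uses the default Trials class
    # and evaluates every trial sequentially on the driver.
    return fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=max_evals)
```

Calling run_search() returns a dictionary holding the best value found for reg_param. Because trials run one at a time, the parallelism comes from the training call inside each trial.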

Note

Databricks does not support automatic logging to MLflow with the Trials class. When using distributed training algorithms, you must manually call MLflow to log trials for Hyperopt.
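One way to do this manual logging is to wrap the objective function so that each trial opens its own nested MLflow run. The sketch below is an assumption about how you might structure it, not a Databricks-provided helper: make_logged_objective, train_fn, and the tracker parameter are hypothetical names, and the metric name "loss" is arbitrary. The tracker defaults to the mlflow module and only uses its start_run, log_params, and log_metric functions.

```python
def make_logged_objective(train_fn, tracker=None):
    """Wrap train_fn so each Hyperopt trial is logged as a nested MLflow run.

    train_fn takes a dict of hyperparameters and returns a loss.
    tracker is assumed to expose start_run, log_params, and log_metric
    the way the mlflow module does; it defaults to mlflow itself.
    """
    if tracker is None:
        import mlflow  # imported lazily; only needed when no tracker is given

        tracker = mlflow

    def objective(params):
        # One nested run per trial, holding its hyperparameters and loss.
        with tracker.start_run(nested=True):
            tracker.log_params(params)
            loss = train_fn(params)
            tracker.log_metric("loss", loss)
        return loss

    return objective
```

To group the trials, you would typically call fmin() with the wrapped objective inside an outer mlflow.start_run() block, so that each trial appears as a child of that parent run.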

Use Hyperopt with MLlib algorithms

The example notebook shows how to use Hyperopt to tune MLlib’s distributed training algorithms.

Hyperopt and MLlib distributed training notebook


Use Hyperopt with HorovodRunner

HorovodRunner is a general API used to run distributed deep learning workloads on Databricks. HorovodRunner integrates Horovod with Spark’s barrier mode to provide higher stability for long-running deep learning training jobs on Spark.

The example notebook shows how to use Hyperopt to tune distributed training for deep learning based on HorovodRunner.

Hyperopt and HorovodRunner distributed training notebook
