In addition to single-machine training algorithms such as those from scikit-learn, you can use Hyperopt with distributed training algorithms. In this scenario, Hyperopt generates trials with different hyperparameter settings on the driver node. Each trial is executed from the driver node, giving it access to the full cluster resources. This setup works with any distributed machine learning algorithm or library, including Apache Spark MLlib and HorovodRunner.
When you use Hyperopt with distributed training algorithms, do not pass a trials argument to fmin(), and specifically, do not use the SparkTrials class. SparkTrials is designed to distribute trials for algorithms that are not themselves distributed. With distributed training algorithms, use the default Trials class, which runs on the cluster driver. Hyperopt evaluates each trial on the driver node so that the ML algorithm itself can initiate distributed training.
Databricks does not support automatic logging to MLflow with the Trials class. When using distributed training algorithms, you must manually call MLflow to log Hyperopt trials.
The example notebook shows how to use Hyperopt to tune MLlib’s distributed training algorithms.
HorovodRunner is a general API used to run distributed deep learning workloads on Databricks. HorovodRunner integrates Horovod with Spark’s barrier mode to provide higher stability for long-running deep learning training jobs on Spark.
The example notebook shows how to use Hyperopt to tune hyperparameters for distributed deep learning training with HorovodRunner.