Learn how to train machine learning models using XGBoost in Databricks. Databricks Runtime for Machine Learning includes XGBoost libraries for both Python and Scala.
Versions of XGBoost 1.2.0 and lower have a bug that can cause the shared Spark context to be killed if XGBoost model training fails. The only way to recover is to restart the cluster. Databricks Runtime 7.5 ML and lower include a version of XGBoost that is affected by this bug. To install a different version of XGBoost, see Install XGBoost on Databricks.
You can train models using the Python xgboost package. This package supports only single-node workloads. To train a PySpark ML pipeline and take advantage of distributed training, see Distributed training of XGBoost models.
For distributed training of XGBoost models, Databricks includes PySpark estimators based on the xgboost package. Databricks also includes the Scala package xgboost4j. For details and example notebooks, see the following:
Distributed training of XGBoost models using xgboost.spark (Databricks Runtime 12.0 ML and above)
Distributed training of XGBoost models using sparkdl.xgboost (deprecated starting with Databricks Runtime 12.0 ML)