Distributed training of XGBoost models using sparkdl.xgboost
Preview
This feature is in Public Preview.
Note
sparkdl.xgboost is deprecated starting with Databricks Runtime 12.0 ML, and is removed in Databricks Runtime 13.0 ML and above. For information about migrating your workloads to xgboost.spark, see Migration guide for the deprecated sparkdl.xgboost module.
Databricks Runtime ML includes PySpark estimators based on the Python xgboost package: sparkdl.xgboost.XgboostRegressor and sparkdl.xgboost.XgboostClassifier. You can create an ML pipeline based on these estimators. For more information, see XGBoost for PySpark Pipeline.
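For example, the following minimal sketch wraps XgboostClassifier in a Spark ML pipeline. The DataFrame df, the column names, and the hyperparameter values are illustrative assumptions; the estimator is assumed to use the standard Spark ML featuresCol and labelCol parameters, and the hyperparameters follow the xgboost package's scikit-learn-style names:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from sparkdl.xgboost import XgboostClassifier

# Assemble raw feature columns into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")

# XGBoost classifier as a pipeline stage; column names and hyperparameters are illustrative.
xgb = XgboostClassifier(featuresCol="features", labelCol="label", max_depth=5, n_estimators=100)

pipeline = Pipeline(stages=[assembler, xgb])
model = pipeline.fit(df)               # df: hypothetical DataFrame with f1, f2, f3, and label columns
predictions = model.transform(df)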
Databricks strongly recommends that sparkdl.xgboost users use Databricks Runtime 11.3 LTS ML or above. Previous Databricks Runtime versions are affected by bugs in older versions of sparkdl.xgboost.
Note
- The sparkdl.xgboost module is deprecated since Databricks Runtime 12.0 ML. Databricks recommends that you migrate your code to use the xgboost.spark module instead. See the migration guide.
- The following parameters from the xgboost package are not supported: gpu_id, output_margin, validate_features.
- The parameters sample_weight, eval_set, and sample_weight_eval_set are not supported. Instead, use the parameters weightCol and validationIndicatorCol, as shown in the sketch after this note. See XGBoost for PySpark Pipeline for details.
- The parameters base_margin and base_margin_eval_set are not supported. Use the parameter baseMarginCol instead. See XGBoost for PySpark Pipeline for details.
- The parameter missing has different semantics from the xgboost package. In the xgboost package, zero values in a SciPy sparse matrix are treated as missing values regardless of the value of missing. For the PySpark estimators in the sparkdl package, zero values in a Spark sparse vector are not treated as missing values unless you set missing=0. If you have a sparse training dataset (most feature values are missing), Databricks recommends setting missing=0 to reduce memory consumption and achieve better performance.
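The following minimal sketch shows how these replacement parameters might be passed to the estimator. The column names weight and is_val and the DataFrame train_df are illustrative assumptions:
from sparkdl.xgboost import XgboostClassifier

# weightCol replaces sample_weight; validationIndicatorCol replaces eval_set and sample_weight_eval_set.
# missing=0 treats zeros in sparse vectors as missing, which is recommended for sparse datasets.
classifier = XgboostClassifier(
    featuresCol="features",
    labelCol="label",
    weightCol="weight",                # per-row training weights (illustrative column name)
    validationIndicatorCol="is_val",   # boolean column marking validation rows (illustrative column name)
    missing=0,
)
model = classifier.fit(train_df)       # train_df: hypothetical training DataFrame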
Distributed training
Databricks Runtime ML supports distributed XGBoost training using the num_workers parameter. To use distributed training, create a classifier or regressor and set num_workers to a value less than or equal to the number of workers on your cluster.
For example:
from sparkdl.xgboost import XgboostClassifier, XgboostRegressor

classifier = XgboostClassifier(num_workers=N)  # N <= number of workers on the cluster
regressor = XgboostRegressor(num_workers=N)
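Training and inference then follow the standard Spark ML estimator pattern; train_df and test_df below are hypothetical DataFrames with a features vector column and a label column:
model = classifier.fit(train_df)        # training is distributed across the configured workers
predictions = model.transform(test_df)  # returns a DataFrame with prediction columns appended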
Limitations of distributed training
- You cannot use mlflow.xgboost.autolog with distributed XGBoost; a manual logging sketch follows this list.
- You cannot use baseMarginCol with distributed XGBoost.
- You cannot use distributed XGBoost on a cluster with autoscaling enabled. See Enable autoscaling for instructions to disable autoscaling.
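Because mlflow.xgboost.autolog does not cover distributed training, one option is to log parameters and metrics yourself with the standard MLflow tracking API. This is a minimal sketch; the DataFrames train_df and val_df, the column names, and the hyperparameter values are illustrative assumptions, and the model output is assumed to include a rawPrediction column:
import mlflow
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from sparkdl.xgboost import XgboostClassifier

with mlflow.start_run():
    classifier = XgboostClassifier(num_workers=4, max_depth=6)
    model = classifier.fit(train_df)          # train_df: hypothetical training DataFrame
    mlflow.log_param("num_workers", 4)
    mlflow.log_param("max_depth", 6)

    # Evaluate on a held-out DataFrame and log the metric manually.
    evaluator = BinaryClassificationEvaluator(labelCol="label")
    mlflow.log_metric("val_auc", evaluator.evaluate(model.transform(val_df)))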
GPU training
Note
Databricks Runtime 11.3 LTS ML includes XGBoost 1.6.1, which does not support GPU clusters with compute capability 5.2 and below.
Databricks Runtime 9.1 LTS ML and above support GPU clusters for XGBoost training. To use a GPU cluster, set use_gpu to True.
For example:
classifier = XgboostClassifier(num_workers=N, use_gpu=True)
regressor = XgboostRegressor(num_workers=N, use_gpu=True)