Distributed training of XGBoost models using sparkdl.xgboost
Preview
This feature is in Public Preview.
Note
sparkdl.xgboost
is deprecated starting with Databricks Runtime 12.0 ML, and is removed in Databricks Runtime 13.0 ML and above. For information about migrating your workloads to xgboost.spark
, see Migration guide for the deprecated sparkdl.xgboost module.
Databricks Runtime ML includes PySpark estimators based on the Python xgboost
package, sparkdl.xgboost.XgboostRegressor
and sparkdl.xgboost.XgboostClassifier
. You can create an ML pipeline based on these estimators. For more information, see XGBoost for PySpark Pipeline.
Databricks strongly recommends that sparkdl.xgboost
users use Databricks Runtime 11.3 LTS ML or above. Previous Databricks Runtime versions are affected by bugs in older versions of sparkdl.xgboost
.
Note
The
sparkdl.xgboost
module is deprecated since Databricks Runtime 12.0 ML. Databricks recommends that you migrate your code to use thexgboost.spark
module instead. See the migration guide.The following parameters from the
xgboost
package are not supported:gpu_id
,output_margin
,validate_features
.The parameters
sample_weight
,eval_set
, andsample_weight_eval_set
are not supported. Instead, use the parametersweightCol
andvalidationIndicatorCol
. See XGBoost for PySpark Pipeline for details.The parameters
base_margin
, andbase_margin_eval_set
are not supported. Use the parameterbaseMarginCol
instead. See XGBoost for PySpark Pipeline for details.The parameter
missing
has different semantics from thexgboost
package. In thexgboost
package, the zero values in a SciPy sparse matrix are treated as missing values regardless of the value ofmissing
. For the PySpark estimators in thesparkdl
package, zero values in a Spark sparse vector are not treated as missing values unless you setmissing=0
. If you have a sparse training dataset (most feature values are missing), Databricks recommends settingmissing=0
to reduce memory consumption and achieve better performance.
Distributed training
Databricks Runtime ML supports distributed XGBoost training using the num_workers
parameter. To use distributed training, create a classifier or regressor and set num_workers
to a value less than or equal to the total number of Spark task slots on your cluster. To use the all Spark task slots, set num_workers=sc.defaultParallelism
.
For example:
classifier = XgboostClassifier(num_workers=sc.defaultParallelism)
regressor = XgboostRegressor(num_workers=sc.defaultParallelism)
Limitations of distributed training
You cannot use
mlflow.xgboost.autolog
with distributed XGBoost.You cannot use
baseMarginCol
with distributed XGBoost.You cannot use distributed XGBoost on an cluster with autoscaling enabled. See Enable autoscaling for instructions to disable autoscaling.
GPU training
Note
Databricks Runtime 11.3 LTS ML includes XGBoost 1.6.1, which does not support GPU clusters with compute capability 5.2 and below.
Databricks Runtime 9.1 LTS ML and above support GPU clusters for XGBoost training. To use a GPU cluster, set use_gpu
to True
.
For example:
classifier = XgboostClassifier(num_workers=N, use_gpu=True)
regressor = XgboostRegressor(num_workers=N, use_gpu=True)
Troubleshooting
During multi-node training, if you encounter a NCCL failure: remote process exited or there was a network error
message, it typically indicates a problem with network communication among GPUs. This issue arises when NCCL (NVIDIA Collective Communications Library) cannot use certain network interfaces for GPU communication.
To resolve, set the cluster’s sparkConf for spark.executorEnv.NCCL_SOCKET_IFNAME
to eth
. This essentially sets the environment variable NCCL_SOCKET_IFNAME
to eth
for all of the workers in a node.