Some GPU-enabled instance types are in Beta and are marked as such in the drop-down list when you select the driver and worker types during cluster creation.
Databricks supports clusters accelerated with graphics processing units (GPUs). This article describes how to create clusters with GPU-enabled instances and describes the GPU drivers and libraries installed on those instances.
To learn more about deep learning on GPU-enabled clusters, see Deep learning.
Creating a GPU cluster is similar to creating any Spark cluster. You should keep in mind the following:
The Databricks Runtime Version must be a GPU-enabled version, such as Runtime 13.3 LTS ML (GPU, Scala 2.12.15, Spark 3.4.1).
The Worker Type and Driver Type must be GPU instance types.
For single-machine workflows without Spark, you can set the number of workers to zero.
Databricks supports the following instance types:
A2 machine family: a2-highgpu-1g, a2-highgpu-2g, a2-highgpu-4g, a2-highgpu-8g, a2-megagpu-16g
See GCP accelerator-optimized machines for more information on these instance types, and GCP regions to check where these instances are available. Your Databricks deployment must reside in a supported region to launch GPU-enabled clusters.
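The requirements above can be checked before a cluster is requested. The following is a minimal sketch, not Databricks tooling: the field names (`node_type_id`, `driver_node_type_id`, `num_workers`) are assumptions modeled on a typical cluster specification, and `check_gpu_cluster_spec` is a hypothetical helper. Only the instance type names come from the list above.

```python
# Instance types listed in this article (A2 machine family).
SUPPORTED_GPU_TYPES = {
    "a2-highgpu-1g",
    "a2-highgpu-2g",
    "a2-highgpu-4g",
    "a2-highgpu-8g",
    "a2-megagpu-16g",
}


def check_gpu_cluster_spec(spec):
    """Return a list of problems found in a hypothetical cluster spec dict."""
    problems = []
    # Both the worker type and the driver type must be GPU instance types.
    for key in ("node_type_id", "driver_node_type_id"):
        if spec.get(key) not in SUPPORTED_GPU_TYPES:
            problems.append(
                f"{key} is not a supported GPU instance type: {spec.get(key)!r}"
            )
    # Zero workers is allowed for single-machine workflows without Spark.
    workers = spec.get("num_workers")
    if not isinstance(workers, int) or workers < 0:
        problems.append("num_workers must be a non-negative integer")
    return problems


# Hypothetical spec for a single-machine workflow (assumed field names).
single_machine = {
    "node_type_id": "a2-highgpu-1g",
    "driver_node_type_id": "a2-highgpu-1g",
    "num_workers": 0,
}
print(check_gpu_cluster_spec(single_machine))
```

The check does not verify the runtime version or regional availability; those still have to be confirmed in the cluster creation UI.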
Databricks Runtime 9.1 LTS ML and above support GPU-aware scheduling from Apache Spark 3.0. Databricks preconfigures it on GPU clusters for you.
GPU scheduling is not enabled on Single Node clusters.
spark.task.resource.gpu.amount is the only Spark config related to GPU-aware scheduling that you might need to change.
The default configuration uses one GPU per task, which is ideal for distributed inference workloads and for distributed training if you use all GPU nodes.
To do distributed training on a subset of nodes, which helps reduce communication overhead during distributed training, Databricks recommends setting spark.task.resource.gpu.amount to the number of GPUs per worker node in the cluster Spark configuration.
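To illustrate the effect of spark.task.resource.gpu.amount: Spark limits how many tasks run concurrently on an executor by dividing the GPUs available to the executor by the GPUs each task requests. The sketch below is plain arithmetic, not a Spark API. `concurrent_gpu_tasks` and `spark_conf_line` are hypothetical helpers, and the sketch assumes whole-number GPU amounts, one executor per worker node, and a space-separated `key value` form for the cluster Spark configuration field.

```python
def concurrent_gpu_tasks(gpus_per_worker, gpu_amount_per_task):
    """Number of GPU tasks one worker can run at the same time."""
    if gpus_per_worker < 1 or gpu_amount_per_task < 1:
        raise ValueError("GPU counts must be positive whole numbers")
    return gpus_per_worker // gpu_amount_per_task


def spark_conf_line(gpu_amount_per_task):
    """Format the setting as a 'key value' line (assumed config format)."""
    return f"spark.task.resource.gpu.amount {gpu_amount_per_task}"


gpus_per_worker = 4  # for example, an a2-highgpu-4g worker

# Default: one GPU per task, so each worker runs four tasks at once.
print(concurrent_gpu_tasks(gpus_per_worker, 1))

# Recommended for training on a subset of nodes: one task per worker,
# and that task is assigned every GPU on its node.
print(concurrent_gpu_tasks(gpus_per_worker, gpus_per_worker))
print(spark_conf_line(gpus_per_worker))
```

With the amount equal to the GPUs per node, a training job of N tasks occupies exactly N nodes, which is what keeps cross-node communication down.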
For PySpark tasks, Databricks automatically remaps assigned GPU(s) to indices 0, 1, ….
Under the default configuration that uses one GPU per task, your code can simply use the default GPU without checking which GPU is assigned to the task.
If you set multiple GPUs per task, for example 4, your code can assume that the indices of the assigned GPUs are always 0, 1, 2, and 3. If you do need the physical indices of the assigned GPUs, you can get them from the CUDA_VISIBLE_DEVICES environment variable.
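Reading CUDA_VISIBLE_DEVICES from Python takes only a few lines. This is a minimal sketch, assuming the variable holds a comma-separated list of device identifiers. `physical_gpu_ids` is a hypothetical helper name, and it returns strings because entries are not guaranteed to be integers (CUDA also accepts device UUIDs in this variable).

```python
import os


def physical_gpu_ids(environ=None):
    """Return the entries of CUDA_VISIBLE_DEVICES as a list of strings."""
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES", "")
    # Skip empty pieces so an unset or empty variable yields an empty list.
    return [part.strip() for part in raw.split(",") if part.strip()]


# Inside a task, this reports the physical devices behind indices 0, 1, ...
print(physical_gpu_ids())
```

Position in the returned list corresponds to the remapped index: the first entry is the device your code sees as GPU 0, the second as GPU 1, and so on.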
If you use Scala, you can get the indices of the GPUs assigned to the task from
Databricks installs the NVIDIA driver and libraries required to use GPUs on Spark driver and worker instances:
CUDA Toolkit, installed under
cuDNN: NVIDIA CUDA Deep Neural Network Library.
NCCL: NVIDIA Collective Communications Library.
The version of the NVIDIA driver included is 525.105.17, which supports CUDA 11.0.
For the versions of the libraries included, see the release notes for the specific Databricks Runtime version you are using.
This software contains source code provided by NVIDIA Corporation. Specifically, to support GPUs, Databricks includes code from CUDA Samples.
When you select a GPU-enabled “Databricks Runtime Version” in Databricks, you implicitly agree to the terms and conditions outlined in the NVIDIA EULA with respect to the CUDA, cuDNN, and Tesla libraries, and the NVIDIA End User License Agreement (with NCCL Supplement) for the NCCL library.
With Databricks on Google Cloud, commonly used NVIDIA executables like nvidia-smi are not included in the PATH environment variable. Instead, they are in /usr/local/nvidia/bin. For example, to use nvidia-smi, you must run it from that directory using either the web terminal or %sh notebook magic commands.
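As an illustration of calling an NVIDIA executable by its full location rather than relying on PATH, the sketch below builds the path from the /usr/local/nvidia/bin directory and invokes it with the standard `subprocess` module. `nvidia_tool_path` and `run_nvidia_smi` are hypothetical helpers, and the call only succeeds on a machine where the NVIDIA driver is installed at that location.

```python
import os
import posixpath
import subprocess

NVIDIA_BIN_DIR = "/usr/local/nvidia/bin"  # directory named in this article


def nvidia_tool_path(name, bin_dir=NVIDIA_BIN_DIR):
    """Build the full path to an NVIDIA executable that is not on PATH."""
    # posixpath keeps forward slashes; cluster nodes are assumed to be Linux.
    return posixpath.join(bin_dir, name)


def run_nvidia_smi():
    """Run nvidia-smi by full path and return its text output."""
    result = subprocess.run(
        [nvidia_tool_path("nvidia-smi")],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Only attempt the call where the executable actually exists.
if os.path.exists(nvidia_tool_path("nvidia-smi")):
    print(run_nvidia_smi())
```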
Monitoring cluster metrics using Ganglia is not supported on Databricks on Google Cloud.