GPU-enabled compute

Note

Some GPU-enabled instance types are in Beta and are marked as such in the drop-down list when you select the driver and worker types during compute creation.

Overview

Databricks supports compute accelerated with graphics processing units (GPUs). This article describes how to create compute with GPU-enabled instances, and which GPU drivers and libraries are installed on those instances.

To learn more about deep learning on GPU-enabled compute, see Deep learning.

Create a GPU compute

Creating GPU compute is similar to creating any other compute. Keep the following in mind:

  • The Databricks Runtime Version must be a GPU-enabled version, such as Runtime 13.3 LTS ML (GPU, Scala 2.12.15, Spark 3.4.1).

  • The Worker Type and Driver Type must be GPU instance types.
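The two requirements above can be checked before a compute specification is submitted. The following is a minimal sketch, not an official client: the helper name build_gpu_compute_spec is hypothetical, the field names follow the Clusters API create request as commonly documented, and the runtime key 13.3.x-gpu-ml-scala2.12 is an assumed identifier for the runtime named above that you should confirm in your workspace.

```python
# Prefixes of the GPU instance families listed in this article.
GPU_INSTANCE_PREFIXES = ("a2-ultragpu-", "a2-highgpu-", "a2-megagpu-", "g2-standard-")


def build_gpu_compute_spec(name, runtime_key, node_type, num_workers):
    """Return a dict shaped like a compute create request (hypothetical helper)."""
    if not node_type.startswith(GPU_INSTANCE_PREFIXES):
        raise ValueError(f"{node_type} is not a GPU instance type listed in this article")
    # Heuristic only: assumes GPU runtime keys contain the substring "gpu".
    if "gpu" not in runtime_key:
        raise ValueError(f"{runtime_key} does not look like a GPU-enabled runtime")
    return {
        "cluster_name": name,
        "spark_version": runtime_key,
        "node_type_id": node_type,         # worker type
        "driver_node_type_id": node_type,  # driver type, also a GPU instance
        "num_workers": num_workers,
    }


# Assumed runtime key; verify the exact string in your workspace.
spec = build_gpu_compute_spec("gpu-demo", "13.3.x-gpu-ml-scala2.12", "a2-highgpu-1g", 2)
```

The sketch uses the same instance type for driver and workers only for brevity; any GPU instance type can be used for either role.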

Supported instance types

Databricks supports the following instance types:

  • GPU Type: NVIDIA A100 80GB GPU

Instance Name     Number of GPUs   GPU Memory   vCPUs   CPU Memory
a2-ultragpu-1g    1                80GB         12      170GB
a2-ultragpu-2g    2                80GB x 2     24      340GB
a2-ultragpu-4g    4                80GB x 4     48      680GB
a2-ultragpu-8g    8                80GB x 8     96      1360GB

  • GPU Type: NVIDIA A100 40GB GPU

Instance Name     Number of GPUs   GPU Memory   vCPUs   CPU Memory
a2-highgpu-1g     1                40GB         12      85GB
a2-highgpu-2g     2                40GB x 2     24      170GB
a2-highgpu-4g     4                40GB x 4     48      340GB
a2-highgpu-8g     8                40GB x 8     96      680GB
a2-megagpu-16g    16               40GB x 16    96      1360GB

  • GPU Type: NVIDIA L4 GPU

Instance Name     Number of GPUs   GPU Memory   vCPUs   CPU Memory
g2-standard-4     1                24GB         4       16GB
g2-standard-8     1                24GB         8       32GB
g2-standard-12    1                24GB         12      48GB
g2-standard-16    1                24GB         16      64GB
g2-standard-24    2                24GB x 2     24      96GB
g2-standard-32    1                24GB         32      128GB
g2-standard-48    4                24GB x 4     48      192GB
g2-standard-96    8                24GB x 8     96      384GB

See GCP accelerator-optimized machines for more information on these instance types, and GCP regions to check where these instances are available. Your Databricks deployment must reside in a supported region to launch GPU-enabled compute.
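When sizing compute for a model, the "N GB x M" entries in the tables above reduce to simple arithmetic. The sketch below transcribes only the GPU count and per-GPU memory from those tables; the dictionary and helper names are hypothetical and exist only for illustration.

```python
# (number of GPUs, memory per GPU in GB), transcribed from the tables above.
GPU_INSTANCES = {
    "a2-ultragpu-1g": (1, 80), "a2-ultragpu-2g": (2, 80),
    "a2-ultragpu-4g": (4, 80), "a2-ultragpu-8g": (8, 80),
    "a2-highgpu-1g": (1, 40), "a2-highgpu-2g": (2, 40),
    "a2-highgpu-4g": (4, 40), "a2-highgpu-8g": (8, 40),
    "a2-megagpu-16g": (16, 40),
    "g2-standard-4": (1, 24), "g2-standard-8": (1, 24),
    "g2-standard-12": (1, 24), "g2-standard-16": (1, 24),
    "g2-standard-24": (2, 24), "g2-standard-32": (1, 24),
    "g2-standard-48": (4, 24), "g2-standard-96": (8, 24),
}


def total_gpu_memory_gb(instance_name):
    """Total GPU memory on one instance: GPU count times memory per GPU."""
    gpu_count, gb_per_gpu = GPU_INSTANCES[instance_name]
    return gpu_count * gb_per_gpu


def instances_with_at_least(min_total_gb):
    """Instance names whose total GPU memory meets the threshold, smallest first."""
    matches = [n for n in GPU_INSTANCES if total_gpu_memory_gb(n) >= min_total_gb]
    return sorted(matches, key=total_gpu_memory_gb)
```

Total GPU memory is only a first filter. Whether a model can actually use memory across several GPUs depends on the training or inference framework.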

GPU scheduling

GPU scheduling distributes Spark tasks efficiently across a large number of GPUs.

Databricks Runtime 9.1 LTS ML and above support GPU-aware scheduling from Apache Spark 3.0. Databricks preconfigures it on GPU compute for you.

Note

GPU scheduling is not enabled on single-node compute.

GPU scheduling for AI and ML

spark.task.resource.gpu.amount is the only Spark config related to GPU-aware scheduling that you may need to configure. The default configuration uses one GPU per task, which is a good baseline for distributed inference workloads and distributed training if you use all GPU nodes.

To reduce communication overhead during distributed training, Databricks recommends setting spark.task.resource.gpu.amount to the number of GPUs per worker node in the compute Spark configuration. This creates only one Spark task for each Spark worker and assigns all GPUs in that worker node to the same task.
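As an illustration of the distributed training recommendation, the following sketch derives the configuration entry from the number of GPUs per worker and renders it as a "key value" line of the kind entered in the compute Spark configuration. Both helper names are hypothetical.

```python
def training_gpu_conf(gpus_per_worker):
    """Spark config entry that assigns every GPU on a worker to one task."""
    if not isinstance(gpus_per_worker, int) or gpus_per_worker < 1:
        raise ValueError("gpus_per_worker must be a positive integer")
    return {"spark.task.resource.gpu.amount": str(gpus_per_worker)}


def as_spark_config_lines(conf):
    """Render entries as 'key value' lines, one per entry."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# Example: workers with 4 GPUs each, such as a2-highgpu-4g.
conf = training_gpu_conf(4)
print(as_spark_config_lines(conf))  # prints: spark.task.resource.gpu.amount 4
```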

To increase parallelization for distributed deep learning inference, you can set spark.task.resource.gpu.amount to fractional values such as 1/2, 1/3, 1/4, … 1/N. This creates more Spark tasks than there are GPUs, allowing more concurrent tasks to handle inference requests in parallel. For example, if you set spark.task.resource.gpu.amount to 0.5, 0.33, or 0.25, then the available GPUs are shared among double, triple, or quadruple the number of tasks, respectively.
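The effect of a fractional setting on the number of concurrent tasks can be estimated with a short calculation. This sketch assumes that each GPU serves floor(1 / amount) tasks when the amount is below 1, which is how fractional resource amounts are understood to work in Spark 3.x; confirm against the Spark version in your runtime. The function name is hypothetical, and the estimate ignores other limits such as available CPU cores.

```python
import math


def concurrent_gpu_tasks(num_gpus, gpu_amount):
    """Estimate how many tasks can hold GPU resources at the same time."""
    if gpu_amount <= 0:
        raise ValueError("gpu_amount must be positive")
    if gpu_amount >= 1:
        # Whole GPUs per task: each task takes int(gpu_amount) GPUs.
        return num_gpus // int(gpu_amount)
    # Fraction of a GPU per task: several tasks share each GPU.
    return num_gpus * math.floor(1 / gpu_amount)


# Eight GPUs in total, with the fractional values from the paragraph above.
for amount in (1, 0.5, 0.33, 0.25):
    print(amount, concurrent_gpu_tasks(8, amount))
```

With eight GPUs this yields 8, 16, 24, and 32 tasks, matching the doubling, tripling, and quadrupling described above.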

GPU indices

For PySpark tasks, Databricks automatically remaps assigned GPU(s) to zero-based indices. For the default configuration that uses one GPU per task, you can use the default GPU without checking which GPU is assigned to the task. If you set multiple GPUs per task, for example, 4, the indices of the assigned GPUs are always 0, 1, 2, and 3. If you do need the physical indices of the assigned GPUs, you can get them from the CUDA_VISIBLE_DEVICES environment variable.
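A minimal sketch of reading the physical indices from inside a Python task follows. It assumes CUDA_VISIBLE_DEVICES holds a comma-separated list whose order matches the zero-based indices, which is the standard CUDA convention. The helper names are hypothetical, and the env parameter exists only so the function can be tried outside a task.

```python
import os


def physical_gpu_ids(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES as a list of strings."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


def logical_to_physical(env=None):
    """Map each zero-based index seen by the task to its physical identifier."""
    return dict(enumerate(physical_gpu_ids(env)))


# Example with a made-up value: a task that was assigned physical GPUs 4 to 7.
print(logical_to_physical({"CUDA_VISIBLE_DEVICES": "4,5,6,7"}))
```

Identifiers are kept as strings because CUDA_VISIBLE_DEVICES can, in general, contain device UUIDs as well as integer indices.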

If you use Scala, you can get the indices of the GPUs assigned to the task from TaskContext.resources().get("gpu").

NVIDIA GPU driver, CUDA, and cuDNN

Databricks installs the NVIDIA driver and libraries required to use GPUs on Spark driver and worker instances:

  • CUDA Toolkit, installed under /usr/local/cuda.

  • cuDNN: NVIDIA CUDA Deep Neural Network Library.

  • NCCL: NVIDIA Collective Communications Library.

The version of the NVIDIA driver included is 525.105.17, which supports CUDA 11.0.

For the versions of the libraries included, see the release notes for the specific Databricks Runtime version you are using.

Note

This software contains source code provided by NVIDIA Corporation. Specifically, to support GPUs, Databricks includes code from CUDA Samples.

NVIDIA End User License Agreement (EULA)

When you select a GPU-enabled “Databricks Runtime Version” in Databricks, you implicitly agree to the terms and conditions outlined in the NVIDIA EULA with respect to the CUDA, cuDNN, and Tesla libraries, and the NVIDIA End User License Agreement (with NCCL Supplement) for the NCCL library.

Limitations

  • You cannot create a new GPU compute when you schedule a job from a notebook. You can run a job on an existing GPU compute only if it was created in the new compute UI.

  • With Databricks on Google Cloud, commonly used NVIDIA executables like nvidia-smi are not included in the PATH environment variable. Instead, they are in /usr/local/nvidia/bin. For example, to use nvidia-smi you must use either the web terminal or %sh notebook magic commands to run /usr/local/nvidia/bin/nvidia-smi.

  • Monitoring compute metrics using Ganglia is not supported on Databricks on Google Cloud.
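Regarding the nvidia-smi limitation above, the executable can also be called from Python notebook code by its full path. The following is a sketch under that assumption: the helper names are hypothetical, and the function returns None when the binary is not present, for example on compute without GPUs.

```python
import os
import subprocess

# Full path from the limitation above; nvidia-smi is not on PATH.
NVIDIA_SMI = "/usr/local/nvidia/bin/nvidia-smi"


def nvidia_smi_command(*args):
    """Build the argument list for nvidia-smi using its full path."""
    return [NVIDIA_SMI, *args]


def run_nvidia_smi(*args):
    """Run nvidia-smi and return its standard output, or None if it is missing."""
    if not os.path.exists(NVIDIA_SMI):
        return None
    result = subprocess.run(
        nvidia_smi_command(*args), capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, run_nvidia_smi("--query-gpu=name,memory.total", "--format=csv") returns a CSV listing of the GPUs visible on the node where the code runs, which for notebook code is the driver.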