Create a cluster

This article explains the configuration options available for cluster creation. For other methods, see Clusters CLI (legacy), the Clusters API, and the Databricks Terraform provider.

Note

These instructions are for Unity Catalog enabled workspaces. For documentation on the non-Unity Catalog legacy UI, see Configure clusters.

The cluster creation UI lets you select the cluster configuration specifics described in the sections of this article.

Create a new cluster

To create a new cluster, click New > Cluster in your workspace sidebar. This takes you to the New compute page, where you will select your cluster’s specifications.

Note

The configuration options you see on this page will vary depending on the policies you have access to. If you don’t see a setting in your UI, it’s because your policy does not allow you to configure that setting.

Policies

Policies are a set of rules used by admins to limit the configuration options available to users when they create a cluster. To configure a cluster according to a policy, select a policy from the Policy dropdown.

Policies have access control lists that regulate which users and groups have access to the policies.

If a user doesn’t have the unrestricted cluster creation entitlement, then they can only create clusters using their granted policies.
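
If you create clusters with the Clusters API instead of the UI, a policy is applied by passing its ID in the create request. The following is a minimal sketch; the policy ID, runtime version, and node type are placeholders rather than values from your workspace:

{
  "cluster_name": "policy-governed-cluster",
  "policy_id": "ABCD000000000000",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "n1-standard-4",
  "num_workers": 2
}

A request whose settings conflict with a rule in the referenced policy fails validation.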

Access modes

Cluster access mode is a security feature that determines who can use a cluster and what data they can access via the cluster. When you create any cluster in Databricks, you must select an access mode.

Note

Databricks recommends that you use shared access mode for all workloads. Only use the single user access mode if your required functionality is not supported by shared access mode.

| Access Mode | Visible to user | UC Support | Supported Languages | Notes |
|---|---|---|---|---|
| Single user | Always | Yes | Python, SQL, Scala, R | Can be assigned to and used by a single user. See single user limitations. |
| Shared | Always (Premium plan required) | Yes | Python (on Databricks Runtime 11.1 and above), SQL, Scala (on Unity Catalog-enabled clusters using Databricks Runtime 13.3 and above) | Can be used by multiple users with data isolation among users. See shared limitations. |
| No Isolation Shared | Admins can hide this cluster type by enforcing user isolation in the admin settings page. | No | Python, SQL, Scala, R | There is a related account-level setting for No Isolation Shared clusters. |
| Custom | Hidden (for all new clusters) | No | Python, SQL, Scala, R | This option is shown only if you have existing clusters without a specified access mode. |

You can upgrade an existing cluster to meet the requirements of Unity Catalog by setting its cluster access mode to Single User or Shared. For additional access mode limitations for Structured Streaming on Unity Catalog, see Structured Streaming support.
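
If you create clusters with the Clusters API, the access mode is expressed through the data_security_mode field, with SINGLE_USER corresponding to single user and USER_ISOLATION corresponding to shared. The following sketch shows only the relevant fields of a create request and assumes those names and values; the user email is a placeholder:

{
  "data_security_mode": "SINGLE_USER",
  "single_user_name": "user@example.com"
}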

Single user access mode limitations

  • To read from a view, you must have SELECT on all referenced tables and views.

  • Dynamic views are not supported.

  • You cannot use a single user cluster to query tables created by a Unity Catalog-enabled Delta Live Tables pipeline, including streaming tables and materialized views created in Databricks SQL. To query tables created by a Delta Live Tables pipeline, you must use a shared cluster using Databricks Runtime 13.1 and above.

Shared access mode limitations

  • Spark-submit jobs are not supported.

  • Databricks Runtime ML is not supported.

  • Cannot use R, RDD APIs, or clients that directly read data from cloud storage, such as DBUtils.

  • Can use Scala only on Databricks Runtime 13.3 and above.

  • Cannot use user-defined functions (UDFs), including UDAFs, UDTFs, Pandas on Spark (applyInPandas and mapInPandas), and Hive UDFs.

  • Must run commands on cluster nodes as a low-privilege user forbidden from accessing sensitive parts of the filesystem. In Databricks Runtime 11.3 and below, you can only create network connections to ports 80 and 443.

  • Cannot connect to the instance metadata service or any services running in the Databricks VPC.

Attempts to get around these restrictions will fail. These restrictions are in place so that users can't use the cluster to access data they aren't privileged to read.

Do init scripts and libraries work with Unity Catalog access modes?

In Databricks Runtime 13.3 LTS and above, libraries are supported on all access modes. Init scripts are only supported on single user access mode. Requirements and support vary. See Compute compatibility with libraries and init scripts.

Databricks Runtime versions

Databricks Runtime is the set of core components that run on your clusters. Select the runtime using the Databricks Runtime Version dropdown when you create or edit a cluster. For details on specific Databricks Runtime versions, see Databricks runtimes.

Which Databricks Runtime version should you use?

  • For all-purpose compute, Databricks recommends using the latest Databricks Runtime version. Using the most current version will ensure you have the latest optimizations and most up-to-date compatibility between your code and preloaded packages.

  • For job clusters running operational workloads, consider using the Long Term Support (LTS) Databricks Runtime version. Using the LTS version will ensure you don’t run into compatibility issues and can thoroughly test your workload before upgrading.

  • For advanced machine learning use cases, consider the specialized Databricks Runtime version.

All Databricks Runtime versions include Apache Spark. New versions add components and updates that improve usability, performance, and security.
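
In the Clusters API, the runtime is selected with the spark_version field of the create request. The version key below is illustrative only; use the Clusters API to list the version keys available in your workspace. Specialized runtimes, such as Databricks Runtime ML, typically appear as separate version keys.

{
  "spark_version": "13.3.x-scala2.12"
}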

Enable Photon acceleration

Photon is available on clusters running Databricks Runtime 9.1 LTS and above.

To enable or disable Photon acceleration, select the Use Photon Acceleration checkbox.
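
If you create the cluster through the Clusters API, Photon is controlled by the runtime_engine field; this sketch assumes that field name and its PHOTON value, shown as the relevant fragment of a create request:

{
  "runtime_engine": "PHOTON"
}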

Worker and driver node types

A cluster consists of one driver node and zero or more worker nodes. You can pick separate cloud provider instance types for the driver and worker nodes, although by default the driver node uses the same instance type as the worker node. Different families of instance types fit different use cases, such as memory-intensive or compute-intensive workloads.

Worker type

Databricks worker nodes run the Spark executors and other services required for the proper functioning of the cluster. When you distribute your workload with Spark, all the distributed processing happens on worker nodes. Databricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Databricks architecture.

Tip

To run a Spark job, you need at least one worker node. If a cluster has zero workers, you can run non-Spark commands on the driver node, but Spark commands will fail.

Worker node IP addresses

Databricks launches worker nodes with two private IP addresses each. The node’s primary private IP address hosts Databricks internal traffic. The secondary private IP address is used by the Spark container for intra-cluster communication. This model allows Databricks to provide isolation between multiple clusters in the same workspace.

Driver type

The driver node maintains state information of all notebooks attached to the cluster. The driver node also maintains the SparkContext, interprets all the commands you run from a notebook or a library on the cluster, and runs the Apache Spark master that coordinates with the Spark executors.

The default value of the driver node type is the same as the worker node type. You can choose a larger driver node type with more memory if you plan to collect() a lot of data from Spark workers and analyze it in the notebook.
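
In the Clusters API, the worker and driver instance types correspond to the node_type_id and driver_node_type_id fields; if driver_node_type_id is omitted, the driver uses the worker node type. The instance types below are placeholders chosen to illustrate a larger driver:

{
  "node_type_id": "n1-standard-4",
  "driver_node_type_id": "n1-highmem-8"
}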

Tip

Since the driver node maintains all of the state information of the notebooks attached, make sure to detach unused notebooks from the driver node.

GPU instance types

For computationally challenging tasks that demand high performance, like those associated with deep learning, Databricks supports clusters accelerated with graphics processing units (GPUs). For more information, see GPU-enabled clusters.

Use preemptible instances

A preemptible VM instance is an instance that you can create and run at a much lower price than normal instances. However, Google Cloud might stop (preempt) these instances if it requires access to those resources for other tasks. Preemptible instances use excess Google Compute Engine capacity, so their availability varies with usage.

When you create a new cluster, you can enable preemptible VM instances in two different ways:

  1. When you create a cluster using the UI, you can click Preemptible instances next to the Worker Type details.

  2. When you create an instance pool using the UI, you can set On-Demand/Preemptible to All preemptible, Preemptible with fallback GCP, or On demand GCP. If preemptible VM instances are not available, by default the cluster will fall back to using on-demand VM instances. To configure the fall-back behavior, set gcp_attributes.gcp_availability to PREEMPTIBLE_GCP or PREEMPTIBLE_WITH_FALLBACK_GCP. The default is ON_DEMAND_GCP.

{
  "instance_pool_name": "Preemptible w/o fallback API test",
  "node_type_id": "n1-highmem-4",
  "gcp_attributes": {
      "gcp_availability": "PREEMPTIBLE_GCP"
  }
}

Next, create a new cluster and set Pool to a preemptible instance pool.

Instance types with local SSDs

For the latest list of instance types, the prices of each, and the size of the local SSDs, see the GCP pricing estimator.

Instance types that have local SSDs are encrypted with default Google Cloud server-side encryption and automatically use disk caching for improved performance. Cache sizes on all instance types are set automatically, so you do not need to set the disk usage explicitly.

Configure local SSDs for your cluster

You can configure the number of local SSDs to attach to your cluster when you use the Clusters API to create your cluster.

To configure the number of local SSDs, set a value for local_ssd_count in the gcp_attributes object. Each instance type can only support a certain number of attached local SSDs. The value specified in local_ssd_count must be valid for both the driver and worker instance type. For more information, see the GCP doc for Local SSDs and machine types.
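
For example, the following fragment of a create-cluster request sets gcp_attributes.local_ssd_count; the instance type and count are placeholders, and the count must be supported by both the driver and worker instance type:

{
  "node_type_id": "n2-highmem-4",
  "gcp_attributes": {
      "local_ssd_count": 2
  }
}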

Enable autoscaling

When Enable autoscaling is checked, you can provide a minimum and maximum number of workers for the cluster. Databricks then chooses the appropriate number of workers required to run your job.

To set the minimum and maximum number of workers your cluster will autoscale between, use the Min workers and Max workers fields next to the Worker type dropdown.

If you don’t enable autoscaling, you will enter a fixed number of workers in the Workers field next to the Worker type dropdown.
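
In the Clusters API, the same choice is made by providing either an autoscale object with min_workers and max_workers or a fixed num_workers. A minimal sketch of the autoscaling variant, with placeholder values:

{
  "autoscale": {
      "min_workers": 2,
      "max_workers": 8
  }
}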

Note

When the cluster is running, the cluster detail page displays the number of allocated workers. You can compare the number of allocated workers with the worker configuration and make adjustments as needed.

Benefits of autoscaling

With autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during these phases of your job (and removes them when they’re no longer needed).

Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the cluster to match a workload. This applies especially to workloads whose requirements change over time (like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning requirements are unknown. Autoscaling thus offers two advantages:

  • Workloads can run faster compared to a constant-sized under-provisioned cluster.

  • Autoscaling clusters can reduce overall costs compared to a statically-sized cluster.

Depending on the constant size of the cluster and the workload, autoscaling gives you one or both of these benefits at the same time. The cluster size can go below the minimum number of workers selected when the cloud provider terminates instances. In this case, Databricks continuously retries to re-provision instances in order to maintain the minimum number of workers.

Note

Autoscaling is not available for spark-submit jobs.

Note

Compute auto-scaling has limitations scaling down cluster size for Structured Streaming workloads. Databricks recommends using Delta Live Tables with Enhanced Autoscaling for streaming workloads. See What is Enhanced Autoscaling?.

How autoscaling behaves

Workspaces in the Premium and Enterprise pricing plans use optimized autoscaling. Workspaces on the standard pricing plan use standard autoscaling.

Optimized autoscaling has the following characteristics:

  • Scales up from min to max in 2 steps.

  • Can scale down, even if the cluster is not idle, by looking at shuffle file state.

  • Scales down based on a percentage of current nodes.

  • On job clusters, scales down if the cluster is underutilized over the last 40 seconds.

  • On all-purpose clusters, scales down if the cluster is underutilized over the last 150 seconds.

  • The spark.databricks.aggressiveWindowDownS Spark configuration property specifies in seconds how often a cluster makes down-scaling decisions. Increasing the value causes a cluster to scale down more slowly. The maximum value is 600.

Standard autoscaling is used in standard plan workspaces. Standard autoscaling has the following characteristics:

  • Starts by adding 8 nodes. Then scales up exponentially, taking as many steps as required to reach the max.

  • Scales down when 90% of the nodes are not busy for 10 minutes and the cluster has been idle for at least 30 seconds.

  • Scales down exponentially, starting with 1 node.

Autoscaling with pools

If you are using an instance pool:

  • Make sure the cluster size requested is less than or equal to the minimum number of idle instances in the pool. If it is larger, cluster startup time will be equivalent to a cluster that doesn’t use a pool.

  • Make sure the maximum cluster size is less than or equal to the maximum capacity of the pool. If it is larger, the cluster creation will fail.

Autoscaling example

If you reconfigure a static cluster to be an autoscaling cluster, Databricks immediately resizes the cluster within the minimum and maximum bounds and then starts autoscaling. As an example, the following table demonstrates what happens to clusters with a certain initial size if you reconfigure a cluster to autoscale between 5 and 10 nodes.

| Initial size | Size after reconfiguration |
|---|---|
| 6 | 6 |
| 12 | 10 |
| 3 | 5 |

Autoscaling local storage

Google Cloud compute instances can be supplemented with additional storage at the worker level using zonal solid state persistent disks.

With autoscaling local storage, Databricks monitors the amount of free disk space available on your cluster’s Spark workers. If a worker begins to run too low on disk, Databricks automatically resizes the zonal SSD PD before it runs out of disk space. Zonal-SSD PD volumes are attached up to a limit of 5 TB of total disk space per instance (including the instance’s local storage).

To configure autoscaling storage, select Enable autoscaling local storage.

Default provisioned storage

Databricks provisions the following storage for each worker node:

  • A 100 GB boot disk for the root volume, used by the host operating system and Databricks internal services.

  • A local SSD used by the Spark worker. This hosts Spark services and logs. Each local SSD is 375 GB. To configure the number of attached local SSDs, see Instance types with local SSDs.

  • Remote SSDs when storage autoscaling is enabled. These start at 150 GB at creation and autoscale as needed.

Automatic termination

You can also set auto termination for a cluster. During cluster creation, you can specify an inactivity period in minutes after which you want the cluster to terminate.

If the difference between the current time and the last command run on the cluster is more than the inactivity period specified, Databricks automatically terminates that cluster. For more information on cluster termination, see Terminate a cluster.
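
In the Clusters API, the inactivity period corresponds to the autotermination_minutes field of the create request; the value below is a placeholder:

{
  "autotermination_minutes": 60
}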

Local disk encryption

Instance types that have local SSDs are encrypted with default Google Cloud server-side encryption. See Instance types with local SSDs.

Cluster tags

Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. You can specify tags as key-value pairs when you create a cluster, and Databricks applies these tags to Databricks Runtime pods and persistent volumes on the GKE cluster and to DBU usage reports.

The Databricks billable usage graphs in the account console can aggregate usage by individual tags. The billable usage CSV reports downloaded from the same page also include default and custom tags. Tags also propagate to GKE and GCE labels.

For detailed information about how pool and cluster tag types work together, see Monitor usage using cluster and pool tags.

To configure cluster tags:

  1. In the Tags section, add a key-value pair for each custom tag.

  2. Click Add.
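
If you create the cluster through the Clusters API, tags can be set with the custom_tags field; the keys and values below are placeholders:

{
  "custom_tags": {
      "team": "data-engineering",
      "cost-center": "1234"
  }
}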

Google service account

To associate this cluster with a Google service account using Google Identity, click Advanced Options and add your Google service account email address in the Google Service Account field. This value is used to authenticate with the GCS and BigQuery data sources.

Important

The service account that you use to access GCS and BigQuery data sources must be in the same project as the service account that was specified when setting up your Databricks account.
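
If you create the cluster through the Clusters API, the service account can be supplied in the google_service_account field of gcp_attributes; the email below is a placeholder:

{
  "gcp_attributes": {
      "google_service_account": "cluster-sa@my-project.iam.gserviceaccount.com"
  }
}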

Availability zones

In the cluster configuration page, under Advanced options, you can select the cluster’s availability zone. This setting lets you specify which availability zone you want the cluster to use. By default, the availability zone setting is set to HA (high availability).

You can also choose a specific zone or auto. Choosing a specific zone is useful primarily if your organization has purchased reserved instances in specific availability zones. If you choose auto, the availability zone is automatically picked for you.
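
If you configure the cluster through the Clusters API, the zone is typically set through gcp_attributes. The sketch below assumes a zone_id field and uses a placeholder zone; the UI options HA and auto may map to special values of the same field, so check the Clusters API reference for the exact names:

{
  "gcp_attributes": {
      "zone_id": "us-central1-a"
  }
}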

High availability (HA) zone

By default, clusters use HA as the availability zone. High availability (HA) is a system feature designed to provide a consistent level of uptime for prolonged periods. The purpose of an HA configuration is to reduce downtime when a zone or instance becomes unavailable. The underlying autoscaler automatically selects the zone to ensure instance usage across zones in a region is balanced.

Warning

Using HA allows GCP to select VMs from multiple availability zones. Because of this, using HA could lead to an increase in price through inter-zone egress charges.

Spark configuration

To fine-tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration.

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. Click the Spark tab.

    In Spark config, enter the configuration properties as one key-value pair per line.

When you configure a cluster using the Clusters API, set Spark properties in the spark_conf field in the Create new cluster API or Update cluster configuration API.

To enforce Spark configurations on clusters, workspace admins can use cluster policies.
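
For example, the spark_conf field is a map of property names to string values. The sketch below is illustrative; the shuffle partition count is an arbitrary example, and spark.databricks.aggressiveWindowDownS is the down-scaling property described earlier in this article:

{
  "spark_conf": {
      "spark.sql.shuffle.partitions": "200",
      "spark.databricks.aggressiveWindowDownS": "120"
  }
}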

Retrieve a Spark configuration property from a secret

Databricks recommends storing sensitive information, such as passwords, in a secret instead of plaintext. To reference a secret in the Spark configuration, use the following syntax:

spark.<property-name> {{secrets/<scope-name>/<secret-name>}}

For example, to set a Spark configuration property called password to the value of the secret stored in secrets/acme_app/password:

spark.password {{secrets/acme_app/password}}

For more information, see Syntax for referencing secrets in a Spark configuration property or environment variable.

Environment variables

You can configure custom environment variables that you can access from init scripts running on a cluster. Databricks also provides predefined environment variables that you can use in init scripts. You cannot override these predefined environment variables.

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. Click the Spark tab.

  3. Set the environment variables in the Environment Variables field.

You can also set environment variables using the spark_env_vars field in the Create new cluster API or Update cluster configuration API.
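
The spark_env_vars field is a map of environment variable names to string values; the variable below is a placeholder:

{
  "spark_env_vars": {
      "MY_APP_ENV": "staging"
  }
}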

Cluster log delivery

When you create a cluster, you can specify a location to deliver the logs for the Spark driver node, worker nodes, and events. Logs are delivered every five minutes and archived hourly in your chosen destination. When a cluster is terminated, Databricks guarantees to deliver all logs generated up until the cluster was terminated.

The destination of the logs depends on the cluster ID. If the specified destination is dbfs:/cluster-log-delivery, cluster logs for 0630-191345-leap375 are delivered to dbfs:/cluster-log-delivery/0630-191345-leap375.

To configure the log delivery location:

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. Click the Logging tab.

  3. Select a destination type.

  4. Enter the cluster log path.

    The log path must be a DBFS path that begins with dbfs:/.

Note

This feature is also available in the REST API. See the Clusters API.
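
For example, a cluster log destination on DBFS can be configured in a create request with the cluster_log_conf field; this is a minimal sketch that reuses the destination path from the example above:

{
  "cluster_log_conf": {
      "dbfs": {
          "destination": "dbfs:/cluster-log-delivery"
      }
  }
}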