Configure clusters

Note

The CLI feature is unavailable on Databricks on Google Cloud as of this release.

This article explains the configuration options available when you create and edit Databricks clusters. It focuses on creating and editing clusters using the UI. For other methods, see Clusters CLI and Clusters API.

For help deciding what combination of configuration options suits your needs best, see cluster configuration best practices.

Create cluster

Cluster policy

A cluster policy limits the ability to configure clusters based on a set of rules. The policy rules limit the attributes or attribute values available for cluster creation. Cluster policies have ACLs that limit their use to specific users and groups and thus limit which policies you can select when you create a cluster.

To configure a cluster policy, select the cluster policy in the Policy drop-down.

Select cluster policy

Note

If no policies have been created in the workspace, the Policy drop-down does not display.

If you have:

  • Cluster create permission, you can select the Unrestricted policy and create fully configurable clusters. The Unrestricted policy does not limit any cluster attributes or attribute values.
  • Both cluster create permission and access to cluster policies, you can select the Unrestricted policy and the policies you have access to.
  • Access to cluster policies only, you can select the policies you have access to.
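
If you create clusters with the Clusters API rather than the UI, a policy is applied by passing its ID in the create request. The following Python sketch assumes the requests library and uses placeholder workspace, token, and policy values; the runtime version and node type are examples only:

    import requests

    # Placeholder values; substitute your workspace URL, personal access token, and policy ID.
    WORKSPACE_URL = "https://<workspace-url>"
    TOKEN = "<personal-access-token>"

    cluster_spec = {
        "cluster_name": "policy-governed-cluster",
        "spark_version": "9.1.x-scala2.12",   # example Databricks Runtime version
        "node_type_id": "n1-standard-4",      # example Google Cloud node type
        "num_workers": 2,
        "policy_id": "<policy-id>",           # ID of the cluster policy to apply
        "autotermination_minutes": 120,
    }

    response = requests.post(
        f"{WORKSPACE_URL}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=cluster_spec,
    )
    response.raise_for_status()
    print(response.json()["cluster_id"])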

Cluster mode

Databricks supports three cluster modes: Standard, High Concurrency, and Single Node. The default cluster mode is Standard.

Note

The cluster configuration includes an auto terminate setting whose default value depends on cluster mode:

  • Standard and Single Node clusters terminate automatically after 120 minutes by default.
  • High Concurrency clusters do not terminate automatically by default.

Important

You cannot change the cluster mode after a cluster is created. If you want a different cluster mode, you must create a new cluster.

Standard clusters

A Standard cluster is recommended for a single user. Standard clusters can run workloads developed in any language: Python, SQL, R, and Scala.

High Concurrency clusters

A High Concurrency cluster is a managed cloud resource. The key benefit of High Concurrency clusters is that they provide fine-grained sharing for maximum resource utilization and minimum query latencies.

High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of High Concurrency clusters are provided by running user code in separate processes, which is not possible in Scala.

In addition, only High Concurrency clusters support table access control.

To create a High Concurrency cluster, set Cluster Mode to High Concurrency.

High Concurrency cluster mode
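
If you create the cluster through the Clusters API instead, High Concurrency mode is expressed through a small set of Spark configuration values and a custom tag. The fragment below is a sketch; treat the specific property values as assumptions to verify against the Clusters API reference:

    # Fragment of a Clusters API create request for a High Concurrency cluster.
    # The profile, allowed-languages, and ResourceClass values are assumptions to verify.
    high_concurrency_spec = {
        "spark_conf": {
            "spark.databricks.cluster.profile": "serverless",
            "spark.databricks.repl.allowedLanguages": "sql,python,r",
        },
        "custom_tags": {"ResourceClass": "Serverless"},
    }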

Single Node clusters

A Single Node cluster has no workers and runs Spark jobs on the driver node.

In contrast, a Standard cluster requires at least one Spark worker node in addition to the driver node to execute Spark jobs.

To create a Single Node cluster, set Cluster Mode to Single Node.

Single Node cluster mode

To learn more about working with Single Node clusters, see Single Node clusters.
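
In the Clusters API, a Single Node cluster is expressed as a cluster with zero workers plus a Spark configuration profile and a custom tag. The fragment below is a sketch; treat the specific property values as assumptions to verify against the Single Node clusters documentation:

    # Fragment of a Clusters API create request for a Single Node cluster.
    # The profile, master, and ResourceClass values are assumptions to verify.
    single_node_spec = {
        "num_workers": 0,
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }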

Pool

To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances. The cluster is created using instances in the pool. If a pool does not have sufficient idle resources to create the requested driver or workers, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.

See Pools to learn more about working with pools in Databricks.

Databricks Runtime

Databricks runtimes are the set of core components that run on your clusters. All Databricks runtimes include Apache Spark and add components and updates that improve usability, performance, and security. For details, see Databricks runtimes.

Databricks offers several types of runtimes and several versions of those runtime types in the Databricks Runtime Version drop-down when you create or edit a cluster.

Select Runtime version

Python version

Databricks on Google Cloud does not support Python 2.

Frequently asked questions (FAQ)

What libraries are installed on Python clusters?

For details on the specific libraries that are installed, see the Databricks runtime release notes.

Will my existing .egg libraries work with Python 3?

It depends on whether your existing egg library is cross-compatible with both Python 2 and 3. If the library does not support Python 3, either library attachment will fail or runtime errors will occur.

For a comprehensive guide on porting code to Python 3 and writing code compatible with both Python 2 and 3, see Supporting Python 3.

Can I still install Python libraries using init scripts?

A common use case for Cluster node initialization scripts is to install packages.

The pip command refers to the pip in the correct Python virtual environment. However, if you use an init script to create the Python virtual environment, always use the absolute path to access python and pip.
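
For illustration, the Python sketch below writes a small init script to DBFS from a notebook using dbutils.fs.put. The script location, the package name, and the /databricks/python/bin/pip path are assumptions for this example; adjust them for your environment:

    # Hypothetical example: create an init script on DBFS from a notebook cell.
    # The absolute pip path is an assumption; confirm it for your Databricks Runtime.
    script = """#!/bin/bash
    /databricks/python/bin/pip install simplejson
    """

    dbutils.fs.put("dbfs:/databricks/init-scripts/install-simplejson.sh", script, True)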

Cluster node type

A cluster consists of one driver node and zero or more worker nodes.

You can pick separate cloud provider instance types for the driver and worker nodes, although by default the driver node uses the same instance type as the worker node. Different families of instance types fit different use cases, such as memory-intensive or compute-intensive workloads.

Driver node

The driver node maintains state information of all notebooks attached to the cluster. The driver node also maintains the SparkContext and interprets all the commands you run from a notebook or a library on the cluster, and runs the Apache Spark master that coordinates with the Spark executors.

The default value of the driver node type is the same as the worker node type. You can choose a larger driver node type with more memory if you plan to collect() a lot of data from Spark workers and analyze it in the notebook.
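
In the Clusters API, the worker and driver instance types are set with the node_type_id and driver_node_type_id fields; if driver_node_type_id is omitted, the driver uses the worker node type. A small sketch, with instance type names chosen purely as examples:

    # Fragment of a cluster spec that selects a larger driver than the workers.
    # The instance type names are illustrative examples for Google Cloud.
    node_type_fragment = {
        "node_type_id": "n1-standard-4",        # worker instance type
        "driver_node_type_id": "n1-highmem-8",  # roomier driver for collect()-heavy notebooks
    }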

Tip

Since the driver node maintains all of the state information of the notebooks attached, make sure to detach unused notebooks from the driver node.

Worker node

Databricks worker nodes run the Spark executors and other services required for the proper functioning of the clusters. When you distribute your workload with Spark, all of the distributed processing happens on worker nodes. Databricks runs one executor per worker node; therefore the terms executor and worker are used interchangeably in the context of the Databricks architecture.

Tip

To run a Spark job, you need at least one worker node. If a cluster has zero workers, you can run non-Spark commands on the driver node, but Spark commands will fail.

Cluster instance types with local SSDs

For the latest list of instance types, the prices of each, and the size of the local SSDs, see the GCP pricing estimator.

Instance types that have local SSDs are encrypted with default Google Cloud server-side encryption.

Instance types with local SSDs automatically use Delta cache for improved performance. Cache sizes on all instance types are set automatically, so you do not need to set the disk usage explicitly.

Cluster size and autoscaling

When you create a Databricks cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers for the cluster.

When you provide a fixed size cluster, Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling.
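
In Clusters API terms, a fixed-size cluster sets num_workers, while an autoscaling cluster sets an autoscale object with minimum and maximum worker counts, as in this sketch:

    # Fixed-size cluster: exactly eight workers.
    fixed_size_fragment = {"num_workers": 8}

    # Autoscaling cluster: Databricks varies the worker count between 2 and 8.
    autoscaling_fragment = {"autoscale": {"min_workers": 2, "max_workers": 8}}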

With autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during these phases of your job (and removes them when they’re no longer needed).

Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the cluster to match a workload. This applies especially to workloads whose requirements change over time (like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning requirements are unknown. Autoscaling thus offers two advantages:

  • Workloads can run faster compared to a constant-sized under-provisioned cluster.
  • Autoscaling clusters can reduce overall costs compared to a statically-sized cluster.

Depending on the constant size of the cluster and the workload, autoscaling gives you one or both of these benefits at the same time. The cluster size can go below the minimum number of workers selected when the cloud provider terminates instances. In this case, Databricks continuously retries to re-provision instances in order to maintain the minimum number of workers.

Note

Autoscaling is not available for spark-submit jobs.

Autoscaling types

Databricks offers two types of cluster node autoscaling: standard and optimized. For a discussion of the benefits of optimized autoscaling, see the blog post on Optimized Autoscaling.

Automated (job) clusters always use optimized autoscaling. The type of autoscaling performed on all-purpose clusters depends on the workspace configuration.

Standard autoscaling is used by all-purpose clusters in workspaces in the Standard pricing tier. Optimized autoscaling is used by all-purpose clusters in the Databricks Premium Plan.

How autoscaling behaves

Autoscaling behaves differently depending on whether it is optimized or standard and whether applied to an all-purpose or a job cluster.

Optimized autoscaling

  • Scales up from min to max in 2 steps.
  • Can scale down even if the cluster is not idle by looking at shuffle file state.
  • Scales down based on a percentage of current nodes.
  • On job clusters, scales down if the cluster is underutilized over the last 40 seconds.
  • On all-purpose clusters, scales down if the cluster is underutilized over the last 150 seconds.
  • The spark.databricks.aggressiveWindowDownS Spark configuration property specifies in seconds how often a cluster makes down-scaling decisions. Increasing the value causes a cluster to scale down more slowly. The maximum value is 600.

Standard autoscaling

  • Starts with adding 8 nodes. Thereafter, scales up exponentially, but can take many steps to reach the max. You can customize the first step by setting the spark.databricks.autoscaling.standardFirstStepUp Spark configuration property.
  • Scales down only when the cluster is completely idle and it has been underutilized for the last 10 minutes.
  • Scales down exponentially, starting with 1 node.

Enable and configure autoscaling

To allow Databricks to resize your cluster automatically, you enable autoscaling for the cluster and provide the min and max range of workers.

  1. Enable autoscaling.

    • All-Purpose cluster - On the Create Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box:

      Enable autoscaling for interactive clusters
    • Job cluster - On the Configure Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box:

      Enable autoscaling for job clusters
  2. Configure the min and max workers.

    Configure min and max workers

    When the cluster is running, the cluster detail page displays the number of allocated workers. You can compare the number of allocated workers with the worker configuration and make adjustments as needed.

Important

If you are using an instance pool:

  • Make sure the cluster size requested is less than or equal to the minimum number of idle instances in the pool. If it is larger, cluster startup time will be equivalent to a cluster that doesn’t use a pool.
  • Make sure the maximum cluster size is less than or equal to the maximum capacity of the pool. If it is larger, the cluster creation will fail.

Autoscaling example

If you reconfigure a static cluster to be an autoscaling cluster, Databricks immediately resizes the cluster within the minimum and maximum bounds and then starts autoscaling. As an example, the following table demonstrates what happens to clusters with a certain initial size if you reconfigure a cluster to autoscale between 5 and 10 nodes.

Initial size    Size after reconfiguration
6               6
12              10
3               5

Google Cloud configurations

When you configure a cluster’s Google Cloud instances, you can specify Google Cloud-specific options.

Use Preemptible Instances

To use what Google Cloud calls preemptible VM instances for your Databricks nodes, in the Advanced Options section of cluster configuration, select Use Preemptible Executors.

A preemptible VM is an instance that you can create and run at a much lower price than normal instances. However, Google Cloud might stop (preempt) these instances if it requires access to those resources for other tasks. Preemptible instances are excess Compute Engine capacity, so their availability varies with usage.

Google service account

To associate this cluster with a Google service account using Google Identity, click Advanced Options and add your Google service account email address in the Google Service Account field. This value is used to authenticate with the GCS and BigQuery data sources.
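
When you create clusters through the Clusters API, these Google Cloud options are passed under a gcp_attributes object. The field names in the sketch below are assumptions to verify against the Clusters API reference, and the service account address is a placeholder:

    # Fragment of a cluster spec with Google Cloud-specific options.
    # Field names are assumptions; confirm them in the Clusters API reference.
    gcp_fragment = {
        "gcp_attributes": {
            "google_service_account": "my-sa@my-project.iam.gserviceaccount.com",
            "use_preemptible_executors": True,
        }
    }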

Important

The service account that you use to access GCS and BigQuery data sources must be in the same project as the service account that was specified when setting up your Databricks account.

Local disk encryption

Instance types that have local SSDs are encrypted with default Google Cloud server-side encryption. See Cluster instance types with local SSDs.

Spark configuration

To fine tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration.

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. Click the Spark tab.

    Spark configuration

    In Spark config, enter the configuration properties as one key-value pair per line.

When you configure a cluster using the Clusters API, set Spark properties in the spark_conf field in the Create cluster request or Edit cluster request.
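
For example, the down-scaling window property mentioned under Optimized autoscaling could be set through the spark_conf field as in this sketch; property values are passed as strings, and the second property is just an illustrative Spark setting:

    # Fragment of a Create cluster or Edit cluster request body.
    spark_conf_fragment = {
        "spark_conf": {
            "spark.databricks.aggressiveWindowDownS": "120",
            "spark.sql.shuffle.partitions": "200",
        }
    }

In the UI's Spark config box, the same properties are entered as one key-value pair per line, for example spark.databricks.aggressiveWindowDownS 120.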

Environment variables

You can set environment variables that you can access from scripts running on a cluster.

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. Click the Spark tab.

  3. Set the environment variables in the Environment Variables field.

    Environment Variables field

You can also set environment variables using the spark_env_vars field in the Create cluster request or Edit cluster request Clusters API endpoints.
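
For example, a variable set in this field or through spark_env_vars can be read on the cluster with standard environment-variable access; the DATA_BUCKET name below is hypothetical:

    import os

    # Fragment of a Create/Edit cluster request that sets a custom variable.
    env_fragment = {"spark_env_vars": {"DATA_BUCKET": "gs://my-example-bucket"}}

    # On the running cluster, read it back from any Python script or notebook cell.
    bucket = os.environ.get("DATA_BUCKET", "gs://fallback-bucket")
    print(bucket)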

Note

The environment variables you set in this field are not available in Cluster node initialization scripts. Init scripts support only a limited set of predefined Environment variables.

SSH access to clusters

Cluster SSH is not supported in this release.

For a related feature, see Web terminal.

Cluster log delivery

When you create a cluster, you can specify a location to deliver the logs for the Spark driver node, worker nodes, and events. Logs are delivered every five minutes to your chosen destination. When a cluster is terminated, Databricks guarantees to deliver all logs generated up until the cluster was terminated.

The destination of the logs depends on the cluster ID. If the specified destination is dbfs:/cluster-log-delivery, cluster logs for 0630-191345-leap375 are delivered to dbfs:/cluster-log-delivery/0630-191345-leap375.

To configure the log delivery location:

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. Click the Logging tab.

  3. Select a destination type.

  4. Enter the cluster log path.

    The log path must be a DBFS path that begins with dbfs:/.

Note

This feature is also available in the REST API. See Clusters API.
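
In the API, the delivery location is set with the cluster_log_conf field. A sketch using the DBFS destination from the example above:

    # Fragment of a Create cluster request that enables log delivery to DBFS.
    log_conf_fragment = {
        "cluster_log_conf": {
            "dbfs": {"destination": "dbfs:/cluster-log-delivery"}
        }
    }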

Init scripts

A cluster node initialization script, or init script, is a shell script that runs during startup for each cluster node before the Spark driver or worker JVM starts. You can use init scripts to install packages and libraries not included in the Databricks runtime, modify the JVM system classpath, set system properties and environment variables used by the JVM, or modify Spark configuration parameters, among other configuration tasks.

You can attach init scripts to a cluster by expanding the Advanced Options section and clicking the Init Scripts tab.

For detailed instructions, see Cluster node initialization scripts.