The CLI feature is unavailable on Databricks on Google Cloud as of this release.
An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver or worker JVM starts.
Some examples of tasks performed by init scripts include:
- Install packages and libraries not included in Databricks Runtime. To install Python packages, use the Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into the Databricks Python virtual environment rather than the system Python environment. For example, /databricks/python/bin/pip install <package-name>.
- Modify the JVM system classpath in special cases.
- Set system properties and environment variables used by the JVM.
- Modify Spark configuration parameters.
Databricks supports two kinds of init scripts: cluster-scoped and global.
Cluster-scoped: run on every cluster configured with the script. This is the recommended way to run an init script.
Global: run on every cluster in the workspace. They can help you to enforce consistent cluster configurations across your workspace. Use them carefully because they can cause unanticipated impacts, like library conflicts. Only admin users can create global init scripts.
To manage global init scripts in the current release, you must use the Global Init Scripts API.
Whenever you change any type of init script, you must restart all clusters affected by the script.
Cluster-scoped init scripts support the following environment variables:
DB_CLUSTER_ID: the ID of the cluster on which the script is running. See Clusters API.
DB_CONTAINER_IP: the private IP address of the container in which Spark runs. The init script is run inside this container. See SparkNode.
DB_IS_DRIVER: whether the script is running on a driver node.
DB_DRIVER_IP: the IP address of the driver node.
DB_INSTANCE_TYPE: the instance type of the host VM.
DB_CLUSTER_NAME: the name of the cluster the script is executing on.
DB_PYTHON_VERSION: the version of Python used on the cluster. See Python version.
DB_IS_JOB_CLUSTER: whether the cluster was created to run a job. See Create a job.
SPARKPASSWORD: a path to a secret.
For example, if you want to run part of a script only on a driver node, you could write a script like:
echo $DB_IS_DRIVER
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  <run this part only on driver>
else
  <run this part only on workers>
fi
<run this part on both driver and workers>
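If an init script hands part of its work to a Python helper, the same check can be made from Python by reading the environment. The following is a minimal sketch, assuming the helper is launched from the init script so that it inherits DB_IS_DRIVER; the function names and step labels are hypothetical and used only for illustration.

```python
import os


def is_driver(env=None):
    """Return True when DB_IS_DRIVER marks this node as the driver."""
    env = os.environ if env is None else env
    # The shell example compares against the literal string "TRUE".
    return env.get("DB_IS_DRIVER") == "TRUE"


def steps_for_node(env=None):
    """List the (hypothetical) step labels that would run on this node."""
    steps = ["driver-only" if is_driver(env) else "worker-only"]
    steps.append("all-nodes")  # runs on both driver and workers
    return steps
```

Passing a dictionary instead of relying on os.environ makes the helper easy to exercise outside a cluster, for example steps_for_node({"DB_IS_DRIVER": "TRUE"}).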
Init script start and finish events are captured in cluster event logs. Details are captured in cluster logs.
Cluster event logs capture two init script events: INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED, indicating which scripts are scheduled for execution and which have completed successfully. INIT_SCRIPTS_FINISHED also captures execution duration.
Cluster-scoped init scripts are indicated by the key
Cluster event logs do not log init script events for each cluster node; only one node is selected to represent them all.
If cluster log delivery is configured for a cluster, the init script logs are written to /<cluster-log-path>/<cluster-id>/init_scripts. Logs for each container in the cluster are written to a subdirectory called init_scripts/<cluster_id>_<container_ip>. For example, if cluster-log-path is set to cluster-logs, the path to the logs for a specific container would be /cluster-logs/<cluster-id>/init_scripts/<cluster_id>_<container_ip>.
If the cluster is configured to write logs to DBFS, you can view the logs using the File system utility (dbutils.fs).
Every time a cluster launches, it writes a log to the init script log folder.
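The log directory layout described in this section can be expressed as a small path-building helper. This is a sketch that follows only the pattern stated above; the function name, the cluster ID, and the container IP in the usage note are hypothetical, and any storage prefix that the configured log destination adds is not modeled.

```python
import posixpath


def init_script_log_dir(cluster_log_path, cluster_id, container_ip=None):
    """Build the init script log directory for a cluster or one container.

    Follows /<cluster-log-path>/<cluster-id>/init_scripts and the
    per-container subdirectory <cluster_id>_<container_ip>.
    """
    base = posixpath.join("/", cluster_log_path.strip("/"), cluster_id, "init_scripts")
    if container_ip is None:
        return base
    return posixpath.join(base, f"{cluster_id}_{container_ip}")
```

For example, init_script_log_dir("cluster-logs", "0123-456789-abcde1", "10.0.0.5") yields the directory for one container, which could then be listed with dbutils.fs.ls when the logs are delivered to DBFS.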
Any user who creates a cluster and enables cluster log delivery can view the stdout output from global init scripts. You should ensure that your global init scripts do not output any sensitive information.
Cluster-scoped init scripts are init scripts defined in a cluster configuration. Cluster-scoped init scripts apply to both clusters you create and those created to run jobs. Since the scripts are part of the cluster configuration, cluster access control lets you control who can change the scripts.
You can configure cluster-scoped init scripts using the UI or by invoking the Clusters API. This section focuses on performing these tasks using the UI. For the API, see Clusters API.
You can add any number of scripts, and the scripts are executed sequentially in the order provided.
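When the scripts are configured through the Clusters API rather than the UI, the ordered list might be assembled as follows. This is a sketch under assumptions: the init_scripts field name, the dbfs and gcs keys, and the destination field are taken from the Clusters API as commonly documented and should be confirmed against the Clusters API reference; the helper name and the paths in the usage note are hypothetical.

```python
def init_scripts_fragment(destinations):
    """Build an `init_scripts` fragment for a cluster specification.

    `destinations` is a list of (storage_type, path) pairs. List order is
    preserved because the scripts run sequentially in the order provided.
    """
    fragment = []
    for storage_type, path in destinations:
        if storage_type not in ("dbfs", "gcs"):
            raise ValueError(f"unsupported destination type: {storage_type}")
        # Assumed shape: {"dbfs": {"destination": "..."}} or {"gcs": {...}}
        fragment.append({storage_type: {"destination": path}})
    return {"init_scripts": fragment}
```

A call such as init_scripts_fragment([("dbfs", "dbfs:/databricks/scripts/postgresql-install.sh")]) returns a dictionary that could be merged into the rest of a cluster specification; the dbfs:/ prefix shown here is itself an assumption.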
If a cluster-scoped init script returns a non-zero exit code, the cluster launch fails. You can troubleshoot cluster-scoped init scripts by configuring cluster log delivery and examining the init script log.
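To reason about ordering and failure before attaching scripts to a cluster, the documented behavior (sequential execution, launch failure on the first non-zero exit code) can be imitated locally. The sketch below is a local simulation only, not how Databricks runs init scripts; the function names are hypothetical, and small Python child processes stand in for real scripts.

```python
import subprocess
import sys


def run_in_order(commands):
    """Run (name, argv) commands in order; stop at the first non-zero exit.

    Returns (names_completed, failed_name_or_None).
    """
    completed = []
    for name, argv in commands:
        result = subprocess.run(argv)
        if result.returncode != 0:
            # Mirrors the documented rule: a non-zero exit code fails the launch.
            return completed, name
        completed.append(name)
    return completed, None


def python_exit(code):
    """Stand-in for a script: a child process that exits with `code`."""
    return [sys.executable, "-c", f"import sys; sys.exit({code})"]
```

With three stand-in scripts where the second exits with a non-zero code, run_in_order reports only the first as completed and names the second as the failure; the third never runs.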
You can put init scripts in a DBFS or GCS directory accessible by a cluster. Cluster-node init scripts in DBFS must be stored in the DBFS root. Databricks does not support storing init scripts in a DBFS directory created by mounting object storage.
This section shows two examples of init scripts.
The following snippets, run in a Python notebook, create an init script that installs a PostgreSQL JDBC driver.
Create a DBFS directory you want to store the init script in. This example uses
Create a script named postgresql-install.sh in that directory:
# Write the init script to DBFS; the final argument (True) overwrites an existing file.
dbutils.fs.put("/databricks/scripts/postgresql-install.sh", """#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar
""", True)
Check that the script exists.
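When several drivers need the same treatment, the script text can be generated instead of typed by hand. The following sketch builds the same two-line script as the example above; the helper name and its target_dir parameter are hypothetical, and the dbutils.fs.put call is left as a comment because dbutils is only available inside a Databricks notebook.

```python
def build_jar_install_script(jar_url, target_dir="/mnt/driver-daemon/jars"):
    """Return init script text that downloads one JAR into target_dir."""
    jar_name = jar_url.rsplit("/", 1)[-1]
    lines = [
        "#!/bin/bash",  # shebang must be the first line of the script
        f"wget --quiet -O {target_dir}/{jar_name} {jar_url}",
    ]
    return "\n".join(lines) + "\n"


POSTGRES_JAR_URL = (
    "https://repo1.maven.org/maven2/org/postgresql/postgresql/"
    "42.2.2/postgresql-42.2.2.jar"
)
script = build_jar_install_script(POSTGRES_JAR_URL)

# In a notebook, the text could then be stored with the call used above:
# dbutils.fs.put("/databricks/scripts/postgresql-install.sh", script, True)
```

Building the text from a list of lines keeps the shebang on the first line and avoids stray leading whitespace inside a triple-quoted string.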
In Databricks Runtime 8.4 ML and below, you use the Conda package manager to install Python packages. To install a Python library at cluster initialization, you can use a script like the following:
#!/bin/bash
set -ex
/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python
conda install -y astropy
Starting with Databricks Runtime 9.0, you cannot use conda to install Python libraries. See Libraries for instructions on how to install Python packages on a cluster.
You can configure a cluster to run an init script using the UI or API.
- The script must exist at the configured location. If the script doesn’t exist, the cluster will fail to start or to scale up.
- The init script cannot be larger than 64 KB. If a script exceeds that size, the cluster will fail to launch and a failure message will appear in the cluster log.
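Because an oversized script only fails at cluster launch, it can be worth checking the size before uploading. This is a minimal sketch, assuming that "64KB" means 65,536 bytes of UTF-8 encoded text; the constant and function names are hypothetical.

```python
# Assumption: the documented 64KB limit is interpreted as 64 * 1024 bytes.
MAX_INIT_SCRIPT_BYTES = 64 * 1024


def fits_size_limit(script_text, limit=MAX_INIT_SCRIPT_BYTES):
    """Return True when the encoded script is within the size limit."""
    # Measure bytes, not characters: non-ASCII characters take more than one byte.
    return len(script_text.encode("utf-8")) <= limit
```

A script that is close to the limit is usually a sign that the work belongs in a packaged library or a separate artifact that the init script downloads.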
To use the cluster configuration page to configure a cluster to run an init script:
On the cluster configuration page, click the Advanced Options toggle.
At the bottom of the page, click the Init Scripts tab.
In the Destination drop-down, select a destination type of DBFS or GCS. In the example in the preceding section, the destination is DBFS.
Specify a path to the init script.
- If the Destination drop-down value is GCS, your path must begin with:
- If the Destination drop-down value is DBFS, your path must begin with:
To remove a script from the cluster configuration, click the icon at the right of the script. When you confirm the deletion, you are prompted to restart the cluster. Optionally, you can delete the script file from the location you uploaded it to.
A global init script runs on every cluster created in your workspace. Global init scripts are useful when you want to enforce organization-wide library configurations or security screens. Only admins can create global init scripts.
To manage global init scripts in the current release, you must use the REST API.
Use global init scripts carefully:
- It is easy to add libraries or make other modifications that cause unanticipated impacts. Whenever possible, use cluster-scoped init scripts instead.
- Any user who creates a cluster and enables cluster log delivery can view the stdout output from global init scripts. You should ensure that your global init scripts do not output any sensitive information.
Admins can add, delete, re-order, and get information about the global init scripts in your workspace using the Global Init Scripts API.
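A request body for creating a global init script might be prepared as shown below. This is a sketch under assumptions: the field names name, script (Base64-encoded), position, and enabled, and the endpoint path mentioned in the comment, reflect the Global Init Scripts API as commonly documented and should be confirmed against the API reference; the helper name and the sample script are hypothetical, and no request is sent.

```python
import base64
import json


def global_init_script_payload(name, script_text, position=0, enabled=True):
    """Build a JSON-serializable body for creating a global init script."""
    # Assumed: the API expects the script content as Base64-encoded text.
    encoded = base64.b64encode(script_text.encode("utf-8")).decode("ascii")
    return {
        "name": name,
        "script": encoded,
        "position": position,  # assumed: order relative to other global scripts
        "enabled": enabled,
    }


payload = global_init_script_payload(
    "example-script", "#!/bin/bash\necho 'global init script ran'\n"
)
body = json.dumps(payload)
# Assumed endpoint: POST <workspace-url>/api/2.0/global-init-scripts with `body`.
```

Keeping the script free of secrets matters here, since its stdout is visible to any user who enables cluster log delivery.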