Create clusters, notebooks, and jobs with Terraform
This article shows how to use the Databricks Terraform provider to create a cluster, a notebook, and a job in an existing Databricks workspace.
This article is a companion to the following Databricks getting started articles:
Run your first ETL workload on Databricks, which uses a general-purpose cluster, a Python notebook, and a job to run the notebook.
Tutorial: Query data with notebooks, which uses a general-purpose cluster and a SQL notebook.
You can also adapt the Terraform configurations in this article to create custom clusters, notebooks, and jobs in your workspaces.
Requirements
A Databricks workspace.
A Databricks personal access token, to allow Terraform to call the Databricks APIs within your Databricks workspace. See also Manage personal access tokens.
On your local development machine, you must have:
The Terraform CLI. See Download Terraform on the Terraform website.
One of the following:
Databricks CLI version 0.205 or above, configured with your Databricks personal access token by running the following command:
databricks configure --host <workspace-url> --profile <some-unique-profile-name>
See Install or update the Databricks CLI and Databricks personal access token authentication.
The following two Databricks environment variables:
DATABRICKS_HOST, set to the value of your Databricks workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com
DATABRICKS_TOKEN, set to the value of your Databricks personal access token. See also Manage personal access tokens.
To set these environment variables, see your operating system’s documentation.
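As a rough illustration of the environment-variable option, the following sketch sets both variables for the current session in a POSIX-style shell (bash or zsh is assumed; PowerShell uses different syntax). The URL and token values are placeholders, and check_databricks_env is a hypothetical helper written for this example, not part of any Databricks tool.

```shell
# Placeholder values (assumptions): substitute your own workspace URL and token.
export DATABRICKS_HOST="https://1234567890123456.7.gcp.databricks.com"
export DATABRICKS_TOKEN="<your-personal-access-token>"

# Hypothetical helper: fail with a message if either variable is unset or empty.
check_databricks_env() {
  [ -n "${DATABRICKS_HOST:-}" ] || { echo "DATABRICKS_HOST is not set" >&2; return 1; }
  [ -n "${DATABRICKS_TOKEN:-}" ] || { echo "DATABRICKS_TOKEN is not set" >&2; return 1; }
  echo "Databricks environment variables are set"
}
```

Running check_databricks_env before any Terraform command gives an early, readable failure instead of an authentication error later. Variables exported this way last only for the current shell session.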
Note
As a security best practice, when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.
Step 1: Set up the Terraform project
In this step, you set up a Terraform project to define the settings for Terraform to authenticate with your workspace. You also define the settings for the resources that Terraform deploys to your workspace.
Create an empty directory and then switch to it. This directory contains your Terraform project files. (Each separate set of Terraform project files must be in its own parent directory.) To do this, in your terminal or PowerShell, run a command like the following:
mkdir terraform_cluster_notebook_job && cd terraform_cluster_notebook_job
In this empty directory, create a file named auth.tf, and add the following content to the file. This configuration initializes the Databricks Terraform provider and authenticates Terraform with your workspace.
# Initialize the Databricks Terraform provider.
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

# Use environment variables for authentication.
provider "databricks" {}

# Retrieve information about the current user.
data "databricks_current_user" "me" {}
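If you chose the Databricks CLI profile option in the requirements instead of DATABRICKS_HOST and DATABRICKS_TOKEN, the empty provider block has no way to know which named profile you created. One possible approach, assuming the provider honors the DATABRICKS_CONFIG_PROFILE environment variable as described for Databricks unified authentication, is to export the profile name before running Terraform. The profile name below is the placeholder from the requirements, not a real value.

```shell
# Assumption: the Databricks Terraform provider reads this variable to select
# a named profile from your Databricks configuration file.
# Replace the placeholder with the profile name you passed to --profile.
export DATABRICKS_CONFIG_PROFILE="<some-unique-profile-name>"
```

If this does not work with your provider version, check the provider's authentication documentation for the supported way to select a profile.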
With authentication configuration completed, you are now ready to initialize your Terraform project and begin defining Databricks resources.
Run the terraform init command. This command initializes your Terraform project by creating additional helper files and downloading the necessary Terraform providers and modules.
terraform init
If you are creating a cluster, create another file named cluster.tf, and add the following content to the file. This content creates a cluster with the smallest amount of resources allowed. This cluster uses the latest Databricks Runtime Long Term Support (LTS) version.
variable "cluster_name" {
  description = "A name for the cluster."
  type        = string
  default     = "My Cluster"
}

variable "cluster_autotermination_minutes" {
  description = "How many minutes before automatically terminating due to inactivity."
  type        = number
  default     = 60
}

variable "cluster_num_workers" {
  description = "The number of workers."
  type        = number
  default     = 1
}

# Create the cluster with the "smallest" amount
# of resources allowed.
data "databricks_node_type" "smallest" {
  local_disk = true
}

# Use the latest Databricks Runtime
# Long Term Support (LTS) version.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_name
  node_type_id            = data.databricks_node_type.smallest.id
  spark_version           = data.databricks_spark_version.latest_lts.id
  autotermination_minutes = var.cluster_autotermination_minutes
  num_workers             = var.cluster_num_workers
}

output "cluster_url" {
  value = databricks_cluster.this.url
}
If you are creating the cluster, create another file named cluster.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the cluster. Replace the placeholder values with your own values.
cluster_name                    = "My Cluster"
cluster_autotermination_minutes = 60
cluster_num_workers             = 1
If you are creating a notebook, create another file named notebook.tf, and add the following content to the file:
variable "notebook_subdirectory" {
  description = "A name for the subdirectory to store the notebook."
  type        = string
  default     = "Terraform"
}

variable "notebook_filename" {
  description = "The notebook's filename."
  type        = string
}

variable "notebook_language" {
  description = "The language of the notebook."
  type        = string
}

resource "databricks_notebook" "this" {
  path     = "${data.databricks_current_user.me.home}/${var.notebook_subdirectory}/${var.notebook_filename}"
  language = var.notebook_language
  source   = "./${var.notebook_filename}"
}

output "notebook_url" {
  value = databricks_notebook.this.url
}
Save the following notebook code to a file in the same directory as the notebook.tf file:
For the Python notebook for Run your first ETL workload on Databricks, a file named notebook-getting-started-etl-quick-start.py with the following contents:
# Databricks notebook source
# Import functions
from pyspark.sql.functions import col, current_timestamp

# Define variables used in code below
file_path = "/databricks-datasets/structured-streaming/events"
username = spark.sql("SELECT regexp_replace(current_user(), '[^a-zA-Z0-9]', '_')").first()[0]
table_name = f"{username}_etl_quickstart"
checkpoint_path = f"/tmp/{username}/_checkpoint/etl_quickstart"

# Clear out data from previous demo execution
spark.sql(f"DROP TABLE IF EXISTS {table_name}")
dbutils.fs.rm(checkpoint_path, True)

# Configure Auto Loader to ingest JSON data to a Delta table
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .load(file_path)
  .select("*", col("_metadata.file_path").alias("source_file"), current_timestamp().alias("processing_time"))
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)
  .toTable(table_name))

# COMMAND ----------

df = spark.read.table(table_name)

# COMMAND ----------

display(df)
For the SQL notebook for Tutorial: Query data with notebooks, a file named notebook-getting-started-quick-start.sql with the following contents:
-- Databricks notebook source
-- MAGIC %python
-- MAGIC diamonds = (spark.read
-- MAGIC   .format("csv")
-- MAGIC   .option("header", "true")
-- MAGIC   .option("inferSchema", "true")
-- MAGIC   .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
-- MAGIC )
-- MAGIC
-- MAGIC diamonds.write.format("delta").save("/mnt/delta/diamonds")

-- COMMAND ----------

DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING DELTA LOCATION '/mnt/delta/diamonds/'

-- COMMAND ----------

SELECT color, avg(price) AS price FROM diamonds GROUP BY color ORDER BY COLOR
If you are creating the notebook, create another file named notebook.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the notebook configuration. The value of notebook_filename must exactly match the name of the notebook file that you saved in the previous step.
For the Python notebook for Run your first ETL workload on Databricks:
notebook_subdirectory = "Terraform"
notebook_filename     = "notebook-getting-started-etl-quick-start.py"
notebook_language     = "PYTHON"
For the SQL notebook for Tutorial: Query data with notebooks:
notebook_subdirectory = "Terraform"
notebook_filename     = "notebook-getting-started-quick-start.sql"
notebook_language     = "SQL"
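Because notebook.tf reads the notebook source from ./${var.notebook_filename}, a filename in notebook.auto.tfvars that does not match a file on disk only surfaces as an error when Terraform runs. The sketch below is one way to catch that earlier. check_notebook_file is a hypothetical helper, and it assumes a POSIX-style shell, one variable assignment per line in the tfvars file, and that you run it from the project directory.

```shell
# Hypothetical helper: confirm that the file named by notebook_filename in
# notebook.auto.tfvars (or the file given as $1) exists in the current directory.
check_notebook_file() {
  tfvars="${1:-notebook.auto.tfvars}"
  # Extract the quoted value from a line such as: notebook_filename = "name.py"
  filename=$(sed -n 's/^[[:space:]]*notebook_filename[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "$tfvars")
  if [ -z "$filename" ]; then
    echo "notebook_filename not found in $tfvars" >&2
    return 1
  fi
  if [ ! -f "$filename" ]; then
    echo "missing notebook source: $filename" >&2
    return 1
  fi
  echo "found notebook source: $filename"
}
```

Calling check_notebook_file with no arguments checks notebook.auto.tfvars in the current directory.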
If you are creating a notebook, in your Databricks workspace, be sure to set up any requirements for the notebook to run successfully, by referring to the following instructions for:
The Python notebook for Run your first ETL workload on Databricks
The SQL notebook for Tutorial: Query data with notebooks
If you are creating a job, create another file named job.tf, and add the following content to the file. This content creates a job to run the notebook.
variable "job_name" {
  description = "A name for the job."
  type        = string
  default     = "My Job"
}

resource "databricks_job" "this" {
  name                = var.job_name
  existing_cluster_id = databricks_cluster.this.cluster_id
  notebook_task {
    notebook_path = databricks_notebook.this.path
  }
  email_notifications {
    on_success = [ data.databricks_current_user.me.user_name ]
    on_failure = [ data.databricks_current_user.me.user_name ]
  }
}

output "job_url" {
  value = databricks_job.this.url
}
If you are creating the job, create another file named job.auto.tfvars, and add the following content to the file. This file contains a variable value for customizing the job configuration.
job_name = "My Job"
Step 2: Run the configurations
In this step, you run the Terraform configurations to deploy the cluster, the notebook, and the job into your Databricks workspace.
Check to see whether your Terraform configurations are valid by running the terraform validate command. If any errors are reported, fix them, and run the command again.
terraform validate
Check to see what Terraform will do in your workspace, before Terraform actually does it, by running the terraform plan command.
terraform plan
Deploy the cluster, the notebook, and the job into your workspace by running the terraform apply command. When prompted to deploy, type yes and press Enter.
terraform apply
Terraform deploys the resources that are specified in your project. Deploying these resources (especially a cluster) can take several minutes.
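The validate, plan, and apply commands in this step can also be chained so that Terraform applies exactly the plan you reviewed. The following is a minimal sketch, assuming a POSIX-style shell. deploy_with_saved_plan and the default plan file name tfplan are hypothetical names chosen for this example. The -out option of terraform plan and passing a saved plan file to terraform apply are standard Terraform CLI features.

```shell
# Hypothetical wrapper: validate the configuration, save a plan to a file,
# then apply that saved plan. Each step runs only if the previous one succeeded.
deploy_with_saved_plan() {
  plan_file="${1:-tfplan}"  # assumed default plan file name
  terraform validate &&
    terraform plan -out="$plan_file" &&
    terraform apply "$plan_file"
}
```

When terraform apply is given a saved plan file, it does not ask for the yes confirmation, so review the output of the plan step before relying on this pattern in automation.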
Step 3: Explore the results
If you created a cluster, in the output of the terraform apply command, copy the link next to cluster_url, and paste it into your web browser’s address bar.
If you created a notebook, in the output of the terraform apply command, copy the link next to notebook_url, and paste it into your web browser’s address bar.
Note
Before you use the notebook, you might need to customize its contents. See the related documentation about how to customize the notebook.
If you created a job, in the output of the terraform apply command, copy the link next to job_url, and paste it into your web browser’s address bar.
Note
Before you run the notebook, you might need to customize its contents. See the links at the beginning of this article for related documentation about how to customize the notebook.
If you created a job, run the job as follows:
Click Run now on the job page.
After the job finishes running, to view the job run’s results, in the Completed runs (past 60 days) list on the job page, click the most recent time entry in the Start time column. The Output pane shows the result of running the notebook’s code.
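If the terraform apply output has scrolled out of view, the cluster_url, notebook_url, and job_url values can be read back from the Terraform state with terraform output. The sketch below assumes a POSIX-style shell and a Terraform version that supports the -raw option. show_resource_urls is a hypothetical helper name.

```shell
# Hypothetical helper: print each output defined in this article's
# configurations, skipping any output that does not exist in this project.
show_resource_urls() {
  for name in cluster_url notebook_url job_url; do
    # -raw prints the bare string value without quotes.
    if url=$(terraform output -raw "$name" 2>/dev/null); then
      printf '%s: %s\n' "$name" "$url"
    fi
  done
}
```

Run show_resource_urls from the project directory after a successful apply. Outputs for resources that you chose not to create are silently skipped.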
Step 4: Clean up
In this step, you delete the preceding resources from your workspace.
Check to see what Terraform will do in your workspace, before Terraform actually does it, by running the terraform plan command.
terraform plan
Delete the cluster, the notebook, and the job from your workspace by running the terraform destroy command. When prompted to delete, type yes and press Enter.
terraform destroy
Terraform deletes the resources that are specified in your project.
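To confirm that the cleanup finished, you can ask Terraform which resources it still tracks. The sketch below assumes a POSIX-style shell and that terraform state list prints nothing once a destroy has fully succeeded. confirm_destroyed is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if Terraform no longer tracks anything
# in this project's state.
confirm_destroyed() {
  remaining=$(terraform state list)
  if [ -n "$remaining" ]; then
    echo "still tracked in state:" >&2
    echo "$remaining" >&2
    return 1
  fi
  echo "no resources remain in state"
}
```

If the helper reports remaining entries, run terraform destroy again and review any errors that it prints.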