Create a bundle manually
In this tutorial, you create a Databricks Asset Bundle from scratch. This simple bundle consists of two notebooks and the definition of a Databricks job to run these notebooks. You then validate, deploy, and run the job in your Databricks workspace. These steps automate the quickstart titled Create your first workflow with a Databricks job.
Requirements
- Databricks CLI version 0.218.0 or above. To check your installed version of the Databricks CLI, run the command `databricks -v`. To install the Databricks CLI, see Install or update the Databricks CLI.
- Authentication configured for the Databricks CLI. See Authentication for the Databricks CLI.
- The remote Databricks workspace must have workspace files enabled. See What are workspace files?.
Step 1: Create the bundle
A bundle contains the artifacts you want to deploy and the settings for the resources you want to run.
1. Create or identify an empty directory on your development machine.
2. Switch to the empty directory in your terminal, or open it in your IDE.
Tip
You can also use a directory containing a repository cloned from a Git provider. This enables you to manage your bundle with external version control and more easily collaborate with other developers and IT professionals on your project.
If you choose to clone a repo for this demo, Databricks recommends that the repo is empty or has only basic files in it, such as `README` and `.gitignore`. Otherwise, any pre-existing files in the repo might be unnecessarily synchronized to your Databricks workspace.
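If you do start from a repo, one file worth excluding from version control is the bundle's local state. A minimal `.gitignore` sketch, assuming the CLI keeps its local cache in a `.databricks` directory in the project root (as recent CLI versions do):

```
# Local Databricks bundle deployment state; regenerated on deploy
.databricks/
```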
Step 2: Add notebooks to the project
In this step, you add two notebooks to your project. The first notebook gets a list of trending baby names since 2007 from the New York State Department of Health’s public data sources. See Baby Names: Trending by Name: Beginning 2007 on the department’s website. The first notebook then saves this data to your Unity Catalog volume named `my-volume`, in a schema named `default`, in a catalog named `main`. The second notebook queries the saved data and displays aggregated counts of the baby names by first name and sex for 2014.
1. From the directory’s root, create the first notebook: a file named `retrieve-baby-names.py`.
2. Add the following code to the `retrieve-baby-names.py` file:

   ```python
   # Databricks notebook source
   import requests

   response = requests.get('http://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv')
   csvfile = response.content.decode('utf-8')
   dbutils.fs.put("/Volumes/main/default/my-volume/babynames.csv", csvfile, True)
   ```

3. Create the second notebook, a file named `filter-baby-names.py`, in the same directory.
4. Add the following code to the `filter-baby-names.py` file:

   ```python
   # Databricks notebook source
   babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/Volumes/main/default/my-volume/babynames.csv")
   babynames.createOrReplaceTempView("babynames_table")
   years = spark.sql("select distinct(Year) from babynames_table").toPandas()['Year'].tolist()
   years.sort()
   dbutils.widgets.dropdown("year", "2014", [str(x) for x in years])
   display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))
   ```
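The second notebook's Spark logic (collect the distinct years, sort them, and filter rows by a selected year) can be sketched in plain Python. The column name `Year` matches the dataset as used above; the sample rows here are hypothetical stand-ins for the downloaded CSV:

```python
# Sketch of the second notebook's year-filter logic, without Spark.
# Sample rows are hypothetical stand-ins for the real dataset.
rows = [
    {"Year": 2013, "First Name": "OLIVIA", "Count": 31},
    {"Year": 2014, "First Name": "EMMA", "Count": 45},
    {"Year": 2014, "First Name": "NOAH", "Count": 40},
]

# Equivalent of: select distinct(Year) from babynames_table, then sort
years = sorted({row["Year"] for row in rows})

# Equivalent of the widget-driven filter for a selected year
selected = "2014"
filtered = [row for row in rows if str(row["Year"]) == selected]

print(years)     # distinct years, ascending
print(filtered)  # only the rows for the selected year
```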
Step 3: Add a bundle configuration schema file to the project
If you are using an IDE that supports YAML and JSON schema files, such as Visual Studio Code, PyCharm Professional, or IntelliJ IDEA Ultimate, you can use your IDE not only to create the bundle configuration schema file but also to check your project’s bundle configuration file’s syntax and formatting. While the bundle configuration file you create later in Step 5 is YAML-based, the bundle configuration schema file in this step is JSON-based.
Visual Studio Code

1. Add YAML language server support to Visual Studio Code, for example by installing the YAML extension from the Visual Studio Code Marketplace.
2. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` in the current directory, as follows:

   ```shell
   databricks bundle schema > bundle_config_schema.json
   ```

3. In Step 5 you will add the following comment to the beginning of your bundle configuration file, which associates your bundle configuration file with the specified JSON schema file:

   ```yaml
   # yaml-language-server: $schema=bundle_config_schema.json
   ```
Note
In the preceding comment, if your Databricks Asset Bundle configuration JSON schema file is in a different path, replace `bundle_config_schema.json` with the full path to your schema file.
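The generated schema file is ordinary JSON, so you can inspect it with standard tools. As a quick illustration, the top-level properties can be listed with Python's `json` module; the schema snippet below is a hypothetical, heavily abbreviated stand-in for the real generated file:

```python
import json

# Hypothetical, abbreviated stand-in for bundle_config_schema.json.
schema_text = """
{
  "type": "object",
  "properties": {
    "bundle": {"type": "object"},
    "resources": {"type": "object"},
    "targets": {"type": "object"}
  }
}
"""

schema = json.loads(schema_text)
print(sorted(schema["properties"]))  # top-level bundle configuration mappings
```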
PyCharm Professional

1. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` in the current directory, as follows:

   ```shell
   databricks bundle schema > bundle_config_schema.json
   ```

2. Configure PyCharm to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in Configure a custom JSON schema.
3. In Step 5 you will use PyCharm to create or open a bundle configuration file. By convention, this file is named `databricks.yml`.
IntelliJ IDEA Ultimate

1. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the `bundle schema` command and redirect the output to a JSON file. For example, generate a file named `bundle_config_schema.json` in the current directory, as follows:

   ```shell
   databricks bundle schema > bundle_config_schema.json
   ```

2. Configure IntelliJ IDEA to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in Configure a custom JSON schema.
3. In Step 5 you will use IntelliJ IDEA to create or open a bundle configuration file. By convention, this file is named `databricks.yml`.
Step 4: Set up authentication
In this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Databricks configuration profile named `DEFAULT` for authentication.
Note
U2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in Authentication.
1. Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace. In the following command, replace `<workspace-url>` with your Databricks workspace instance URL, for example `https://1234567890123456.7.gcp.databricks.com`:

   ```shell
   databricks auth login --host <workspace-url>
   ```

2. The Databricks CLI prompts you to save the information that you entered as a Databricks configuration profile. Press `Enter` to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces. To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command `databricks auth profiles`. To view a specific profile’s existing settings, run the command `databricks auth env --profile <profile-name>`.
3. In your web browser, complete the on-screen instructions to log in to your Databricks workspace.
4. To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:

   ```shell
   databricks auth token --host <workspace-url>
   databricks auth token -p <profile-name>
   databricks auth token --host <workspace-url> -p <profile-name>
   ```
If you have multiple profiles with the same `--host` value, you might need to specify the `--host` and `-p` options together to help the Databricks CLI find the correct matching OAuth token information.
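Configuration profiles such as `DEFAULT` are stored in the `.databrickscfg` file in your home directory, in INI format. As an illustration only (not a replacement for `databricks auth profiles`), profile names can be read with Python's standard `configparser`; the file contents below are hypothetical:

```python
import configparser

# Hypothetical contents of ~/.databrickscfg; in practice, read the real file.
sample = """
[DEFAULT]
host = https://1234567890123456.7.gcp.databricks.com

[staging]
host = https://9876543210987654.3.gcp.databricks.com
"""

config = configparser.ConfigParser()
config.read_string(sample)

# configparser treats [DEFAULT] specially, so list it alongside named sections.
profiles = ["DEFAULT"] + config.sections()
print(profiles)
```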
Step 5: Add a bundle configuration file to the project
In this step, you define how to deploy and run the two notebooks. For this demo, you want to use a Databricks job to run the first notebook and then the second notebook. Because the first notebook saves the data and the second notebook queries the saved data, you want the first notebook to finish running before the second notebook starts. You model these objectives in a bundle configuration file in your project.
1. From the directory’s root, create the bundle configuration file: a file named `databricks.yml`.
2. Add the following code to the `databricks.yml` file, replacing `<workspace-url>` with your workspace URL, for example `https://1234567890123456.7.gcp.databricks.com`. This URL must match the one in your `.databrickscfg` file.

Tip

The first line, starting with `# yaml-language-server`, is required only if your IDE supports it. See Step 3 earlier for details.

```yaml
# yaml-language-server: $schema=bundle_config_schema.json
bundle:
  name: baby-names

resources:
  jobs:
    retrieve-filter-baby-names-job:
      name: retrieve-filter-baby-names-job
      job_clusters:
        - job_cluster_key: common-cluster
          new_cluster:
            spark_version: 12.2.x-scala2.12
            node_type_id: n2-highmem-4
            num_workers: 1
      tasks:
        - task_key: retrieve-baby-names-task
          job_cluster_key: common-cluster
          notebook_task:
            notebook_path: ./retrieve-baby-names.py
        - task_key: filter-baby-names-task
          depends_on:
            - task_key: retrieve-baby-names-task
          job_cluster_key: common-cluster
          notebook_task:
            notebook_path: ./filter-baby-names.py

targets:
  development:
    workspace:
      host: <workspace-url>
For customizing jobs, the mappings in a job declaration correspond to the request payload, expressed in YAML format, of the create job operation as documented in POST /api/2.1/jobs/create in the REST API reference.
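As an illustration of that correspondence, the job declaration above maps to roughly the following create-job request body. This is a sketch of the payload shape for comparison with the YAML, not an exact or complete API call:

```python
import json

# Approximate JSON request body corresponding to the YAML job declaration.
payload = {
    "name": "retrieve-filter-baby-names-job",
    "job_clusters": [
        {
            "job_cluster_key": "common-cluster",
            "new_cluster": {
                "spark_version": "12.2.x-scala2.12",
                "node_type_id": "n2-highmem-4",
                "num_workers": 1,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "retrieve-baby-names-task",
            "job_cluster_key": "common-cluster",
            "notebook_task": {"notebook_path": "./retrieve-baby-names.py"},
        },
        {
            "task_key": "filter-baby-names-task",
            # depends_on makes the second task wait for the first to finish.
            "depends_on": [{"task_key": "retrieve-baby-names-task"}],
            "job_cluster_key": "common-cluster",
            "notebook_task": {"notebook_path": "./filter-baby-names.py"},
        },
    ],
}

print(json.dumps(payload, indent=2))
```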
Tip
You can define, combine, and override the settings for new job clusters in bundles by using the techniques described in Override cluster settings in Databricks Asset Bundles.
Step 6: Validate the project’s bundle configuration file
In this step, you check whether the bundle configuration is valid.
Use the Databricks CLI to run the `bundle validate` command, as follows:

```shell
databricks bundle validate
```
If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.
If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.
Step 7: Deploy the local project to the remote workspace
In this step, you deploy the two local notebooks to your remote Databricks workspace and create the Databricks job in your workspace.
Use the Databricks CLI to run the `bundle deploy` command, as follows:

```shell
databricks bundle deploy -t development
```
Check whether the two local notebooks were deployed: In your Databricks workspace’s sidebar, click Workspace, and then click into the Users > `<your-username>` > .bundle > baby-names > development > files folder. The two notebooks should be in this folder.
Check whether the job was created: In your Databricks workspace’s sidebar, click Workflows. On the Jobs tab, click retrieve-filter-baby-names-job, and then click the Tasks tab. There should be two tasks: retrieve-baby-names-task and filter-baby-names-task.
If you make any changes to your bundle after this step, you should repeat steps 6-7 to check whether your bundle configuration is still valid and then redeploy the project.
Step 8: Run the deployed project
In this step, you run the Databricks job in your workspace.
Use the Databricks CLI to run the `bundle run` command, as follows:

```shell
databricks bundle run -t development retrieve-filter-baby-names-job
```

Copy the value of Run URL that appears in your terminal and paste this value into your web browser to open your Databricks workspace.
In your Databricks workspace, after the two tasks complete successfully and show green title bars, click the filter-baby-names-task task to see the query results.
If you make any changes to your bundle after this step, you should repeat steps 6-8 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.
Step 9: Clean up
In this step, you delete the two deployed notebooks and the job from your workspace.
Use the Databricks CLI to run the `bundle destroy` command, as follows:

```shell
databricks bundle destroy
```

Confirm the job deletion request: When prompted to permanently destroy resources, type `y` and press `Enter`.
Confirm the notebooks deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type `y` and press `Enter`.
Running the `bundle destroy` command deletes only the deployed job and the folder containing the two deployed notebooks. This command does not delete any side effects, such as the `babynames.csv` file that the first notebook created. To delete the `babynames.csv` file, do the following:

1. In the sidebar of your Databricks workspace, click Catalog.
2. Browse to the volume where the first notebook saved the file: the my-volume volume in the default schema of the main catalog.
3. Click the dropdown arrow next to babynames.csv, and click Delete.
If you also want to delete the bundle from your development machine, you can now delete the local directory from Step 1.