Develop a job on Databricks by using Databricks Asset Bundles

Preview

This feature is in Public Preview.

Databricks Asset Bundles, also known simply as bundles, enable you to programmatically validate, deploy, and run Databricks resources such as jobs. You can also use bundles to programmatically manage Delta Live Tables pipelines and work with MLOps Stacks. See What are Databricks Asset Bundles?.

This article describes steps that you can complete from a local development setup to use a bundle that programmatically manages a job. See Introduction to Databricks Workflows.

If you have existing jobs that were created by using the Databricks Workflows user interface or API and you want to move them to bundles, you must recreate them as bundle configuration files. To do so, Databricks recommends that you first create a bundle by using the steps below and then validate whether the bundle works. You can then add job definitions, notebooks, and other sources to the bundle. See Add an existing job definition to a bundle.

In addition to using the Databricks CLI to run a job deployed by a bundle, you can also view and run these jobs in the Databricks Jobs UI. See View and run a job created with a Databricks Asset Bundle.

Requirements

  • Databricks CLI version 0.205 or above. To check your installed version of the Databricks CLI, run the command databricks -v. To install Databricks CLI version 0.205 or above, see Install or update the Databricks CLI.

  • The remote workspace must have workspace files enabled. See What are workspace files?.

Decision: Create the bundle by using a template or manually

Decide whether you want to create the bundle using a template or manually:

Create the bundle by using a template

In these steps, you create the bundle by using the Databricks default bundle template for Python, which consists of a notebook or Python code, paired with the definition of a job to run it. You then validate, deploy, and run the deployed job within your Databricks workspace.

Step 1: Set up authentication

In this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Databricks configuration profile named DEFAULT for authentication.

Note

U2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in Authentication.

  1. Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.

    In the following command, replace <workspace-url> with your Databricks workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com.

    databricks auth login --host <workspace-url>
    
  2. The Databricks CLI prompts you to save the information that you entered as a Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.

    To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile’s existing settings, run the command databricks auth env --profile <profile-name>.

  3. In your web browser, complete the on-screen instructions to log in to your Databricks workspace.

  4. To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:

    • databricks auth token --host <workspace-url>

    • databricks auth token -p <profile-name>

    • databricks auth token --host <workspace-url> -p <profile-name>

    If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.
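
To see whether several profiles share a --host value, you can inspect your .databrickscfg file. The following is a minimal sketch, assuming the file uses the usual INI layout with a host key per profile. The function name profiles_by_host and the sample file contents are hypothetical and used only for illustration.

```python
import configparser
from collections import defaultdict

def profiles_by_host(cfg_text):
    """Map each host URL to the list of profile names that use it."""
    # Rename the parser's special default section so that a profile named
    # DEFAULT is read like any other section instead of being hidden.
    parser = configparser.ConfigParser(
        default_section="__unused__", interpolation=None
    )
    parser.read_string(cfg_text)
    grouped = defaultdict(list)
    for profile in parser.sections():
        host = parser.get(profile, "host", fallback=None)
        if host:
            # Ignore a trailing slash so equivalent URLs are grouped together.
            grouped[host.rstrip("/")].append(profile)
    return dict(grouped)

# Hypothetical file contents; a real file usually lives at ~/.databrickscfg.
sample_cfg = """\
[DEFAULT]
host = https://1234567890123456.7.gcp.databricks.com

[staging]
host = https://1234567890123456.7.gcp.databricks.com/
"""

# Keep only the hosts that more than one profile points to.
shared = {h: p for h, p in profiles_by_host(sample_cfg).items() if len(p) > 1}
print(shared)
```

Any host that appears in the result is a candidate for passing --host and -p together.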

Step 2: Create the bundle

A bundle contains the artifacts you want to deploy and the settings for the resources you want to run.

  1. Use your terminal or command prompt to switch to a directory on your local development machine that will contain the template’s generated bundle.

  2. Use the Databricks CLI to run the bundle init command:

    databricks bundle init
    
  3. For Template to use, leave the default value of default-python by pressing Enter.

  4. For Unique name for this project, leave the default value of my_project, or type a different value, and then press Enter. This determines the name of the root directory for this bundle. This root directory is created within your current working directory.

  5. If you want your bundle to include a sample notebook, for Include a stub (sample) notebook, leave the default value of yes by pressing Enter. This creates a sample notebook in the src directory within your bundle.

  6. For Include a stub (sample) DLT pipeline, select no and press Enter. This instructs the Databricks CLI to not define a sample Delta Live Tables pipeline in your bundle.

  7. For Include a stub (sample) Python package, select no and press Enter. This instructs the Databricks CLI to not add sample Python wheel package files or related build instructions to your bundle.

Step 3: Explore the bundle

To view the files that the template generated, switch to the root directory of your newly created bundle and open this directory with your preferred IDE, for example Visual Studio Code. Files of particular interest include the following:

  • databricks.yml: This file specifies the bundle’s programmatic name, includes a reference to the job definition, and specifies settings about the target workspace.

  • resources/<project-name>_job.yml: This file specifies the job’s settings.

  • src/notebook.ipynb: This file is a notebook that, when run, simply initializes an RDD that contains the numbers 1 through 10.

For customizing jobs, the mappings within a job declaration correspond to the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format.

Tip

You can define, combine, and override the settings for new job clusters in bundles by using the techniques described in Override cluster settings in Databricks Asset Bundles.

Step 4: Validate the project’s bundle configuration file

In this step, you check whether the bundle configuration is valid.

  1. From the root directory, use the Databricks CLI to run the bundle validate command, as follows:

    databricks bundle validate
    
  2. If a JSON representation of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.

If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.
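
If you script this check, the success criterion described above (a JSON representation is returned) can be tested directly. The following is a rough sketch, assuming the CLI prints that JSON to standard output. The function name validation_succeeded and both sample outputs are hypothetical.

```python
import json

def validation_succeeded(cli_output):
    """Return True if the captured output parses as a JSON object."""
    try:
        parsed = json.loads(cli_output)
    except json.JSONDecodeError:
        # Error messages are plain text, so they fail to parse as JSON.
        return False
    return isinstance(parsed, dict)

# Hypothetical outputs, used only to illustrate the check.
ok_output = '{"bundle": {"name": "my_project"}}'
bad_output = "Error: some validation problem"

print(validation_succeeded(ok_output))   # True
print(validation_succeeded(bad_output))  # False
```

To feed real output into this function, you could capture it with subprocess.run(["databricks", "bundle", "validate"], capture_output=True, text=True) and pass the stdout attribute of the result.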

Step 5: Deploy the local project to the remote workspace

In this step, you deploy the local notebook to your remote Databricks workspace and create the Databricks job within your workspace.

  1. From the bundle root, use the Databricks CLI to run the bundle deploy command as follows:

    databricks bundle deploy -t dev
    
  2. Check whether the local notebook was deployed: In your Databricks workspace’s sidebar, click Workspace.

  3. Click into the Users > <your-username> > .bundle > <project-name> > dev > files > src folder. The notebook should be in this folder.

  4. Check whether the job was created: In your Databricks workspace’s sidebar, click Workflows.

  5. On the Jobs tab, click [dev <your-username>] <project-name>_job.

  6. Click the Tasks tab. There should be one task: notebook_task.

If you make any changes to your bundle after this step, you should repeat steps 4-5 to check whether your bundle configuration is still valid and then redeploy the project.

Step 6: Run the deployed project

In this step, you run the Databricks job in your workspace.

  1. From the root directory, use the Databricks CLI to run the bundle run command, as follows, replacing <project-name> with the name of your project from Step 2:

    databricks bundle run -t dev <project-name>_job
    
  2. Copy the value of Run URL that appears in your terminal and paste this value into your web browser to open your Databricks workspace.

  3. In your Databricks workspace, after the job task completes successfully and shows a green title bar, click the job task to see the results.

If you make any changes to your bundle after this step, you should repeat steps 4-6 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.

Step 7: Clean up

In this step, you delete the deployed notebook and the job from your workspace.

  1. From the root directory, use the Databricks CLI to run the bundle destroy command, as follows:

    databricks bundle destroy
    
  2. Confirm the job deletion request: When prompted to permanently destroy resources, type y and press Enter.

  3. Confirm the notebook deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type y and press Enter.

  4. If you also want to delete the bundle from your development machine, you can now delete the local directory from Step 2.

You have reached the end of the steps for creating a bundle by using a template.

Create the bundle manually

In these steps, you create a bundle from scratch. This simple bundle consists of two notebooks and the definition of a Databricks job to run these notebooks. You then validate, deploy, and run the deployed notebooks from the job within your Databricks workspace. These steps automate the quickstart titled Create your first workflow with a Databricks job.

Step 1: Create the bundle

A bundle contains the artifacts you want to deploy and the settings for the resources you want to run.

  1. Create or identify an empty directory on your development machine.

  2. Switch to the empty directory in your terminal, or open the empty directory in your IDE.

Tip

Your empty directory could be associated with a cloned repository that is managed by a Git provider. This enables you to manage your bundle with external version control and to more easily collaborate with other developers and IT professionals on your project. However, to help simplify this demonstration, a cloned repo is not used here.

If you choose to clone a repo for this demo, Databricks recommends that the repo is empty or has only basic files in it such as README and .gitignore. Otherwise, any pre-existing files in the repo might be unnecessarily synchronized to your Databricks workspace.

Step 2: Add notebooks to the project

In this step, you add two notebooks to your project. The first notebook gets a list of trending baby names since 2007 from the New York State Department of Health’s public data sources. See Baby Names: Trending by Name: Beginning 2007 on the department’s website. This first notebook then saves this data within your Databricks workspace’s FileStore folder in DBFS. The second notebook queries the saved data and displays the baby name counts for the year selected in a notebook widget, which defaults to 2014.

  1. From the directory’s root, create the first notebook, a file named retrieve-baby-names.py.

  2. Add the following code to the retrieve-baby-names.py file:

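    # Databricks notebook source
    import requests
    
    # Download the baby names CSV from the New York State open data portal.
    response = requests.get('http://health.data.ny.gov/api/views/myeu-hzra/rows.csv')
    # Stop here if the download failed, rather than saving an error page.
    response.raise_for_status()
    csvfile = response.content.decode('utf-8')
    # Save the CSV to DBFS, overwriting any existing file.
    dbutils.fs.put("dbfs:/FileStore/babynames.csv", csvfile, True)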
    
  3. Create the second notebook, a file named filter-baby-names.py, in the same directory.

  4. Add the following code to the filter-baby-names.py file:

    # Databricks notebook source
    babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/babynames.csv")
    babynames.createOrReplaceTempView("babynames_table")
    years = spark.sql("select distinct(Year) from babynames_table").rdd.map(lambda row : row[0]).collect()
    years.sort()
    dbutils.widgets.dropdown("year", "2014", [str(x) for x in years])
    display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))
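
The second notebook depends on spark and dbutils, so it runs only in a Databricks workspace. To follow its logic locally, here is a plain-Python sketch of the same two operations: collecting the distinct years and filtering rows by a selected year. The sample rows are made up, and every column name other than Year is an assumption about the dataset.

```python
import csv
import io

# Hypothetical sample rows; only the Year column is confirmed by the notebook.
sample_csv = """Year,First Name,County,Sex,Count
2013,OLIVIA,ALBANY,F,17
2014,OLIVIA,ALBANY,F,21
2014,LIAM,ALBANY,M,19
"""

def distinct_years(rows):
    """Return the sorted distinct values of the Year column."""
    return sorted({row["Year"] for row in rows})

def rows_for_year(rows, year):
    """Keep only the rows whose Year matches the selected value."""
    return [row for row in rows if row["Year"] == year]

rows = list(csv.DictReader(io.StringIO(sample_csv)))
years = distinct_years(rows)            # plays the role of the widget's choices
selected = rows_for_year(rows, "2014")  # "2014" is the widget's default value

print(years)
print(len(selected))
```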
    

Step 3: Add a bundle configuration schema file to the project

If you are using an IDE that supports YAML files and JSON schema files, such as Visual Studio Code, PyCharm Professional, or IntelliJ IDEA Ultimate, you can use it to create the bundle configuration schema file and also to check your project’s bundle configuration file syntax and formatting and to get code completion hints, as follows. Note that while the bundle configuration file that you will create later in Step 5 is YAML-based, the bundle configuration schema file in this step is JSON-based.

Visual Studio Code

  1. Add YAML language server support to Visual Studio Code, for example by installing the YAML extension from the Visual Studio Code Marketplace.

  2. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the bundle schema command and redirect the output to a JSON file. For example, generate a file named bundle_config_schema.json within the current directory, as follows:

    databricks bundle schema > bundle_config_schema.json
    
  3. Note that later in Step 5, you will add the following comment to the beginning of your bundle configuration file, which associates your bundle configuration file with the specified JSON schema file:

    # yaml-language-server: $schema=bundle_config_schema.json
    

    Note

    In the preceding comment, if your Databricks Asset Bundle configuration JSON schema file is in a different path, replace bundle_config_schema.json with the full path to your schema file.

PyCharm Professional

  1. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the bundle schema command and redirect the output to a JSON file. For example, generate a file named bundle_config_schema.json within the current directory, as follows:

    databricks bundle schema > bundle_config_schema.json
    
  2. Configure PyCharm to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in Configure a custom JSON schema.

  3. Note that later in Step 5, you will use PyCharm to create or open a bundle configuration file. By convention, this file is named databricks.yml.

IntelliJ IDEA Ultimate

  1. Generate the Databricks Asset Bundle configuration JSON schema file by using the Databricks CLI to run the bundle schema command and redirect the output to a JSON file. For example, generate a file named bundle_config_schema.json within the current directory, as follows:

    databricks bundle schema > bundle_config_schema.json
    
  2. Configure IntelliJ IDEA to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in Configure a custom JSON schema.

  3. Note that later in Step 5, you will use IntelliJ IDEA to create or open a bundle configuration file. By convention, this file is named databricks.yml.

Step 4: Set up authentication

In this step, you set up authentication between the Databricks CLI on your development machine and your Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Databricks configuration profile named DEFAULT for authentication.

Note

U2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in Authentication.

  1. Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.

    In the following command, replace <workspace-url> with your Databricks workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com.

    databricks auth login --host <workspace-url>
    
  2. The Databricks CLI prompts you to save the information that you entered as a Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.

    To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile’s existing settings, run the command databricks auth env --profile <profile-name>.

  3. In your web browser, complete the on-screen instructions to log in to your Databricks workspace.

  4. To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:

    • databricks auth token --host <workspace-url>

    • databricks auth token -p <profile-name>

    • databricks auth token --host <workspace-url> -p <profile-name>

    If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.

Step 5: Add a bundle configuration file to the project

In this step, you define how you want to deploy and run the two notebooks. For this demo, you want to use a Databricks job to run the first notebook and then the second notebook. Because the first notebook saves the data and the second notebook queries the saved data, you want the first notebook to finish running before the second notebook starts. You model these objectives within a bundle configuration file in your project.

  1. From the directory’s root, create the bundle configuration file, a file named databricks.yml.

  2. Add the following code to the databricks.yml file, replacing <workspace-url> with your workspace URL, for example https://1234567890123456.7.gcp.databricks.com. This URL must match the one in your .databrickscfg file:

Tip

The first line, starting with # yaml-language-server, is required only if your IDE supports it. See Step 3 earlier for details.

# yaml-language-server: $schema=bundle_config_schema.json
bundle:
  name: baby-names

resources:
  jobs:
    retrieve-filter-baby-names-job:
      name: retrieve-filter-baby-names-job
      job_clusters:
        - job_cluster_key: common-cluster
          new_cluster:
            spark_version: 12.2.x-scala2.12
            node_type_id: n2-highmem-4
            num_workers: 1
      tasks:
        - task_key: retrieve-baby-names-task
          job_cluster_key: common-cluster
          notebook_task:
            notebook_path: ./retrieve-baby-names.py
        - task_key: filter-baby-names-task
          depends_on:
            - task_key: retrieve-baby-names-task
          job_cluster_key: common-cluster
          notebook_task:
            notebook_path: ./filter-baby-names.py

targets:
  development:
    workspace:
      host: <workspace-url>

For customizing jobs, the mappings within a job declaration correspond to the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format.
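
To illustrate how the depends_on mapping determines run order, the following sketch mirrors the two task entries as Python dictionaries, similar to how they would appear in a JSON request payload, and derives an order in which every task follows its dependencies. This is an illustration only and not how Databricks schedules tasks internally. The function name execution_order is hypothetical, and graphlib requires Python 3.9 or above.

```python
from graphlib import TopologicalSorter

# The tasks mapping from the bundle configuration, written as Python data.
tasks = [
    {"task_key": "retrieve-baby-names-task"},
    {
        "task_key": "filter-baby-names-task",
        "depends_on": [{"task_key": "retrieve-baby-names-task"}],
    },
]

def execution_order(tasks):
    """Return task keys ordered so that each task follows its dependencies."""
    # Map each task to the set of tasks that must finish before it starts.
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in tasks
    }
    return list(TopologicalSorter(graph).static_order())

print(execution_order(tasks))
```

Because filter-baby-names-task lists retrieve-baby-names-task under depends_on, the retrieve task always comes first in the result.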

Tip

You can define, combine, and override the settings for new job clusters in bundles by using the techniques described in Override cluster settings in Databricks Asset Bundles.

Step 6: Validate the project’s bundle configuration file

In this step, you check whether the bundle configuration is valid.

  1. Use the Databricks CLI to run the bundle validate command, as follows:

    databricks bundle validate
    
  2. If a JSON representation of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.

If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.

Step 7: Deploy the local project to the remote workspace

In this step, you deploy the two local notebooks to your remote Databricks workspace and create the Databricks job within your workspace.

  1. Use the Databricks CLI to run the bundle deploy command as follows:

    databricks bundle deploy -t development
    
  2. Check whether the two local notebooks were deployed: In your Databricks workspace’s sidebar, click Workspace.

  3. Click into the Users > <your-username> > .bundle > baby-names > development > files folder. The two notebooks should be in this folder.

  4. Check whether the job was created: In your Databricks workspace’s sidebar, click Workflows.

  5. On the Jobs tab, click retrieve-filter-baby-names-job.

  6. Click the Tasks tab. There should be two tasks: retrieve-baby-names-task and filter-baby-names-task.

If you make any changes to your bundle after this step, you should repeat steps 6-7 to check whether your bundle configuration is still valid and then redeploy the project.

Step 8: Run the deployed project

In this step, you run the Databricks job in your workspace.

  1. Use the Databricks CLI to run the bundle run command, as follows:

    databricks bundle run -t development retrieve-filter-baby-names-job
    
  2. Copy the value of Run URL that appears in your terminal and paste this value into your web browser to open your Databricks workspace.

  3. In your Databricks workspace, after the two tasks complete successfully and show green title bars, click the filter-baby-names-task task to see the query results.

If you make any changes to your bundle after this step, you should repeat steps 6-8 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.
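
If you repeat steps 6-8 often, you can script the cycle. The following is a minimal sketch that builds the same three commands shown in those steps and runs them in order, stopping at the first failure. The function names bundle_cycle_commands and run_bundle_cycle are hypothetical, and the sketch assumes the databricks executable is on your PATH.

```python
import subprocess

def bundle_cycle_commands(target, job_key):
    """Build the validate, deploy, and run commands for one bundle cycle."""
    return [
        ["databricks", "bundle", "validate"],
        ["databricks", "bundle", "deploy", "-t", target],
        ["databricks", "bundle", "run", "-t", target, job_key],
    ]

def run_bundle_cycle(target, job_key, bundle_root="."):
    """Run each command from the bundle root; raise if any command fails."""
    for argv in bundle_cycle_commands(target, job_key):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed validation prevents the deploy and run steps.
        subprocess.run(argv, cwd=bundle_root, check=True)

# Print the commands for this article's bundle without executing them.
for argv in bundle_cycle_commands("development", "retrieve-filter-baby-names-job"):
    print(" ".join(argv))
```

Calling run_bundle_cycle("development", "retrieve-filter-baby-names-job") from the directory that contains databricks.yml would then execute the three commands.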

Step 9: Clean up

In this step, you delete the two deployed notebooks and the job from your workspace.

  1. Use the Databricks CLI to run the bundle destroy command, as follows:

    databricks bundle destroy
    
  2. Confirm the job deletion request: When prompted to permanently destroy resources, type y and press Enter.

  3. Confirm the notebooks deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type y and press Enter.

Running the bundle destroy command deletes only the deployed job and the folder containing the two deployed notebooks. This command does not delete any side effects, such as the babynames.csv file that the first notebook created. To delete the babynames.csv file, do the following:

  1. In the sidebar of your Databricks workspace, click Catalog.

  2. Click Browse DBFS.

  3. Click the FileStore folder.

  4. Click the dropdown arrow next to babynames.csv, and click Delete.

  5. If you also want to delete the bundle from your development machine, you can now delete the local directory from Step 1.

Add an existing job definition to a bundle

You can use an existing job definition as a basis to define a new job in a bundle configuration file. To do this, complete the following steps.

Note

The following steps create a new job that has the same settings as the existing job. However, the new job has a different job ID than the existing job. You cannot automatically import an existing job ID into a bundle.

Step 1: Get the existing job definition in YAML format

In this step, use the Databricks workspace user interface to get the YAML representation of the existing job definition.

  1. In your Databricks workspace’s sidebar, click Workflows.

  2. On the Jobs tab, click your job’s Name link.

  3. Next to the Run now button, click the ellipsis, and then click View YAML.

  4. On the Create tab, copy the job definition’s YAML to your local clipboard by clicking Copy.

Step 2: Add the job definition YAML to a bundle configuration file

In your bundle configuration file, add the YAML that you copied in the previous step at one of the following locations, labelled <job-yaml-can-go-here>:

resources:
  jobs:
    <some-unique-programmatic-identifier-for-this-job>:
      <job-yaml-can-go-here>

targets:
  <some-unique-programmatic-identifier-for-this-target>:
    resources:
      jobs:
        <some-unique-programmatic-identifier-for-this-job>:
          <job-yaml-can-go-here>
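
The copied YAML must be indented so that it sits beneath the job’s identifier. The following sketch shows that nesting for the first location (the top-level resources mapping), assuming the copied text contains only the job’s own settings. The function name nest_job_yaml and the sample job text are hypothetical.

```python
import textwrap

def nest_job_yaml(job_key, job_yaml):
    """Place copied job settings under resources > jobs > <job_key>."""
    # Normalize the copied text, then indent it three levels (six spaces)
    # so that it becomes the value of the job's identifier.
    body = textwrap.indent(textwrap.dedent(job_yaml).strip("\n"), " " * 6)
    return f"resources:\n  jobs:\n    {job_key}:\n{body}\n"

# Hypothetical job settings, standing in for the YAML copied from the UI.
copied = """
name: hello-job
tasks:
  - task_key: hello-task
    notebook_task:
      notebook_path: ./src/hello.ipynb
"""

print(nest_job_yaml("hello-job", copied))
```

For the second location, under a target, the same settings would be indented further to sit beneath targets > <target> > resources > jobs > <job_key>.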

Step 3: Add notebooks, Python files, and other artifacts to the bundle

Any Python files and notebooks that are referenced in the existing job should be moved to the bundle’s sources.

For better compatibility with bundles, notebooks should use the IPython notebook format (.ipynb). If you develop the bundle locally, you can export an existing notebook from a Databricks workspace into the .ipynb format by clicking File > Export > IPython Notebook from the Databricks notebook user interface. By convention, you should then put the downloaded notebook into the src/ directory in your bundle.

After you add your notebooks, Python files, and other artifacts to the bundle, make sure that your job definition references them. For example, for a notebook with the filename of hello.ipynb that is in a src/ directory, and the src/ directory is in the same folder as the bundle configuration file that references the src/ directory, the job definition might be expressed as follows:

resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
      - task_key: hello-task
        notebook_task:
          notebook_path: ./src/hello.ipynb
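
Before you validate, you might want to confirm that every notebook_path in the job definition points to a file that exists relative to the bundle configuration file. The following is a rough sketch of such a check. It takes the job definition as a Python dictionary rather than parsing YAML, and the function name missing_notebooks is hypothetical.

```python
import tempfile
from pathlib import Path

def missing_notebooks(bundle_root, job):
    """Return the notebook paths in a job that do not exist on disk."""
    root = Path(bundle_root)
    missing = []
    for task in job.get("tasks", []):
        notebook_path = task.get("notebook_task", {}).get("notebook_path")
        # Paths are resolved relative to the folder with the bundle config.
        if notebook_path and not (root / notebook_path).is_file():
            missing.append(notebook_path)
    return missing

# The hello-job definition from above, written as Python data.
job = {
    "name": "hello-job",
    "tasks": [
        {
            "task_key": "hello-task",
            "notebook_task": {"notebook_path": "./src/hello.ipynb"},
        }
    ],
}

# Demonstrate the check in a temporary directory that acts as the bundle root.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_notebooks(tmp, job)  # the notebook is not there yet
    (Path(tmp) / "src").mkdir()
    (Path(tmp) / "src" / "hello.ipynb").write_text("{}")
    after = missing_notebooks(tmp, job)   # the notebook now exists

print(before)
print(after)
```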

Step 4: Validate, deploy, and run the new job

  1. Validate that the bundle’s configuration files are syntactically correct by running the following command:

    databricks bundle validate
    
  2. Deploy the bundle by running the following command. In this command, replace <target-identifier> with the unique programmatic identifier for the target from the bundle configuration:

    databricks bundle deploy -t <target-identifier>
    
  3. Run the job by running the following command. In this command, replace the following:

    • Replace <target-identifier> with the unique programmatic identifier for the target from the bundle configuration.

    • Replace <job-identifier> with the unique programmatic identifier for the job from the bundle configuration.

    databricks bundle run -t <target-identifier> <job-identifier>