CI/CD techniques with Git and Databricks Git folders (Repos)

Learn techniques for using Databricks Git folders in CI/CD workflows. By configuring Databricks Git folders in the workspace, you can use source control for project files in Git repositories and you can integrate them into your data engineering pipelines.

The following figure shows an overview of the techniques and workflow.

Overview of CI/CD techniques for Git folders.

For an overview of CI/CD with Databricks, see What is CI/CD on Databricks?.

Development flow

Databricks Git folders have user-level folders. User-level folders are automatically created when users first clone a remote repository. You can think of Databricks Git folders in user folders as “local checkouts” that are individual for each user and where users make changes to their code.

In your user folder in Databricks Git folders, clone your remote repository. A best practice is to create a new feature branch or select a previously created branch for your work, instead of directly committing and pushing changes to the main branch. You can make changes, commit, and push changes in that branch. When you are ready to merge your code, you can do so in the Git folders UI.

Requirements

This workflow requires that you have already set up your Git integration.

Note

Databricks recommends that each developer work on their own feature branch. For information about how to resolve merge conflicts, see Resolve merge conflicts.

Collaborate in Git folders

The following workflow uses a branch called feature-b that is based on the main branch.

1. Clone your existing Git repository to your Databricks workspace.
2. Use the Git folders UI to create a feature branch from the main branch. This example uses a single feature branch feature-b for simplicity. You can create and use multiple feature branches to do your work.
3. Make your modifications to Databricks notebooks and other files in the repo.
4. Commit and push your changes to your Git provider.
5. Contributors can now clone the Git repository into their own user folder.
   1. Working on a new branch, a coworker makes changes to the notebooks and other files in the Git folder.
   2. The contributor commits and pushes their changes to the Git provider.
6. To merge changes from other branches or rebase the feature-b branch in Databricks, in the Git folders UI use one of the following workflows:
   - Merge branches. If there’s no conflict, the merge is pushed to the remote Git repository using git push.
   - Rebase on another branch.
7. When you are ready to merge your work to the remote Git repository and main branch, use the Git folders UI to merge the changes from feature-b. If you prefer, you can instead merge changes directly to the Git repository backing your Git folder.

Production job workflow

Databricks Git folders provide two options for running your production jobs:

Option 1: Provide a remote Git reference in the job definition. For example, run a specific notebook in the main branch of a Git repository.
Option 2: Set up a production Git repository and call Repos APIs to update it programmatically. Run jobs against the Databricks Git folder that clones this remote repository. The Repos API call should be the first task in the job.

Option 1: Run jobs using notebooks in a remote repository

Simplify the job definition process and keep a single source of truth by running a Databricks job using notebooks located in a remote Git repository. This Git reference can be a Git commit, tag, or branch and is provided by you in the job definition.

This helps prevent unintentional changes to your production job, such as when a user makes local edits in a production repository or switches branches. It also automates the CD step as you do not need to create a separate production Git folder in Databricks, manage permissions for it, and keep it updated.

See Use Git with jobs.
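
As a rough illustration of Option 1, the following Python sketch builds a job definition that reads a notebook from the main branch of a remote repository. It only constructs the request body and does not contact Databricks. The job name, task key, and notebook path are hypothetical placeholders, compute settings are omitted, and the field names should be checked against the Jobs API reference before you rely on them.

```python
import json


def build_git_job_payload(git_url, git_provider, notebook_path, branch="main"):
    """Return a job definition whose notebook task is sourced from Git.

    Nothing here calls Databricks; the dictionary is meant to be sent as
    the body of a job-creation request. Compute settings are left out.
    """
    return {
        "name": "example-git-backed-job",  # hypothetical job name
        "git_source": {
            "git_url": git_url,
            "git_provider": git_provider,
            "git_branch": branch,  # a tag or commit can be pinned instead
        },
        "tasks": [
            {
                "task_key": "run_notebook",  # hypothetical task key
                "notebook_task": {
                    # Path relative to the root of the remote repository
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
            }
        ],
    }


payload = build_git_job_payload(
    git_url="https://github.com/user/demo.git",
    git_provider="gitHub",
    notebook_path="notebooks/etl",  # hypothetical notebook path
)
print(json.dumps(payload, indent=2))
```

Because the Git reference lives in the job definition, no workspace Git folder has to be kept in sync for this job.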

Option 2: Set up a production Git folder and Git automation

In this option, you set up a production Git folder and automation to update the Git folder on merge.

Step 1: Set up top-level folders

The admin creates non-user top-level folders. The most common use case for these top-level folders is to create development, staging, and production folders that contain Databricks Git folders for the appropriate versions or branches for development, staging, and production. For example, if your company uses the main branch for production, the “production” Git folder must have the main branch checked out in it.

Typically, permissions on these top-level folders are read-only for all non-admin users within the workspace. For such top-level folders, we recommend that you grant CAN EDIT and CAN MANAGE permissions only to service principals, to avoid accidental edits to your production code by workspace users.

Step 2: Set up automated updates to Databricks Git folders with the Git folders API

To keep a Git folder in Databricks at the latest version, you can set up Git automation to call the Repos API. In your Git provider, set up automation that, after every successful merge of a PR into the main branch, calls the Repos API endpoint on the appropriate Git folder to update it to the latest version.

For example, on GitHub this can be achieved with GitHub Actions. For more information, see the Repos API.
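
To make the update call concrete, here is a minimal Python sketch of the request that such automation might send after a merge. It assumes the Git folder is updated with a PATCH to /api/2.0/repos/{repo_id} and a branch name in the body. The host, token, and repo ID are placeholders, and the function only builds the request so that you can inspect it before sending.

```python
import json
import urllib.request


def build_repo_update_request(host, token, repo_id, branch="main"):
    """Build (but do not send) a request that checks out `branch` in the
    Git folder identified by `repo_id` and brings it to the latest commit."""
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; in CI these would come from repository secrets.
request = build_repo_update_request(
    host="https://example.cloud.databricks.com",
    token="dummy-token",
    repo_id=123,
)
print(request.get_method(), request.full_url)
# To send it from your automation: urllib.request.urlopen(request)
```

Running this as the service principal described in the next section keeps personal credentials out of the pipeline.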

Use a service principal for automation with Databricks Git folders

You can use either the Databricks account console or the Databricks CLI to create a service principal authorized to access your workspace’s Git folders.

To create a new service principal, see Manage service principals. When you have a service principal in your workspace, you can link your Git credentials to it so it can access your workspace’s Git folders as part of your automation.

Authorize a service principal to access Git folders

To provide authorized access to your Git folders for a service principal using the Databricks account console:

1. Log in to your Databricks workspace. You must have administrator privileges to your workspace to complete these steps. If you do not have administrator privileges for your workspace, request them or contact your account administrator.
2. In the upper-right corner of any page, click your username, then select Settings.
3. Select Identity and access under Workspace admin in the left navigation pane, and then select the Manage button for Service Principals.
4. From the list of service principals, select the one you want to update with Git credentials. You can also create a new service principal by selecting Add service principal.
5. Select the Git integration tab. (If you didn’t create the service principal or haven’t been assigned the service principal manager privilege on it, it will be greyed out.) Under it, choose the Git provider for the credentials (such as GitHub), select Link Git account, and then select Link.

   You can also use a Git personal access token (PAT) if you do not want to link your own Git credentials. To use a PAT instead, select Personal access token and provide the token information for the Git account to use when authenticating the service principal’s access. For more details on acquiring a PAT from a Git provider, see Configure Git credentials & connect a remote repo to Databricks.
6. You will be prompted to select the Git user account to link. Choose the Git user account the service principal will use for access and select Continue. (If you do not see the user account you want to use, select Use a different account.)
7. In the next dialog, select Authorize Databricks. You will briefly see the message “Linking account…” and then the updated service principal details.

The service principal you chose will now apply the linked Git credentials when accessing your Databricks workspace Git folder resources as part of your automation.

Terraform integration

You can also manage Databricks Git folders in a fully automated setup using Terraform and databricks_repo:

resource "databricks_repo" "this" {
  url = "https://github.com/user/demo.git"
}

To use Terraform to add Git credentials to a service principal, add the following configuration:

  provider "databricks" {
    # Configuration options
  }

  provider "databricks" {
    alias = "sp"
    host = "https://....cloud.databricks.com"
    token = databricks_obo_token.this.token_value
  }

  resource "databricks_service_principal" "sp" {
    display_name = "service_principal_name_here"
  }

  resource "databricks_obo_token" "this" {
    application_id   = databricks_service_principal.sp.application_id
    comment          = "PAT on behalf of ${databricks_service_principal.sp.display_name}"
    lifetime_seconds = 3600
  }

  resource "databricks_git_credential" "sp" {
    provider = databricks.sp
    depends_on = [databricks_obo_token.this]
    git_username          = "myuser"
    git_provider          = "azureDevOpsServices"
    personal_access_token = "sometoken"
  }
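
For comparison, the same credential could in principle be registered without Terraform by calling the Git credentials REST endpoint as the service principal. The Python sketch below assumes a POST to /api/2.0/git-credentials with the same three fields used in the databricks_git_credential resource. The host and both tokens are placeholders, and the function only builds the request.

```python
import json
import urllib.request


def build_git_credential_request(host, sp_token, git_provider, git_username, git_pat):
    """Build (but do not send) a request that stores a Git PAT for the
    identity that owns `sp_token` (here, the service principal)."""
    url = f"{host.rstrip('/')}/api/2.0/git-credentials"
    body = json.dumps(
        {
            "git_provider": git_provider,
            "git_username": git_username,
            "personal_access_token": git_pat,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Token issued on behalf of the service principal (placeholder)
            "Authorization": f"Bearer {sp_token}",
            "Content-Type": "application/json",
        },
    )


credential_request = build_git_credential_request(
    host="https://example.cloud.databricks.com",  # placeholder workspace URL
    sp_token="dummy-sp-token",
    git_provider="azureDevOpsServices",
    git_username="myuser",
    git_pat="sometoken",
)
print(credential_request.get_method(), credential_request.full_url)
```

As with the aliased Terraform provider, the key point is that the call is authenticated as the service principal, so the credential is attached to it rather than to your own user.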

Configure an automated CI/CD pipeline with Databricks Git folders

Here is a simple automation that can be run as a GitHub Action.

Requirements

You have created a Git folder in a Databricks workspace that is tracking the base branch being merged into.
You have a Python package that creates the artifacts to place into a DBFS location. Your code must:
- Update the repository associated with your preferred branch (such as development) to contain the latest versions of your notebooks.
- Build any artifacts and copy them to the library path.
- Overwrite the previous build artifacts with the latest versions so that you do not have to manually update artifact versions in your job.
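
The artifact requirements above amount to copying each built wheel to two destinations: one under its versioned file name and one under a fixed "latest" name that jobs can reference. The following Python sketch computes both destinations. The wheel file name is a hypothetical example, and the library path and latest name are placeholders.

```python
import posixpath


def wheel_targets(wheel_path, dbfs_lib_path, latest_wheel_name):
    """Return the two DBFS destinations for one built wheel:
    a copy under its versioned file name and a fixed-name 'latest' copy."""
    base = dbfs_lib_path.rstrip("/")
    versioned = f"{base}/{posixpath.basename(wheel_path)}"
    latest = f"{base}/{latest_wheel_name}"
    return versioned, latest


versioned, latest = wheel_targets(
    "dist/my_pkg-1.2.0-py3-none-any.whl",  # hypothetical build output
    "dbfs:/path/to/libraries/",
    "latest_wheel_name.whl",
)
print(versioned)
print(latest)
```

A job that points at the fixed-name copy picks up each new build without any change to the job definition.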

Create an automated CI/CD workflow

Set up secrets so your code can access the Databricks workspace. Add the following secrets to the GitHub repository:
- DEPLOYMENT_TARGET_URL: Set this to your workspace URL. Do not include the /?o substring.
- DEPLOYMENT_TARGET_TOKEN: Set this to a Databricks Personal Access Token (PAT). You can generate a Databricks PAT by following the instructions in Databricks personal access token authentication.
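
If you copy the workspace URL from a browser, it may still carry the /?o substring. A small helper like the following Python sketch can strip it before you store the secret. The function name and the example URL are illustrative only.

```python
from urllib.parse import urlsplit


def normalize_workspace_url(raw_url):
    """Keep only the scheme and host of a workspace URL, dropping any
    path, query string (such as ?o=...), or fragment."""
    parts = urlsplit(raw_url.strip())
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {raw_url!r}")
    return f"{parts.scheme}://{parts.netloc}"


cleaned = normalize_workspace_url("https://example.cloud.databricks.com/?o=1234567890")
print(cleaned)
```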

Navigate to the Actions tab of your Git repository and click the New workflow button. At the top of the page, select Set up a workflow yourself and paste in this script:

The "set up a workflow yourself" link in the GitHub Actions UI

# This is a basic automation workflow to help you get started with GitHub Actions.

name: CI

# Controls when the workflow will run
on:
  # Triggers the workflow on pushes to the base branch listed below
  push:
    paths-ignore:
      # Skip pushes that only change files under the .github directory
      - '.github/**'
    branches:
      # Set your base branch name here
      - your-base-branch-name

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "deploy"
  deploy:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest
    environment: development
    env:
      DATABRICKS_HOST: ${{ secrets.DEPLOYMENT_TARGET_URL }}
      DATABRICKS_TOKEN:  ${{ secrets.DEPLOYMENT_TARGET_TOKEN }}
      REPO_PATH: /Workspace/Users/someone@example.com/workspace-builder
      DBFS_LIB_PATH: dbfs:/path/to/libraries/
      LATEST_WHEEL_NAME: latest_wheel_name.whl

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
    # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
    - uses: actions/checkout@v3

    - name: Setup Python
      uses: actions/setup-python@v3
      with:
        # Version range or exact version of Python to use, using SemVer's version range syntax.
        # Quoted so that YAML does not parse the version as a number.
        python-version: "3.8"

    # Download the Databricks CLI. See https://github.com/databricks/setup-cli
    - uses: databricks/setup-cli@main

    - name: Install mods
      run: |
        pip install pytest setuptools wheel

    - name: Extract branch name
      shell: bash
      # Write the step output "branch" to the GITHUB_OUTPUT file; the older set-output command is deprecated
      run: echo "branch=${GITHUB_REF#refs/heads/}" >> "$GITHUB_OUTPUT"
      id: extract_branch

    - name: Update Databricks Git folder
      run: |
        databricks repos update ${{env.REPO_PATH}} --branch "${{ steps.extract_branch.outputs.branch }}"

    - name: Build Wheel and send to Databricks DBFS workspace location
      run: |
        cd $GITHUB_WORKSPACE
        python setup.py bdist_wheel
        # The CLI installed by databricks/setup-cli exposes DBFS copies as "databricks fs cp"
        # (the standalone "dbfs" command belongs to the legacy CLI, which this workflow does not install).
        # There is only one wheel file. Copy it under its original versioned file name, overwriting
        # that version if it already exists; other files in the path are not affected.
        databricks fs cp --overwrite ./dist/* ${{env.DBFS_LIB_PATH}}
        # Copy the same wheel again under the fixed "latest" name, overwriting the previous latest build.
        databricks fs cp --overwrite ./dist/* ${{env.DBFS_LIB_PATH}}${{env.LATEST_WHEEL_NAME}}

Update the following environment variable values with your own:
- DBFS_LIB_PATH: The path in DBFS to the libraries (wheels) you will use in this automation, which starts with dbfs:. For example, dbfs:/mnt/myproject/libraries.
- REPO_PATH: The path in your Databricks workspace to the Git folder where notebooks will be updated.
- LATEST_WHEEL_NAME: The name of the last-compiled Python wheel file (.whl). This is used to avoid manually updating wheel versions in your Databricks jobs. For example, your_wheel-latest-py3-none-any.whl.
Select Commit changes… to commit the script as a GitHub Actions workflow. After the pull request for this workflow is merged, go to the Actions tab of the Git repository and confirm that the actions are successful.