CI/CD techniques with Git and Databricks Git folders (Repos)
Learn techniques for using Databricks Git folders in CI/CD workflows. By configuring Databricks Git folders in the workspace, you can use source control for project files in Git repositories and you can integrate them into your data engineering pipelines.
The following figure shows an overview of the techniques and workflow.
For an overview of CI/CD with Databricks, see What is CI/CD on Databricks?.
Development flow
Databricks Git folders have user-level folders. User-level folders are automatically created when users first clone a remote repository. You can think of Databricks Git folders in user folders as “local checkouts” that are individual for each user and where users make changes to their code.
In your user folder in Databricks Git folders, clone your remote repository. A best practice is to create a new feature branch or select a previously created branch for your work, instead of directly committing and pushing changes to the main branch. You can make changes, commit, and push changes in that branch. When you are ready to merge your code, you can do so in the Git folders UI.
Requirements
This workflow requires that you have already set up your Git integration.
Note
Databricks recommends that each developer works on their own feature branch. For information about how to resolve merge conflicts, see Resolve merge conflicts.
Collaborate in Git folders
The following workflow uses a branch called feature-b that is based on the main branch.
Clone your existing Git repository to your Databricks workspace.
Use the Git folders UI to create a feature branch from the main branch. This example uses a single feature branch feature-b for simplicity. You can create and use multiple feature branches to do your work.
Make your modifications to Databricks notebooks and other files in the repo.
Contributors can now clone the Git repository into their own user folder.
Working on a new branch, a coworker makes changes to the notebooks and other files in the Git folder.
The contributor commits and pushes their changes to the Git provider.
To merge changes from other branches or rebase the feature-b branch in Databricks, in the Git folders UI use one of the following workflows:
Merge branches. If there’s no conflict, the merge is pushed to the remote Git repository using git push.
When you are ready to merge your work to the remote Git repository and main branch, use the Git folders UI to merge the changes from feature-b. If you prefer, you can instead merge changes directly to the Git repository backing your Git folder.
Production job workflow
Databricks Git folders provide two options for running your production jobs:
Option 1: Provide a remote Git reference in the job definition. For example, run a specific notebook in the main branch of a Git repository.
Option 2: Set up a production Git repository and call Repos APIs to update it programmatically. Run jobs against the Databricks Git folder that clones this remote repository. The Repos API call should be the first task in the job.
Option 1: Run jobs using notebooks in a remote repository
Simplify the job definition process and keep a single source of truth by running a Databricks job using notebooks located in a remote Git repository. This Git reference can be a Git commit, tag, or branch and is provided by you in the job definition.
This helps prevent unintentional changes to your production job, such as when a user makes local edits in a production repository or switches branches. It also automates the CD step as you do not need to create a separate production Git folder in Databricks, manage permissions for it, and keep it updated.
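As a rough illustration of Option 1, the following Python sketch builds a job settings payload in the shape used by the Jobs API 2.1, where a git_source block names the remote repository and Git reference and the notebook task sets source to GIT. The job name, repository URL, and notebook path are hypothetical placeholders, and compute settings are left out, so treat it as a starting point rather than a complete job definition.

```python
import json


def build_git_job_settings(job_name, git_url, git_provider, notebook_path, branch="main"):
    """Return a job settings dict that runs a notebook from a remote Git reference."""
    return {
        "name": job_name,
        "git_source": {
            "git_url": git_url,
            "git_provider": git_provider,
            # Use "git_tag" or "git_commit" instead to pin a fixed version.
            "git_branch": branch,
        },
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {
                    # Path relative to the repository root (hypothetical value below).
                    "notebook_path": notebook_path,
                    "source": "GIT",
                },
                # Compute settings are intentionally omitted from this sketch.
            }
        ],
    }


# All argument values here are placeholders for illustration only.
settings = build_git_job_settings(
    job_name="nightly-etl",
    git_url="https://github.com/user/demo.git",
    git_provider="gitHub",
    notebook_path="notebooks/etl",
)
print(json.dumps(settings, indent=2))
```

Because the Git reference lives in the job definition, moving the job to a tagged release only requires changing that one field.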
Option 2: Set up a production Git folder and Git automation
In this option, you set up a production Git folder and automation to update the Git folder on merge.
Step 1: Set up top-level folders
The admin creates non-user top-level folders. The most common use case for these top-level folders is to create development, staging, and production folders that contain Databricks Git folders for the appropriate versions or branches for development, staging, and production. For example, if your company uses the main branch for production, the “production” Git folder must have the main branch checked out in it.
Typically, permissions on these top-level folders are read-only for all non-admin users in the workspace. For such top-level folders, we recommend that you grant CAN EDIT and CAN MANAGE permissions only to service principals, to avoid accidental edits to your production code by workspace users.
Step 2: Set up automated updates to Databricks Git folders with the Git folders API
To keep a Git folder in Databricks at the latest version, you can set up Git automation to call the Repos API. In your Git provider, set up automation that, after every successful merge of a PR into the main branch, calls the Repos API endpoint on the appropriate Git folder to update it to the latest version.
For example, on GitHub this can be achieved with GitHub Actions. For more information, see the Repos API.
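The update call itself is small. The sketch below, using only the Python standard library, prepares a PATCH request to the Repos API endpoint /api/2.0/repos/{repo_id} with the branch to check out. The host, token, and repo ID are hypothetical placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_repo_update_request(host, token, repo_id, branch):
    """Prepare a Repos API request that moves a Git folder to the head of a branch."""
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; in automation these would come from CI secrets.
request = build_repo_update_request(
    host="https://example.cloud.databricks.com",
    token="dapi-placeholder",
    repo_id=123456,
    branch="main",
)
print(request.get_method(), request.full_url)

# To actually send the request from your automation:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Running this after every merge into the main branch keeps the production Git folder at the latest commit without anyone editing it by hand.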
To call any Databricks REST API from within a Databricks notebook cell, first install the Databricks SDK with %pip install databricks-sdk --upgrade (for the latest Databricks REST APIs) and then import ApiClient from databricks.sdk.core.
Note
If %pip install databricks-sdk --upgrade returns an error that “The package could not be found”, then the databricks-sdk package has not been previously installed. Re-run the command without the --upgrade flag: %pip install databricks-sdk.
You can also run Databricks SDK APIs from a notebook to retrieve the service principals for your workspace. Here’s an example using Python and the Databricks SDK for Python.
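A minimal sketch of such a lookup follows, assuming the databricks-sdk package is installed and that WorkspaceClient can authenticate with the notebook's default credentials. The helper names are illustrative, not part of the SDK, and the sample object at the end is a stand-in so the formatting logic can be tried without a workspace.

```python
from types import SimpleNamespace


def summarize_service_principals(principals):
    """Return (display_name, application_id) pairs for an iterable of service principals."""
    return [(sp.display_name, sp.application_id) for sp in principals]


def list_workspace_service_principals():
    """Fetch service principals from the current workspace (requires databricks-sdk)."""
    # Imported inside the function so the helper above works without the SDK installed.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()  # Assumes default authentication, for example inside a notebook.
    return summarize_service_principals(w.service_principals.list())


# Offline demonstration with a stand-in object; the values are placeholders.
sample = [SimpleNamespace(display_name="ci-deployer", application_id="0000-placeholder")]
summary = summarize_service_principals(sample)
print(summary)
```

In a notebook, calling list_workspace_service_principals() would return the same pairs for the real service principals in the workspace.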
You can also use tools such as curl or Terraform. You cannot use the Databricks user interface.
To learn more about service principals on Databricks, see Manage service principals. For information about service principals and CI/CD, see Service principals for CI/CD. For more details on using the Databricks SDK from a notebook, read Use the Databricks SDK for Python from within a Databricks notebook.
Terraform integration
You can also manage Databricks Git folders in a fully automated setup using Terraform and databricks_repo:
resource "databricks_repo" "this" {
  url = "https://github.com/user/demo.git"
}
To use Terraform to add Git credentials to a service principal, add the following configuration:
provider "databricks" {
  # Configuration options
}

provider "databricks" {
  alias = "sp"
  host  = "https://....cloud.databricks.com"
  token = databricks_obo_token.this.token_value
}

resource "databricks_service_principal" "sp" {
  display_name = "service_principal_name_here"
}

resource "databricks_obo_token" "this" {
  application_id   = databricks_service_principal.sp.application_id
  comment          = "PAT on behalf of ${databricks_service_principal.sp.display_name}"
  lifetime_seconds = 3600
}

resource "databricks_git_credential" "sp" {
  provider              = databricks.sp
  depends_on            = [databricks_obo_token.this]
  git_username          = "myuser"
  git_provider          = "azureDevOpsServices"
  personal_access_token = "sometoken"
}
Configure an automated CI/CD pipeline with Databricks Git folders
Here is a simple automation that can be run as a GitHub Action.
Requirements
You have created a Git folder in a Databricks workspace that is tracking the base branch being merged into.
You have a Python package that creates the artifacts to place into a DBFS location. Your code must:
Update the repository associated with your preferred branch (such as development) to contain the latest versions of your notebooks.
Build any artifacts and copy them to the library path.
Replace the last versions of build artifacts to avoid having to manually update artifact versions in your job.
Steps
Note
Step 1 must be performed by an admin of the Git repository.
Set up secrets so your code can access the Databricks workspace. Add the following secrets to the GitHub repository:
DEPLOYMENT_TARGET_URL: Set it to the workspace URL, but do not include the /?o substring.
DEPLOYMENT_TARGET_TOKEN: Provide a Databricks personal access token (PAT) value. You can generate a Databricks PAT by following the instructions in Configure Git credentials & connect a remote repo to Databricks.
Navigate to the Actions tab of your Git repository and click the New workflow button. At the top of the page, select Set up a workflow yourself and paste in this script:
# This is a basic automation workflow to help you get started with GitHub Actions.

name: CI

# Controls when the workflow will run
on:
  # Triggers the workflow on push for main and dev branch
  push:
    branches:
      # Set your base branch name here
      - your-base-branch-name

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "deploy"
  deploy:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    env:
      DBFS_LIB_PATH: dbfs:/path/to/libraries/
      REPO_PATH: /Repos/path/here
      LATEST_WHEEL_NAME: latest_wheel_name.whl

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2

      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          # Version range or exact version of a Python version to use, using SemVer's version range syntax.
          python-version: 3.8

      - name: Install mods
        run: |
          pip install databricks-cli
          pip install pytest setuptools wheel

      - name: Configure CLI
        run: |
          echo "${{ secrets.DEPLOYMENT_TARGET_URL }}
          ${{ secrets.DEPLOYMENT_TARGET_TOKEN }}" | databricks configure --token

      - name: Extract branch name
        shell: bash
        run: echo "##[set-output name=branch;]$(echo ${GITHUB_REF#refs/heads/})"
        id: extract_branch

      - name: Update Databricks Git folder
        run: |
          databricks repos update --path ${{env.REPO_PATH}} --branch "${{ steps.extract_branch.outputs.branch }}"

      - name: Build Wheel and send to Databricks workspace DBFS location
        run: |
          cd $GITHUB_WORKSPACE
          python setup.py bdist_wheel
          # There is only one wheel file; the next command copies it with the original version number in the file name and overwrites it if that version of the wheel exists. It does not affect the other files in the path.
          dbfs cp --overwrite ./dist/* ${{env.DBFS_LIB_PATH}}
          # The next command copies the wheel file and overwrites the latest version with it.
          dbfs cp --overwrite ./dist/* ${{env.DBFS_LIB_PATH}}${{env.LATEST_WHEEL_NAME}}
Update the following environment variable values with your own:
DBFS_LIB_PATH: The path in DBFS to the libraries (wheels) you will use in this automation, which starts with dbfs:. For example, dbfs:/mnt/myproject/libraries/. Include the trailing slash, because the workflow appends LATEST_WHEEL_NAME directly to this value.
REPO_PATH: The path in your Databricks workspace to the Git folder where notebooks will be updated. For example, /Repos/Develop.
LATEST_WHEEL_NAME: The name of the last-compiled Python wheel file (.whl). This is used to avoid manually updating wheel versions in your Databricks jobs. For example, your_wheel-latest-py3-none-any.whl.
Select Commit changes… to commit the script as a GitHub Actions workflow. After the pull request for this workflow is merged, go to the Actions tab of the Git repository and confirm that the actions are successful.
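To see how DBFS_LIB_PATH and LATEST_WHEEL_NAME combine in the workflow's final step, the following Python sketch computes the two destinations that the copy commands write to, as the workflow's comments describe them: one keeps the versioned file name, and one uses the fixed "latest" name that jobs can reference. The function name and the wheel version 0.1.0 are hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath


def wheel_destinations(dbfs_lib_path, built_wheel, latest_wheel_name):
    """Return the (versioned, latest) DBFS destinations for a built wheel."""
    # The workflow concatenates the library path and file name directly,
    # so make sure the library path ends with a slash.
    if not dbfs_lib_path.endswith("/"):
        dbfs_lib_path += "/"
    versioned = dbfs_lib_path + PurePosixPath(built_wheel).name
    latest = dbfs_lib_path + latest_wheel_name
    return versioned, latest


# The built wheel name below is a hypothetical example.
versioned, latest = wheel_destinations(
    dbfs_lib_path="dbfs:/mnt/myproject/libraries",
    built_wheel="dist/your_wheel-0.1.0-py3-none-any.whl",
    latest_wheel_name="your_wheel-latest-py3-none-any.whl",
)
print(versioned)
print(latest)
```

A job that installs the library from the second path picks up each new build without any change to the job definition.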