What is CI/CD on Databricks?
This article is an introduction to CI/CD on Databricks. Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering software in short, frequent cycles through the use of automation pipelines. CI/CD is common in software development and is becoming increasingly necessary in data engineering and data science. By automating the building, testing, and deployment of code, development teams can deliver releases more reliably than with the manual processes still common on data engineering and data science teams.
Databricks recommends using Databricks Asset Bundles for CI/CD. Bundles enable the development and deployment of complex data, analytics, and ML projects for the Databricks platform. They let you manage many custom configurations and automate builds, tests, and deployments of your projects to Databricks development, staging, and production workspaces.
For an overview of CI/CD for machine learning projects on Databricks, see How does Databricks support CI/CD for machine learning?
What’s in a CI/CD pipeline on Databricks?
You can use Databricks Asset Bundles to define and programmatically manage your Databricks CI/CD implementation, which usually includes:
Notebooks: Databricks notebooks are often a key part of data engineering and data science workflows. You can use version control for notebooks, and also validate and test them as part of a CI/CD pipeline. You can run automated tests against notebooks to check whether they are functioning as expected.
Libraries: Manage the library dependencies required to run your deployed code. Use version control on libraries and include them in automated testing and validation.
Workflows: Workflows on Databricks consist of jobs, which let you schedule and run automated tasks using notebooks or Spark jobs.
Data pipelines: You can also include data pipelines in CI/CD automation, using Delta Live Tables, the framework in Databricks for declaring data pipelines.
Infrastructure: Infrastructure configuration includes definitions and provisioning information for clusters, workspaces, and storage for target environments. Infrastructure changes can be validated and tested as part of a CI/CD pipeline, which helps catch inconsistencies and errors before they reach a target environment.
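To illustrate the notebook and library testing described above, here is a minimal sketch of one common approach: keep transformation logic in a plain Python function that a notebook imports, so a test runner such as pytest can exercise it without a cluster. The function name `dedupe_latest`, the record fields, and the sample data are hypothetical and chosen only for illustration; they are not part of any Databricks API.

```python
from typing import Dict, Iterable, List


def dedupe_latest(records: Iterable[dict]) -> List[dict]:
    """Keep only the most recent record per id (hypothetical transformation)."""
    latest: Dict[int, dict] = {}
    for rec in records:
        current = latest.get(rec["id"])
        # ISO-formatted date strings compare correctly as plain strings.
        if current is None or rec["updated_at"] > current["updated_at"]:
            latest[rec["id"]] = rec
    # Sort by id so the output order is deterministic.
    return [latest[key] for key in sorted(latest)]


def test_dedupe_latest_keeps_newest_record():
    # pytest collects functions whose names start with "test_".
    rows = [
        {"id": 1, "updated_at": "2024-01-01", "value": "old"},
        {"id": 1, "updated_at": "2024-02-01", "value": "new"},
        {"id": 2, "updated_at": "2024-01-15", "value": "only"},
    ]
    result = dedupe_latest(rows)
    assert [r["value"] for r in result] == ["new", "only"]
```

Because the function has no dependency on a Spark session or workspace, the same test can run locally and in a CI job before anything is deployed.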
Steps for CI/CD on Databricks
A typical flow for a Databricks CI/CD pipeline includes the following steps:
Store: Store your Databricks code and notebooks in a version control system like Git. This allows you to track changes over time and collaborate with other team members. See CI/CD techniques with Git and Databricks Git folders (Repos) and bundle Git settings.
Code: Develop code and unit tests in a Databricks notebook in the workspace or locally using an external IDE. Databricks provides a Visual Studio Code extension that makes it easy to develop and deploy changes to Databricks workspaces.
Build: Use Databricks Asset Bundles settings to automatically build certain artifacts during deployments. See artifacts.
Deploy: Deploy changes to the Databricks workspace using Databricks Asset Bundles in conjunction with tools like Azure DevOps, Jenkins, or GitHub Actions. See Databricks Asset Bundle deployment modes.
Test: Develop and run automated tests to validate your code changes using tools like pytest.
Run: Use the Databricks CLI in conjunction with Databricks Asset Bundles to automate runs in your Databricks workspaces. See Run a bundle.
Monitor: Monitor the performance of your code and workflows in Databricks using tools like Azure Monitor or Datadog. This helps you identify and resolve any issues that arise in your production environment.
Iterate: Make small, frequent iterations to improve and update your data engineering or data science project. Small changes are easier to roll back than large ones.
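The deploy and run steps above can be sketched as a small Python wrapper that a CI job might call. This is an illustrative sketch, not a recommended implementation: it assumes the Databricks CLI is installed and authenticated on the CI runner and that a bundle configuration exists in the working directory. The target name `dev` and the resource key `my_job` are hypothetical placeholders, and the `dry_run` flag is an addition for this example.

```python
import subprocess
from typing import List


def bundle_commands(target: str, resource_key: str) -> List[List[str]]:
    """Build the validate, deploy, and run commands for one bundle target."""
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
        ["databricks", "bundle", "run", "-t", target, resource_key],
    ]


def run_pipeline(target: str, resource_key: str, dry_run: bool = True) -> List[List[str]]:
    """Run each command in order, stopping at the first failure."""
    commands = bundle_commands(target, resource_key)
    for cmd in commands:
        if dry_run:
            # Print instead of executing, which is useful when checking the script itself.
            print("would run:", " ".join(cmd))
        else:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so a failed validate or deploy stops the later steps.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # "dev" and "my_job" are hypothetical; substitute names from your own bundle.
    run_pipeline("dev", "my_job", dry_run=True)
```

Running validation before deployment means a configuration error fails the CI job early, before any workspace resources are changed.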