This article is an introduction to CI/CD on Databricks. CI/CD falls under DevOps, a combination of development and operations tasks.
For an overview of CI/CD for machine learning projects on Databricks, see How does Databricks support CI/CD for machine learning?.
Continuous integration and continuous delivery/continuous deployment (CI/CD) refers to the process of developing and delivering software in short, frequent cycles through the use of automation pipelines. CI/CD is common to software development, but it is becoming increasingly necessary to data engineering and data science. By automating the building, testing, and deployment of code, development teams are able to deliver releases more frequently and reliably than with the manual processes still common to data engineering and data science teams.
Continuous integration begins with the practice of periodically committing code to a branch within a source code repository. Each commit is then merged with the commits from other developers to prevent version conflicts. Changes are further validated by creating a build and running automated tests against that build. This process ultimately results in a deployment package that you can deploy to a target environment, in this case, a Databricks workspace.
The project assets in a Databricks implementation that are typically included in a CI/CD pipeline are:
Notebooks: Databricks notebooks are often a key part of data engineering and data science workflows. You can use version control for notebooks, and also validate and test them as part of a CI/CD pipeline. You can run automated tests against notebooks to check whether they are functioning as expected.
Libraries: Manage the library dependencies required to run your code in deployment. Use version control on libraries and include them in automated testing and validation.
Workflows: Databricks Workflows consist of jobs that allow you to schedule and run automated tasks using notebooks or Spark jobs.
Infrastructure: Infrastructure includes clusters, workspaces, and storage. Infrastructure changes can be validated and tested as part of a CI/CD pipeline, ensuring that they are consistent and error-free.
Data pipelines: You can also include data pipelines in CI/CD automation, using Delta Live Tables, the framework in Databricks for declaring data pipelines. See What is Delta Live Tables?.
By including these assets in a CI/CD pipeline, you can automate data workflows for validation, testing, and deployment in a consistent way, enabling you to deliver high-quality data solutions faster and more efficiently.
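As a rough illustration of testing notebook logic, one common approach is to move the transformation into a plain function that a notebook calls, so a test runner such as pytest can exercise it outside the workspace. The sketch below assumes that layout. The function `clean_records` and its record shape are hypothetical and are not part of any Databricks API.

```python
# Hypothetical transformation that a notebook would import and call.
# Keeping it free of notebook-only state makes it testable in a CI job.
def clean_records(records):
    """Drop records without an id and normalize the name field."""
    cleaned = []
    for record in records:
        if record.get("id") is None:
            continue  # skip rows that cannot be keyed
        name = (record.get("name") or "").strip().lower()
        cleaned.append({"id": record["id"], "name": name})
    return cleaned


# pytest collects functions whose names start with "test_".
def test_clean_records_drops_missing_ids():
    rows = [{"id": 1, "name": " Ada "}, {"id": None, "name": "ghost"}]
    assert clean_records(rows) == [{"id": 1, "name": "ada"}]
```

Saved in a file such as `test_clean_records.py` (an assumed name), this would be picked up by running `pytest` in the build step of the pipeline.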
A typical configuration for a Databricks CI/CD pipeline includes the following steps.
Set up version control: Store your Databricks code and notebooks in a version control system like Git. This allows you to track changes over time and collaborate with other team members. See CI/CD techniques with Git and Databricks Repos.
Code: Develop code and unit tests in a Databricks notebook or using an external IDE.
Build: Automate the build process of your Databricks workspace using tools like Azure DevOps, Jenkins, or GitHub Actions. Through automation, you can build code consistently and integrate changes into your workspace.
Test: Develop and run automated tests to validate your code changes using tools like pytest or the Databricks CLI for automation.
Release: Generate a release package.
Deploy: Use a deployment tool like the Databricks CLI or the REST API to automate the deployment of your code changes to your Databricks workspace. You can also use the Azure DevOps release pipeline to deploy your code.
Monitor: Monitor the performance of your code and workflows in Databricks using tools like Azure Monitor or Datadog. This helps you identify and resolve any issues that arise in your production environment.
Iterate: Make small, frequent iterations to improve and update your data engineering or data science project. Small changes are easier to roll back than large ones.
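The Deploy step above can be sketched with the REST API. The example below is a minimal sketch, not a definitive implementation: it assumes the Workspace API import endpoint (`/api/2.0/workspace/import`) with a base64-encoded `content` field and a personal access token for authentication. The helper name `build_import_request`, the host, and the workspace path are hypothetical. The function only builds the request, so it can be inspected or tested without touching a workspace.

```python
import base64
import json
import urllib.request


def build_import_request(host, token, workspace_path, source_code):
    """Build (but do not send) a request that imports a Python notebook.

    Assumes the Workspace API import endpoint and its documented fields;
    check the API reference for your workspace before relying on it.
    """
    payload = {
        "path": workspace_path,
        "format": "SOURCE",
        "language": "PYTHON",
        "overwrite": True,
        # The notebook source is sent base64-encoded.
        "content": base64.b64encode(source_code.encode("utf-8")).decode("ascii"),
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/workspace/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

In a pipeline, the host and token would typically come from the CI system's secret store, and the request would be sent with `urllib.request.urlopen(request)`. The Databricks CLI is an alternative that wraps the same kind of call.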