Software engineering best practices for notebooks
This article provides a hands-on walkthrough that demonstrates how to apply software engineering best practices to your Databricks notebooks, including version control, code sharing, testing, and optionally continuous integration and continuous delivery or deployment (CI/CD).
In this walkthrough, you will:
Add notebooks to Databricks Git folders for version control.
Extract portions of code from one of the notebooks into a shareable module.
Test the shared code.
Run the notebooks from a Databricks job.
Optionally apply CI/CD to the shared code.
Requirements
To complete this walkthrough, you must provide the following resources:
A remote repository with a Git provider that Databricks supports. This walkthrough uses GitHub and assumes that you have a GitHub repository named best-notebooks available. (You can give your repository a different name. If you do, replace best-notebooks with your repo’s name throughout this walkthrough.) Create a GitHub repo if you do not already have one.
Note
If you create a new repo, be sure to initialize the repository with at least one file, for example a README file.
A Databricks workspace. Create a workspace if you do not already have one.
A Databricks all-purpose cluster in the workspace. To run notebooks during the design phase, you attach the notebooks to a running all-purpose cluster. Later on, this walkthrough uses a Databricks job to automate running the notebooks on this cluster. (You can also run jobs on job clusters that exist only for the jobs’ lifetimes.) Create an all-purpose cluster if you do not already have one.
Step 1: Set up Databricks Git folders
In this step, you connect your existing GitHub repo to Databricks Git folders in your existing Databricks workspace.
To enable your workspace to connect to your GitHub repo, you must first provide your workspace with your GitHub credentials, if you have not done so already.
Step 1.1: Provide your GitHub credentials
Click your username at the top right of the workspace, and then click Settings in the dropdown list.
In the Settings sidebar, under User, click Linked accounts.
Under Git integration, for Git provider, select GitHub.
Click Personal access token.
For Git provider username or email, enter your GitHub username.
For Token, enter your GitHub personal access token (classic). This personal access token (classic) must have the repo and workflow permissions.
Click Save.
Step 1.2: Connect to your GitHub repo
On the workspace sidebar, click Workspace.
In the Workspace browser, expand Workspace > Users.
Right-click your username folder, and then click Create > Git folder.
In the Create Git folder dialog:
For Git repository URL, enter the GitHub Clone with HTTPS URL for your GitHub repo. This article assumes that your URL ends with best-notebooks.git, for example https://github.com/<your-GitHub-username>/best-notebooks.git.
For Git provider, select GitHub.
Leave Git folder name set to the name of your repo, for example best-notebooks.
Click Create Git folder.
Step 2: Import and run the notebook
In this step, you import an existing external notebook into your repo. You could create your own notebooks for this walkthrough, but to speed things up we provide them for you here.
Step 2.1: Create a working branch in the repo
In this substep, you create a branch named eda in your repo. This branch enables you to work on files and code independently of your repo’s main branch, which is a software engineering best practice. (You can give your branch a different name.)
Note
In some repos, the main branch may be named master instead. If so, replace main with master throughout this walkthrough.
Tip
If you’re not familiar with working in Git branches, see Git Branches - Branches in a Nutshell on the Git website.
The Git folder from Step 1.2 should be open. If not, then in the Workspace sidebar, expand Workspace > Users, then expand your username folder, and click your Git folder.
Next to the folder name under the workspace navigation breadcrumb, click the main Git branch button.
In the best-notebooks dialog, click the Create branch button.
Note
If your repo has a name other than best-notebooks, this dialog’s title will be different, here and throughout this walkthrough.
Enter eda, and click Create.
Close this dialog.
Step 2.2: Import the notebook into the repo
In this substep, you import an existing notebook from another repo into your repo. This notebook does the following:
Copies a CSV file from the owid/covid-19-data GitHub repository onto a cluster in your workspace. This CSV file contains public data about COVID-19 hospitalizations and intensive care metrics from around the world.
Filters the data to contain metrics from only the United States.
Displays a plot of the data.
Saves the pandas DataFrame as a Pandas API on Spark DataFrame.
Performs data cleansing on the Pandas API on Spark DataFrame.
Writes the Pandas API on Spark DataFrame as a Delta table in your workspace.
Displays the Delta table’s contents.
While you could create your own notebook in your repo here, importing an existing notebook instead helps to speed up this walkthrough. To create a notebook in this branch or move an existing notebook into this branch instead of importing a notebook, see Workspace files basic usage.
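The notebook’s overall flow can be sketched in plain pandas on a tiny synthetic frame. This is an illustrative sketch only: the column names (iso_code, hosp_patients) and the exact cleansing step are assumptions rather than the notebook’s actual code, and the real notebook goes on to convert the result to a Pandas API on Spark DataFrame and write it as a Delta table, which is omitted here.

```python
import pandas as pd

# Hypothetical miniature stand-in for the owid/covid-19-data CSV;
# these column names are assumptions for illustration only.
raw = pd.DataFrame({
    "iso_code": ["USA", "USA", "ITA"],
    "date": ["2021-03-01", "2021-03-02", "2021-03-01"],
    "hosp_patients": [40000.0, None, 28000.0],
})

# Filter the data to contain metrics from only the United States.
usa = raw[raw["iso_code"] == "USA"].copy()

# Example cleansing step: drop rows with missing hospitalization counts.
usa = usa.dropna(subset=["hosp_patients"])

print(len(usa))  # 1 row remains after filtering and cleansing
```

In the imported notebook, the equivalent DataFrame would then flow through the Pandas API on Spark before being written out as a Delta table.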
From the best-notebooks Git folder, click Create > Folder.
In the New folder dialog, enter notebooks, and then click Create.
From the notebooks folder, click the kebab, then Import.
In the Import dialog:
For Import from, select URL.
Enter the URL to the raw contents of the covid_eda_raw notebook in the databricks/notebook-best-practices repo in GitHub. To get this URL:
Go to https://github.com/databricks/notebook-best-practices.
Click the notebooks folder.
Click the covid_eda_raw.py file.
Click Raw.
Copy the full URL from your web browser’s address bar over into the Import dialog.
Note
The Import dialog works with Git URLs for public repositories only.
Click Import.
Step 2.3: Run the notebook
If the notebook is not already showing, open the notebooks folder, and then click the covid_eda_raw notebook inside of the folder.
Select the cluster to attach this notebook to. For instructions on creating a cluster, see Create a cluster.
Click Run All.
Wait while the notebook runs.
After the notebook finishes running, in the notebook you should see a plot of the data as well as over 600 rows of raw data in the Delta table. If the cluster was not already running when you started running this notebook, it could take several minutes for the cluster to start up before displaying the results.
Step 2.4: Check in and merge the notebook
In this substep, you save your work so far to your GitHub repo. You then merge the notebook from your working branch into your repo’s main branch.
Next to the notebook’s name, click the eda Git branch button.
In the best-notebooks dialog, on the Changes tab, make sure the notebooks/covid_eda_raw.py file is selected.
For Commit message (required), enter Added raw notebook.
For Description (optional), enter This is the first version of the notebook.
Click Commit & Push.
Click the pull request link in Create a pull request on your git provider in the banner.
In GitHub, create the pull request, and then merge the pull request into the main branch.
Back in your Databricks workspace, close the best-notebooks dialog if it is still showing.
Step 5: Create a job to run the notebooks
In previous steps, you tested your shared code manually and ran your notebooks manually. In this step, you use a Databricks job to test your shared code and run your notebooks automatically, either on-demand or on a regular schedule.
Step 5.1: Create a job task to run the testing notebook
On the workspace sidebar, click Workflows.
On the Jobs tab, click Create Job.
Edit the name of the job to be covid_report.
For Task name, enter run_notebook_tests.
For Type, select Notebook.
For Source, select Git provider.
Click Add a git reference.
In the Git information dialog:
For Git repository URL, enter the GitHub Clone with HTTPS URL for your GitHub repo. This article assumes that your URL ends with best-notebooks.git, for example https://github.com/<your-GitHub-username>/best-notebooks.git.
For Git provider, select GitHub.
For Git reference (branch / tag / commit), enter main.
Next to Git reference (branch / tag / commit), select branch.
Click Confirm.
For Path, enter notebooks/run_unit_tests. Do not add the .py file extension.
For Cluster, select the cluster from the previous step.
Click Create task.
Note
In this scenario, Databricks does not recommend that you use the schedule button in the notebook as described in Create and manage scheduled notebook jobs to schedule a job to run this notebook periodically. This is because the schedule button creates a job by using the latest working copy of the notebook in the workspace repo. Instead, Databricks recommends that you follow the preceding instructions to create a job that uses the latest committed version of the notebook in the repo.
Step 5.2: Create a job task to run the main notebook
Click the + Add task icon.
A pop-up menu appears. Select Notebook.
For Task name, enter run_main_notebook.
For Type, select Notebook.
For Path, enter notebooks/covid_eda_modular. Do not add the .py file extension.
For Cluster, select the cluster from the previous step.
Verify that the Depends on value is run_notebook_tests.
Click Create task.
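For reference, the two tasks configured in the UI above map onto a job specification like the following sketch, whose shape follows the Databricks Jobs API 2.1 create request. The angle-bracket values are placeholders, and this payload is only meant to make the task wiring explicit, not something the walkthrough requires you to send:

```python
# Sketch of the "covid_report" job as a Databricks Jobs API 2.1 payload.
# <your-GitHub-username> and <your-cluster-id> are placeholders.
job_spec = {
    "name": "covid_report",
    "git_source": {
        "git_url": "https://github.com/<your-GitHub-username>/best-notebooks.git",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "run_notebook_tests",
            "notebook_task": {
                # Note: no .py file extension on notebook paths.
                "notebook_path": "notebooks/run_unit_tests",
                "source": "GIT",
            },
            "existing_cluster_id": "<your-cluster-id>",
        },
        {
            "task_key": "run_main_notebook",
            # The main notebook runs only after the tests task succeeds.
            "depends_on": [{"task_key": "run_notebook_tests"}],
            "notebook_task": {
                "notebook_path": "notebooks/covid_eda_modular",
                "source": "GIT",
            },
            "existing_cluster_id": "<your-cluster-id>",
        },
    ],
}

print(job_spec["tasks"][1]["depends_on"][0]["task_key"])  # run_notebook_tests
```

A payload of this shape could be POSTed to your workspace’s /api/2.1/jobs/create endpoint with a bearer token, which is an alternative to clicking through the UI.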
Step 5.3: Run the job
Click Run now.
In the pop-up, click View run.
Note
If the pop-up disappears too quickly, then do the following:
On the sidebar in the Data Science & Engineering or Databricks Mosaic AI environment, click Workflows.
On the Job runs tab, click the Start time value for the latest job with covid_report in the Jobs column.
To see the job results, click on the run_notebook_tests tile, the run_main_notebook tile, or both. The results on each tile are the same as if you ran the notebooks yourself, one by one.
Note
This job ran on-demand. To set up this job to run on a regular basis, see Trigger types for Databricks Jobs.
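If you later configure the job programmatically rather than through the UI, a recurring schedule is expressed as a cron block. A minimal sketch, assuming the Jobs API 2.1 schedule field; the cron expression and timezone here are illustrative choices, not values the walkthrough prescribes:

```python
# Sketch of a Jobs API 2.1 "schedule" block that runs a job daily at 06:00.
# The cron expression uses Quartz syntax:
#   seconds minutes hours day-of-month month day-of-week
schedule = {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "America/Los_Angeles",  # illustrative timezone
    "pause_status": "UNPAUSED",
}

print(sorted(schedule))
```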
(Optional) Step 6: Set up the repo to test the code and run the notebook automatically whenever the code changes
In the previous step, you used a job to automatically test your shared code and run your notebooks at a point in time or on a recurring basis. However, you may prefer to trigger tests automatically when changes are merged into your GitHub repo, using a CI/CD tool such as GitHub Actions.
Step 6.1: Set up GitHub access to your workspace
In this substep, you set up a GitHub Actions workflow that runs jobs in the workspace whenever changes are merged into your repository. You do this by giving GitHub a unique Databricks token for access.
For security reasons, Databricks discourages you from giving GitHub your Databricks workspace user’s personal access token; use a token associated with a service principal instead. For instructions, see the GCP section of the Run Databricks Notebook GitHub Action page in the GitHub Actions Marketplace.
Important
Notebooks are run with all of the workspace permissions of the identity that is associated with the token, so Databricks recommends using a service principal. If you really want to give your Databricks workspace user’s personal access token to GitHub for personal exploration purposes only, and you understand that for security reasons Databricks discourages this practice, see the instructions to create your workspace user’s personal access token.
Step 6.2: Add the GitHub Actions workflow
In this substep, you add a GitHub Actions workflow to run the run_unit_tests notebook whenever there is a pull request to the repo.
This substep stores the GitHub Actions workflow in a file that is stored within multiple folder levels in your GitHub repo. GitHub Actions requires a specific nested folder hierarchy to exist in your repo in order to work properly. To complete this step, you must use the website for your GitHub repo, because the Databricks Git folder user interface does not support creating nested folder hierarchies.
In the website for your GitHub repo, click the Code tab.
Click the arrow next to main to expand the Switch branches or tags drop-down list.
In the Find or create a branch box, enter adding_github_actions.
Click Create branch: adding_github_actions from ‘main’.
Click Add file > Create new file.
For Name your file, enter .github/workflows/databricks_pull_request_tests.yml.
In the editor window, enter the following code. This code uses the pull_request hook from the Run Databricks Notebook GitHub Action to run the run_unit_tests notebook.
In the following code, replace:
<your-workspace-instance-name> with your Databricks workspace instance name.
<your-access-token> with the token that you generated earlier.
<your-cluster-id> with your target cluster ID.

name: Run pre-merge Databricks tests

on:
  pull_request:

env:
  # Replace this value with your workspace instance name.
  DATABRICKS_HOST: https://<your-workspace-instance-name>

jobs:
  unit-test-notebook:
    runs-on: ubuntu-latest
    timeout-minutes: 15

    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Run test notebook
        uses: databricks/run-notebook@main
        with:
          databricks-token: <your-access-token>
          local-notebook-path: notebooks/run_unit_tests.py
          existing-cluster-id: <your-cluster-id>
          git-commit: "${{ github.event.pull_request.head.sha }}"
          # Grant all users view permission on the notebook's results, so that they can
          # see the result of the notebook, if they have related access permissions.
          access-control-list-json: >
            [
              {
                "group_name": "users",
                "permission_level": "CAN_VIEW"
              }
            ]
          run-name: "EDA transforms helper module unit tests"
Click Commit changes.
In the Commit changes dialog, enter Create databricks_pull_request_tests.yml into Commit message.
Select Commit directly to the adding_github_actions branch, and then click Commit changes.
On the Code tab, click Compare & pull request, and then create the pull request.
On the pull request page, wait for the icon next to Run pre-merge Databricks tests / unit-test-notebook (pull_request) to display a green check mark. (It may take a few moments for the icon to appear.) If there is a red X instead of a green check mark, click Details to find out why. If the icon or Details are no longer showing, click Show all checks.
If the green check mark appears, merge the pull request into the main branch.