Pulumi Databricks resource provider
Note
This article covers Pulumi, which is developed by a third party. To contact the provider, see Pulumi Support.
This article shows you how to use Python and Pulumi, a third-party, infrastructure as code (IaC) platform that enables you to create, deploy, and manage Databricks resources by using familiar programming languages, tools, and engineering practices. Although this article shows you how to use Python and the Pulumi Databricks resource provider, Pulumi supports other languages in addition to Python for Databricks, including TypeScript, JavaScript, Go, and C#.
The Pulumi Databricks resource provider is based on the Databricks Terraform provider. For more information, see Terraform Cloud.
Requirements
A Pulumi account. Sign up for Pulumi, if you do not already have a Pulumi account. Pulumi is free for individuals, and offers a free tier for teams.
Python 3.6 or higher. To check whether you have Python installed, run the command python --version from your terminal or with PowerShell. Install Python if it is not already installed.
Note
Some installations of Python may require you to use python3 instead of python. If so, substitute python3 for python throughout this article.
Your Databricks workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com.
Databricks access credentials. Pulumi Databricks projects support the following Databricks authentication types:
Databricks personal access token authentication (databricks:authType pat)
See Configuration in the Pulumi Databricks repository in GitHub.
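As a complement to running python --version by hand, the Python version requirement above can also be checked from inside Python itself. The following is a minimal sketch that uses only the standard library; the helper name meets_minimum and the MINIMUM_VERSION constant are illustrative names chosen for this example, not part of Pulumi or Databricks.

```python
import sys

# The minimum Python version stated in the requirements above.
MINIMUM_VERSION = (3, 6)

def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum

if __name__ == "__main__":
    if meets_minimum():
        print("This Python interpreter satisfies the stated requirement.")
    else:
        print("This Python interpreter is older than 3.6.")
```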
The following steps show how to create a Pulumi Databricks project with Python. For a tutorial from a purely cloud provider-first perspective instead, see Get Started with Google Cloud in the Pulumi documentation. For a tutorial from a programming language-first perspective instead, see Python, Node.js (JavaScript, TypeScript), Go, and .NET (C#, VB, F#) in the Pulumi documentation.
Step 1: Create a Pulumi project
In this step, on your local development machine you set up the necessary directory structure for a Pulumi project. You then create your Pulumi project within this directory structure.
From your terminal or with PowerShell, create an empty directory, and then switch to it, for example:
Unix, Linux, or macOS:
mkdir pulumi-demo
cd pulumi-demo
Windows:
md pulumi-demo
cd pulumi-demo
Install Pulumi by running the following command, depending on your operating system:
Install Pulumi on Unix or Linux by using curl:
curl -fsSL https://get.pulumi.com | sh
Install Pulumi on macOS by using Homebrew:
brew install pulumi/tap/pulumi
Install Pulumi on Windows by using PowerShell with elevated permissions through the Chocolatey package manager:
choco install pulumi
For alternative Pulumi installation options, see Download and Install in the Pulumi documentation.
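After installing, it can be useful to confirm that the pulumi executable is discoverable on your PATH before you continue. The following is a minimal sketch that uses only the Python standard library; the helper name find_cli is an illustrative name for this example and is not part of Pulumi.

```python
import shutil

def find_cli(name="pulumi"):
    """Return the full path of an executable on PATH, or None if it is not found."""
    return shutil.which(name)

if __name__ == "__main__":
    path = find_cli()
    if path is None:
        print("pulumi was not found on PATH; reopen your terminal or check the installation.")
    else:
        print(f"pulumi found at {path}")
```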
Create a basic Python Pulumi project by running the following command:
pulumi new python
Tip
You can also create a Pulumi project from your Pulumi account online (Projects > Create project). However, there is no project template for Databricks.
If prompted, press your Enter key and then use your web browser to sign in to your Pulumi account online, if you are not already signed in. After you sign in, return to your terminal or PowerShell.
When prompted for a project name, accept the default project name of pulumi-demo by pressing Enter.
When prompted for a project description, enter A demo Python Pulumi Databricks project and press Enter.
When prompted for a stack name, accept the default stack name of dev by pressing Enter. Pulumi creates the following files and subdirectory in your pulumi-demo directory:
Pulumi.yaml, which is a list of settings for your Pulumi project.
__main__.py, which contains the Python code that you write for your Pulumi project.
requirements.txt, which is a list of supporting Python code packages that Pulumi installs for your project.
.gitignore, which is a list of files and directories that Git ignores if you want to push this project into a remote Git repository.
The venv subdirectory, which contains supporting Python virtual environment code that Pulumi uses for your project.
Perform an initial deployment of your project’s dev stack by running the following command:
pulumi up
When prompted to perform this update, press your up arrow key to navigate to yes and then press Enter.
Copy the View Live link that appears and paste it into your web browser’s address bar, which takes you to your Pulumi account online. The dev stack’s activity details for your pulumi-demo project appear. There is not much to see right now, because there aren’t any resources in your stack yet. You create these resources in the next step.
Step 2: Create Databricks resources
In this step, you use the Pulumi Databricks resource provider to create, in your existing Databricks workspace, a notebook and a job to run that notebook.
In the __main__.py file that Pulumi generated, use your preferred text editor or integrated development environment (IDE) to enter the following code. This code declares the Pulumi Databricks Notebook and Job resources and their settings:

"""A Python Pulumi program"""

import pulumi
from pulumi_databricks import *
from base64 import b64encode

# Get the authenticated user's workspace home directory path and email address.
# See https://www.pulumi.com/registry/packages/databricks/api-docs/getcurrentuser
user_home_path     = get_current_user().home
user_email_address = get_current_user().user_name

# Define the name prefix to prepend to the resource names that are created
# for the Notebook and Job resources. To do this, you can use a Pulumi
# configuration value instead of hard-coding the name prefix in this file.
#
# To set a Pulumi configuration value, run the following command, which sets
# a "resource-prefix" configuration value to "pulumi-demo" in the
# associated "Pulumi.<stack-name>.yaml" configuration file:
#
# pulumi config set resource-prefix "pulumi-demo"
#
# For more information about defining and retrieving hard-coded values, see
# https://www.pulumi.com/docs/intro/concepts/config
config = pulumi.config.Config()
resource_prefix = config.require('resource-prefix')

# Define cluster resource settings.
node_type = config.require('node-type')

# Create a Notebook resource.
# See https://www.pulumi.com/registry/packages/databricks/api-docs/notebook
# This example adds a single cell to the notebook, which is constructed from
# a single base64-encoded string. In practice, you would replace this:
#
# language       = "PYTHON",
# content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
#
# With this:
#
# source         = "path/to/local/my-notebook.py"
#
# to provide more notebook content more easily and quickly. Also, the notebook's
# language is automatically detected. If you specify a notebook path, be sure that
# it does not end in .ipynb, as Pulumi relies on the workspace import API, which
# doesn't rely on any specific extensions such as .ipynb in the notebook path.
notebook = Notebook(
  resource_name  = f"{resource_prefix}-notebook",
  path           = f"{user_home_path}/Pulumi/{resource_prefix}-notebook.py",
  language       = 'PYTHON',
  content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
)

# Export the URL of the Notebook, so that you can easily browse to it later.
# See https://www.pulumi.com/docs/intro/concepts/stack/#outputs
pulumi.export('Notebook URL', notebook.url)

# Create a Job resource.
# See https://www.pulumi.com/registry/packages/databricks/api-docs/job
# This job uses the most recent Databricks Runtime long-term support (LTS)
# runtime programmatic version ID at the time this article was first published,
# which is 14.3.x-scala2.12. You can replace this with a later version.
job = Job(
  resource_name = f"{resource_prefix}-job",
  name = f"{resource_prefix}-job",
  tasks = [
    JobTaskArgs(
      task_key = f"{resource_prefix}-task",
      new_cluster = JobNewClusterArgs(
        num_workers   = 1,
        spark_version = "14.3.x-scala2.12",
        node_type_id  = node_type
      ),
      notebook_task = JobNotebookTaskArgs(
        notebook_path = f"{user_home_path}/Pulumi/{resource_prefix}-notebook.py"
      )
    )
  ],
  email_notifications = JobEmailNotificationsArgs(
    on_successes = [ user_email_address ],
    on_failures  = [ user_email_address ]
  )
)

# Export the URL of the Job, so that you can easily browse to it later.
# See https://www.pulumi.com/docs/intro/concepts/stack/#outputs
pulumi.export('Job URL', job.url)
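To make the content_base64 argument in the preceding code more concrete, the following standalone sketch shows what the b64encode(...).decode("UTF-8") expression produces for the one-line notebook, and how to decode it back to check the result. It uses only the Python standard library and does not call Pulumi or Databricks; the variable names are illustrative.

```python
from base64 import b64decode, b64encode

# The same one-line notebook source used in the Notebook resource.
notebook_source = b"display(spark.range(10))"

# b64encode returns bytes; decode() turns them into the str that
# the content_base64 argument expects.
content_base64 = b64encode(notebook_source).decode("UTF-8")

# Decoding the string recovers the original notebook source.
round_tripped = b64decode(content_base64)

print(content_base64)  # ZGlzcGxheShzcGFyay5yYW5nZSgxMCkp
print(round_tripped == notebook_source)  # True
```

The printed string is the same value that appears as the mocked content_base64 in the Testing section of this article.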
Define a configuration value named resource-prefix, and set it to the hard-coded value of pulumi-demo, by running the following command. Pulumi uses this configuration value to name the notebook and job:
pulumi config set resource-prefix "pulumi-demo"
Pulumi creates a file named Pulumi.dev.yaml in the same directory as the __main__.py file and adds the following code to this YAML file:
config:
  pulumi-demo:resource-prefix: pulumi-demo
Using configuration values enables your code to be more modular and reusable. Now someone else can reuse your __main__.py file and define a different value for the resource_prefix variable without changing the contents of the __main__.py file.
Define a configuration value named node-type, and set it to the following hard-coded value, by running the following command. Pulumi uses this configuration value to determine the type of cluster that the job runs on.
pulumi config set node-type "n1-standard-4"
The contents of the Pulumi.dev.yaml file now look like this:
config:
  pulumi-demo:node-type: n1-standard-4
  pulumi-demo:resource-prefix: pulumi-demo
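The keys in Pulumi.dev.yaml are namespaced with the project name, which is why the program can ask for resource-prefix while the file stores pulumi-demo:resource-prefix. The following sketch imitates that lookup with a plain dictionary so that you can see why the key spelling matters. It is an illustration only and is not Pulumi's implementation; the names stack_config, qualified_key, and require are hypothetical helpers written for this example.

```python
# A plain-dictionary stand-in for the "config" section of Pulumi.dev.yaml.
stack_config = {
    "pulumi-demo:node-type": "n1-standard-4",
    "pulumi-demo:resource-prefix": "pulumi-demo",
}

def qualified_key(project, key):
    """Prefix a bare key with the project name, unless it is already namespaced."""
    return key if ":" in key else f"{project}:{key}"

def require(project, key):
    """Return the value for a key, or raise KeyError if it was never set."""
    full_key = qualified_key(project, key)
    if full_key not in stack_config:
        raise KeyError(f"Missing required configuration value: {full_key}")
    return stack_config[full_key]

print(require("pulumi-demo", "resource-prefix"))  # pulumi-demo
print(require("pulumi-demo", "node-type"))        # n1-standard-4
```

In this sketch, asking for resource_prefix (with an underscore) raises KeyError, because only the hyphenated key was set. The key name passed to pulumi config set and the name the program requires need to match exactly.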
To enable Pulumi to authenticate with your Databricks workspace, define Databricks-specific configuration values by running the related commands. For example, for Databricks personal access token authentication, run the following commands. In these commands:
Replace <workspace-instance-url> with your workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com.
Replace <access-token> with your access token’s value. Be sure to specify the --secret option. This instructs Pulumi to encrypt your access token as a security best practice.
Note
By default, Pulumi uses a per-stack encryption key managed by the Pulumi Service, and a per-value salt, to encrypt values. To use an alternative encryption provider, see Configuring Secrets Encryption in the Pulumi documentation.
pulumi config set databricks:host "<workspace-instance-url>"
pulumi config set databricks:token "<access-token>" --secret
The contents of the Pulumi.dev.yaml file now look like this:
config:
  databricks:host: <your-workspace-instance-url>
  databricks:token:
    secure: <an-encrypted-version-of-your-access-token>
  pulumi-demo:node-type: n1-standard-4
  pulumi-demo:resource-prefix: pulumi-demo
To use a different Databricks authentication type, see the Requirements. See also Configuration in the Pulumi Databricks repository in GitHub.
Step 3: Deploy the resources
In this step, you activate a Python virtual environment that Pulumi provides for your project as part of running the Pulumi Python project template. This virtual environment helps ensure that you are using the correct version of Python, Pulumi, and the Pulumi Databricks resource provider together. There are several Python virtual environment frameworks available, such as venv, virtualenv, and pipenv. This article and the Pulumi Python project template use venv, which is already included with Python. For more information, see Creating Virtual Environments.
Activate the Python virtual environment by running the following command from your pulumi-demo directory, depending on your operating system and shell type:

Platform             Shell             Command to activate virtual environment
Unix, Linux, macOS   bash/zsh          source venv/bin/activate
                     fish              source venv/bin/activate.fish
                     csh/tcsh          source venv/bin/activate.csh
                     PowerShell Core   venv/bin/Activate.ps1
Windows              cmd.exe           venv\Scripts\activate.bat
                     PowerShell        venv\Scripts\Activate.ps1
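If you are unsure whether the activation command took effect, you can ask the Python interpreter itself. The following is a minimal sketch that uses only the standard library: inside an environment created by venv, sys.prefix points at the environment while sys.base_prefix points at the base Python installation. The function name in_virtual_environment is an illustrative name for this example.

```python
import sys

def in_virtual_environment():
    """Return True if this interpreter is running inside a venv-style virtual environment."""
    # Outside a virtual environment, sys.prefix and sys.base_prefix are the same.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base_prefix

if __name__ == "__main__":
    if in_virtual_environment():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment is active for this interpreter.")
```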
Install the Pulumi Databricks resource provider from the Python Package Index (PyPI) into your virtual environment by running the following command:
pip install pulumi-databricks
Note
Some installations of pip may require you to use pip3 instead of pip. If so, substitute pip3 for pip throughout this article.
Preview the resources that Pulumi will create by running the following command:
pulumi preview
If there are any errors reported, fix them and then run the command again.
To view a detailed report in your Pulumi account online of what Pulumi will do, copy the View Live link that appears and paste it into your web browser’s address bar.
Create and deploy the resources to your Databricks workspace by running the following command:
pulumi up
When prompted to perform this update, press your up arrow key to navigate to yes and then press Enter. If there are any errors reported, fix them and then run the command again.
To view a detailed report in your Pulumi account online of what Pulumi did, copy the View Live link that appears and paste it into your web browser’s address bar.
Step 4: Interact with the resources
In this step, you run the job in your Databricks workspace, which runs the specified notebook.
To view the notebook that the job will run in your workspace, copy the Notebook URL link that appears and paste it into your web browser’s address bar.
To view the job that runs the notebook in your workspace, copy the Job URL link that appears and paste it into your web browser’s address bar.
To run the job, click the Run now button on the job page.
After the job finishes running, to view the job run’s results, in the Completed runs (past 60 days) list on the job page, click the most recent time entry in the Start time column. The Output pane shows the result of running the notebook’s code, which displays the numbers 0 through 9.
(Optional) Step 5: Make changes to a resource
In this optional step, you change the notebook’s code, redeploy the changed notebook, and then use the job to rerun the changed notebook.
If you do not want to make any changes to the notebook, skip ahead to Step 6: Clean up.
Back in the __main__.py file, change this line of code:
content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
To this, and then save the file:
  content_base64 = b64encode(b'''
data = [
  { "Category": 'A', "ID": 1, "Value": 121.44 },
  { "Category": 'B', "ID": 2, "Value": 300.01 },
  { "Category": 'C', "ID": 3, "Value": 10.99 },
  { "Category": 'E', "ID": 4, "Value": 33.87}
]

df = spark.createDataFrame(data)

display(df)
''').decode("UTF-8")
This change instructs the notebook to display the contents of the specified DataFrame instead of the numbers 0 through 9.
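Because everything inside the triple-quoted literal becomes notebook source verbatim, any indentation you add inside it ends up in the notebook. One possible alternative to keeping the literal flush with the left margin, offered here as a sketch rather than as the approach this article uses, is to write the source as an indented string and strip the common indentation with textwrap.dedent from the standard library before encoding. The variable names are illustrative.

```python
import textwrap
from base64 import b64decode, b64encode

# Notebook source written with indentation that matches the surrounding code.
indented_source = '''
    data = [1, 2, 3]
    print(data)
'''

# dedent() removes the whitespace common to every line; strip() drops the
# blank first line that follows the opening quotes.
clean_source = textwrap.dedent(indented_source).strip() + "\n"

# Encode the cleaned source the same way content_base64 is built elsewhere.
content_base64 = b64encode(clean_source.encode("UTF-8")).decode("UTF-8")

# Decoding confirms that no leading whitespace survived.
print(b64decode(content_base64).decode("UTF-8"))
```

This keeps the Pulumi program readable while producing notebook code that starts at column zero.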
Note
Make sure that the lines of code beginning with data and ending with ''').decode("UTF-8") are aligned with the edge of your code editor. Otherwise, Pulumi will insert additional whitespace into the notebook that may cause the new Python code to fail to run.
Optionally, preview the resource that Pulumi will change by running the following command:
pulumi preview
If there are any errors reported, fix them and then run the command again.
To view a detailed report in your Pulumi account online of what Pulumi will do, copy the View Live link that appears and paste it into your web browser’s address bar.
Deploy the resource change to your Databricks workspace by running the following command:
pulumi up
When prompted to perform this update, press your up arrow key to navigate to yes and then press Enter. If there are any errors reported, fix them and then run the command again.
To view a detailed report in your Pulumi account online of what Pulumi did, copy the View Live link that appears and paste it into your web browser’s address bar.
To view the changed notebook in your workspace, copy the Notebook URL link that appears and paste it into your web browser’s address bar.
To rerun the job with the changed notebook, copy the Job URL link that appears and paste it into your web browser’s address bar. Then click the Run now button on the job page.
After the job finishes running, to view the job run’s results, in the Completed runs (past 60 days) list on the job page, click the most recent time entry in the Start time column. The Output pane shows the result of running the notebook’s code, which prints the contents of the specified DataFrame.
Step 6: Clean up
In this step, you instruct Pulumi to remove the notebook and job from your Databricks workspace as well as remove the pulumi-demo project and its dev stack from your Pulumi account online.
Remove the resources from your Databricks workspace by running the following command:
pulumi destroy
When prompted to perform this removal, press your up arrow key to navigate to yes and then press Enter.
Remove the Pulumi pulumi-demo project and its dev stack from your Pulumi account online by running the following command:
pulumi stack rm dev
When prompted to perform this removal, type dev and then press Enter.
To deactivate the venv Python virtual environment, run the following command:
deactivate
Testing
You can test your Pulumi project before you deploy it. See Testing Pulumi programs in the Pulumi documentation.
For unit testing of Python-based Pulumi projects, you can write and run unit tests by using the Python test framework unittest along with the Pulumi package’s pulumi.runtime namespace. To run tests against simulated resources, you replace calls to Pulumi (and to Databricks) with mocks. See Unit testing Pulumi programs in the Pulumi documentation.
The following example file named infra.py mocks an implementation of the notebook and job declared in this article’s __main__.py file. The unit tests in this example check whether the Base64-encoded content of the notebook, the job’s name, and the email recipient for successful job runs all return expected values. As such, only those related properties are mocked here with example values. Also, required resource property values must always be provided, even if you do not plan to use them in your unit tests. In this example, these required values are set to placeholder my-mock- values, and those values are not tested.
# infra.py

from pulumi_databricks import (
  Notebook,
  Job,
  JobEmailNotificationsArgs
)

notebook = Notebook(
  resource_name  = 'my-mock-notebook-resource-name',
  path           = 'my-mock-notebook-path',
  content_base64 = 'ZGlzcGxheShzcGFyay5yYW5nZSgxMCkp'
)

job = Job(
  resource_name = 'my-mock-job-resource-name',
  name          = 'pulumi-demo-job',
  email_notifications = JobEmailNotificationsArgs(
    on_successes = [ 'someone@example.com' ]
  )
)
The following example file test_main.py tests whether the related properties return their expected values.
# test_main.py

import pulumi
from pulumi_databricks import *
import unittest

# Set up mocking.
# Note: recent Pulumi SDK versions pass a single args object to new_resource
# and call; adjust these signatures if your installed version requires it.
class MyMocks(pulumi.runtime.Mocks):
  def new_resource(self, type_, name, inputs, provider, id_):
    return [name + '_id', inputs]

  def call(self, token, args, provider):
    return {}

pulumi.runtime.set_mocks(MyMocks())

# Import the code under test only after the mocks are set, because infra.py
# creates its resources at import time.
import infra

class TestNotebookAndJob(unittest.TestCase):
  @pulumi.runtime.test
  def test_notebook(self):
    def check_notebook_content_base64(args):
      content_base64 = args
      # Does the notebook's Base64-encoded content match the expected value?
      self.assertIn('ZGlzcGxheShzcGFyay5yYW5nZSgxMCkp', content_base64)

    # Pass the mocked notebook's content_base64 property value to the test.
    return pulumi.Output.all(infra.notebook.content_base64).apply(check_notebook_content_base64)

  @pulumi.runtime.test
  def test_job(self):
    def check_job_name_and_email_onsuccesses(args):
      name, email_notifications = args
      # Does the job's name match the expected value?
      self.assertIn('pulumi-demo-job', name)
      # Does the email address for successful job runs match the expected value?
      self.assertIn('someone@example.com', email_notifications['on_successes'])

    # Pass into the test the mocked job's property values for the job's name
    # and the job's email address for successful runs.
    return pulumi.Output.all(
      infra.job.name,
      infra.job.email_notifications
    ).apply(check_job_name_and_email_onsuccesses)
To run these tests and to display their test results, run the following command from the Pulumi project’s root directory:
python -m unittest
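Alongside the mocked Pulumi tests, you may also want a check that does not depend on Pulumi at all, for example to confirm that a Base64 fixture really contains the notebook code you expect. The following is a minimal sketch of such a test that uses only the Python standard library; the class name TestNotebookFixture and the constant MOCK_CONTENT_BASE64 are illustrative names for this example. It verifies that the fixture decodes to the expected source and that the source is syntactically valid Python.

```python
import ast
import unittest
from base64 import b64decode

# The same Base64 string that the mocked notebook uses for content_base64.
MOCK_CONTENT_BASE64 = 'ZGlzcGxheShzcGFyay5yYW5nZSgxMCkp'

class TestNotebookFixture(unittest.TestCase):
    def test_fixture_decodes_to_expected_source(self):
        source = b64decode(MOCK_CONTENT_BASE64).decode('UTF-8')
        self.assertEqual(source, 'display(spark.range(10))')

    def test_fixture_is_valid_python_syntax(self):
        source = b64decode(MOCK_CONTENT_BASE64).decode('UTF-8')
        # ast.parse only checks syntax; it does not run the notebook code.
        ast.parse(source)

if __name__ == '__main__':
    unittest.main()
```

Because this file imports neither pulumi nor pulumi_databricks, it runs with the same python -m unittest command without requiring any mocks.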
For information about other kinds of tests that you can run, see the following articles in the Pulumi documentation: