Pulumi Databricks resource provider

Note

This article covers Pulumi, which is neither provided nor supported by Databricks. To contact the provider, see Pulumi Support.

This article shows you how to use Python and Pulumi, a third-party, infrastructure as code (IaC) platform that enables you to create, deploy, and manage Databricks resources by using familiar programming languages, tools, and engineering practices. Although this article shows you how to use Python and the Pulumi Databricks resource provider, Pulumi supports other languages in addition to Python for Databricks, including TypeScript, JavaScript, Go, and C#.

The Pulumi Databricks resource provider is based on the Databricks Terraform provider. For more information, see Terraform Cloud.

Requirements

  • A Pulumi account. Sign up for Pulumi, if you do not already have a Pulumi account. Pulumi is free for individuals, and offers a free tier for teams.

  • Python 3.6 or higher. To check whether you have Python installed, run the command python --version from your terminal or with PowerShell. Install Python, if you do not have it already installed.

    Note

    Some installations of Python may require you to use python3 instead of python. If so, substitute python3 for python throughout this article.

  • Your Databricks workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com.

  • A Databricks personal access token for your Databricks workspace, to enable Pulumi to automatically build, deploy, and manage Databricks resources on your behalf. To create a Databricks personal access token, see Generate a personal access token and Manage personal access tokens.
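
The Python version requirement above can also be checked from within Python itself. The following is a minimal sketch; the function name meets_minimum_version is hypothetical and is not part of Pulumi or Databricks tooling.

```python
import sys

# Minimum Python version stated in the requirements above.
MINIMUM_VERSION = (3, 6)

def meets_minimum_version(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the given (or current) interpreter version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the (major, minor) pair.
    return tuple(version_info[:2]) >= tuple(minimum)

current = ".".join(str(part) for part in sys.version_info[:3])
if meets_minimum_version():
    print(f"Python {current} meets the 3.6 minimum.")
else:
    print(f"Python {current} is older than 3.6; install a newer version.")
```

Run this with the same interpreter (python or python3) that you plan to use for the rest of the article.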

The following steps show how to create a Pulumi Databricks project with Python. For a tutorial from a purely cloud provider-first perspective instead, see Get Started with Google Cloud in the Pulumi documentation. For a tutorial from a programming language-first perspective instead, see Python, Node.js (JavaScript, TypeScript), Go, and .NET (C#, VB, F#) in the Pulumi documentation.

Step 1: Create a Pulumi project

In this step, on your local development machine you set up the necessary directory structure for a Pulumi project. You then create your Pulumi project within this directory structure.

  1. From your terminal or with PowerShell, create an empty directory, and then switch to it, for example:

    On Unix, Linux, or macOS:

    mkdir pulumi-demo
    cd pulumi-demo

    On Windows:

    md pulumi-demo
    cd pulumi-demo
    
  2. Install Pulumi by running the following command, depending on your operating system:

    Install Pulumi on Unix or Linux by using curl:

    curl -fsSL https://get.pulumi.com | sh
    

    Install Pulumi on macOS by using Homebrew:

    brew install pulumi/tap/pulumi
    

    Install Pulumi on Windows by using PowerShell with elevated permissions through the Chocolatey package manager:

    choco install pulumi
    

    For alternative Pulumi installation options, see Download and Install in the Pulumi documentation.

  3. Create a basic Python Pulumi project by running the following command:

    pulumi new python
    

    Tip

    You can also create a Pulumi project from your Pulumi account online (Projects > Create project). However, there is no project template for Databricks.

  4. If prompted, press your Enter key and then use your web browser to sign in to your Pulumi account online, if you are not already signed in. After you sign in, return to your terminal or PowerShell.

  5. When prompted for a project name, accept the default project name of pulumi-demo by pressing Enter.

  6. When prompted for a project description, enter A demo Python Pulumi Databricks project and press Enter.

  7. When prompted for a stack name, accept the default stack name of dev by pressing Enter. Pulumi creates the following files and subdirectory in your pulumi-demo directory:

    • Pulumi.yaml, which is a list of settings for your Pulumi project.

    • __main__.py, which contains the Python code that you write for your Pulumi project.

    • requirements.txt, which is a list of supporting Python code packages that Pulumi installs for your project.

    • .gitignore, which is a list of files and directories that Git ignores if you want to push this project into a remote Git repository.

    • venv, which is a subdirectory that contains the supporting Python virtual environment code that Pulumi uses for your project.

  8. Perform an initial deployment of your project’s dev stack by running the following command:

    pulumi up
    
  9. When prompted to perform this update, press your up arrow key to navigate to yes and then press Enter.

  10. Copy the View Live link that appears and paste it into your web browser’s address bar, which takes you to your Pulumi account online. The dev stack’s activity details for your pulumi-demo project appear. There is not much to see right now, because there aren’t any resources in your stack yet. You create these resources in the next step.
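
To confirm that the steps above left the project in the expected state, a short check such as the following can help. It is a sketch only: the helper names are hypothetical, and the list of expected entries is taken from this article, so other Pulumi versions may lay the project out differently.

```python
import shutil
from pathlib import Path

# Files and subdirectory that this article says `pulumi new python` creates.
EXPECTED_ENTRIES = ["Pulumi.yaml", "__main__.py", "requirements.txt", ".gitignore", "venv"]

def missing_project_entries(project_dir):
    """Return the expected entries that are absent from `project_dir`."""
    root = Path(project_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]

def pulumi_cli_on_path():
    """Return True if an executable named `pulumi` can be found on PATH."""
    return shutil.which("pulumi") is not None

print("Pulumi CLI found:", pulumi_cli_on_path())
print("Missing entries:", missing_project_entries("."))
```

Run the script from your pulumi-demo directory. An empty list of missing entries suggests that the project template was generated as described.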

Step 2: Create Databricks resources

In this step, you use the Pulumi Databricks resource provider to create, in your existing Databricks workspace, a notebook and a job to run that notebook.

  1. In the __main__.py file that Pulumi generated, use your preferred text editor or integrated development environment (IDE) to enter the following code. This code declares the Pulumi Databricks Notebook and Job resources and their settings:

    """A Python Pulumi program"""
    
    import pulumi
    from pulumi_databricks import *
    from base64 import b64encode
    
    # Get the authenticated user's workspace home directory path and email address.
    # See https://www.pulumi.com/registry/packages/databricks/api-docs/getcurrentuser
    current_user       = get_current_user()
    user_home_path     = current_user.home
    user_email_address = current_user.user_name
    
    # Define the name prefix to prepend to the resource names that are created
    # for the Notebook and Job resources. To do this, you can use a Pulumi
    # configuration value instead of hard-coding the name prefix in this file.
    #
    # To set a Pulumi configuration value, run the following command, which sets
    # a "resource-prefix" configuration value to "pulumi-demo" in the
    # associated "Pulumi.<stack-name>.yaml" configuration file:
    #
    # pulumi config set resource-prefix "pulumi-demo"
    #
    # For more information about defining and retrieving configuration values, see
    # https://www.pulumi.com/docs/intro/concepts/config
    config = pulumi.Config()
    resource_prefix = config.require('resource-prefix')
    
    # Define cluster resource settings.
    node_type = config.require('node-type')
    
    
    # Create a Notebook resource.
    # See https://www.pulumi.com/registry/packages/databricks/api-docs/notebook
    # This example adds a single cell to the notebook, which is constructed from
    # a single base64-encoded string. In practice, you would replace this:
    #
    # language       = "PYTHON",
    # content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
    #
    # With this:
    #
    # source         = "path/to/local/my-notebook.py"
    #
    # Doing so makes it easier and faster to provide more notebook content. Also,
    # the notebook's language is automatically detected. If you specify a notebook
    # path, be sure that it does not end in .ipynb, as Pulumi relies on the workspace
    # import API, which doesn't rely on any specific extensions such as .ipynb in the
    # notebook path.
    notebook = Notebook(
      resource_name  = f"{resource_prefix}-notebook",
      path           = f"{user_home_path}/Pulumi/{resource_prefix}-notebook.py",
      language       = "PYTHON",
      content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
    )
    
    # Export the URL of the Notebook, so that you can easily browse to it later.
    # See https://www.pulumi.com/docs/intro/concepts/stack/#outputs
    pulumi.export('Notebook URL', notebook.url)
    
    # Create a Job resource.
    # See https://www.pulumi.com/registry/packages/databricks/api-docs/job
    # This job uses the most recent Databricks Runtime long-term support (LTS)
    # runtime programmatic version ID at the time this article was first published,
    # which is 10.4.x-scala2.12. You can replace this with a later version.
    job = Job(
      resource_name = f"{resource_prefix}-job",
      new_cluster   = JobNewClusterArgs(
        num_workers   = 1,
        spark_version = "10.4.x-scala2.12",
        node_type_id  = node_type
      ),
      notebook_task = JobNotebookTaskArgs(
        notebook_path = f"{user_home_path}/Pulumi/{resource_prefix}-notebook.py"
      ),
      email_notifications = JobEmailNotificationsArgs(
        on_successes = [ user_email_address ],
        on_failures  = [ user_email_address ]
      )
    )
    
    # Export the URL of the Job, so that you can easily browse to it later.
    # See https://www.pulumi.com/docs/intro/concepts/stack/#outputs
    pulumi.export('Job URL', job.url)
    
  2. Define a configuration value named resource-prefix, and set it to the hard-coded value of pulumi-demo, by running the following command. Pulumi uses this configuration value to name the notebook and job:

    pulumi config set resource-prefix "pulumi-demo"
    

    Pulumi creates a file named Pulumi.dev.yaml in the same directory as the __main__.py file and adds the following code to this YAML file:

    config:
      pulumi-demo:resource-prefix: pulumi-demo
    

    Using configuration values enables your code to be more modular and reusable. Now someone else can reuse your __main__.py file and define a different value for the resource-prefix configuration value without changing the contents of the __main__.py file.

  3. Define a configuration value named node-type, and set it to the following hard-coded value, by running the following command. Pulumi uses this configuration value to determine the node type of the cluster that the job runs on:

    pulumi config set node-type "n1-standard-4"
    

    The contents of the Pulumi.dev.yaml file now look like this:

    config:
      pulumi-demo:node-type: n1-standard-4
      pulumi-demo:resource-prefix: pulumi-demo
    
  4. To enable Pulumi to authenticate with your Databricks workspace, define two Databricks-specific configuration values by running the following commands. In these commands:

    • Replace <workspace-instance-url> with your workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com.

    • Replace <access-token> with your access token’s value. Be sure to specify the --secret option. This instructs Pulumi to encrypt your access token as a security best practice.

      Note

      By default, Pulumi uses a per-stack encryption key managed by the Pulumi Service, and a per-value salt, to encrypt values. To use an alternative encryption provider, see Configuring Secrets Encryption in the Pulumi documentation.

    pulumi config set databricks:host "<workspace-instance-url>"
    pulumi config set databricks:token "<access-token>" --secret
    

    The contents of the Pulumi.dev.yaml file now look like this:

    config:
      databricks:host: <your-workspace-instance-url>
      databricks:token:
        secure: <an-encrypted-version-of-your-access-token>
      pulumi-demo:node-type: n1-standard-4
      pulumi-demo:resource-prefix: pulumi-demo
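
The namespaced keys in Pulumi.dev.yaml relate to the config.require calls in __main__.py roughly as sketched below. This is a standard-library illustration, not Pulumi's implementation: the StackConfigSketch class and the in-memory dictionary are hypothetical stand-ins for pulumi.Config and the YAML file.

```python
# Hypothetical stand-in for the "config" mapping in Pulumi.dev.yaml.
STACK_CONFIG = {
    "pulumi-demo:node-type": "n1-standard-4",
    "pulumi-demo:resource-prefix": "pulumi-demo",
    "databricks:host": "<your-workspace-instance-url>",
}

class StackConfigSketch:
    """Simplified illustration of namespaced configuration lookups."""

    def __init__(self, namespace, values):
        self.namespace = namespace
        self.values = values

    def get(self, key):
        # Keys are stored as "<namespace>:<key>".
        return self.values.get(f"{self.namespace}:{key}")

    def require(self, key):
        # Fail loudly when a required value has not been set.
        value = self.get(key)
        if value is None:
            raise KeyError(f"Missing required configuration value '{self.namespace}:{key}'")
        return value

config = StackConfigSketch("pulumi-demo", STACK_CONFIG)
print(config.require("resource-prefix"))
print(config.require("node-type"))
```

In a real program, pulumi.Config() with no argument uses the project name as its namespace, which is why the values that you set without a prefix are stored under pulumi-demo, while the provider settings are stored under databricks.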
    

Step 3: Deploy the resources

In this step, you activate a Python virtual environment that Pulumi provides for your project as part of running the Pulumi Python project template. This virtual environment helps ensure that you are using the correct version of Python, Pulumi, and the Pulumi Databricks resource provider together. There are several Python virtual environment frameworks available, such as venv, virtualenv, and pipenv. This article and the Pulumi Python project template use venv. venv is already included with Python. For more information, see Creating Virtual Environments.

  1. Activate the Python virtual environment by running the following command from your pulumi-demo directory, depending on your operating system and shell type:

    Platform             Shell             Command to activate virtual environment
    -------------------  ----------------  ---------------------------------------
    Unix, Linux, macOS   bash/zsh          source venv/bin/activate
                         fish              source venv/bin/activate.fish
                         csh/tcsh          source venv/bin/activate.csh
                         PowerShell Core   venv/bin/Activate.ps1
    Windows              cmd.exe           venv\Scripts\activate.bat
                         PowerShell        venv\Scripts\Activate.ps1

  2. Install the Pulumi Databricks resource provider from the Python Package Index (PyPI) into your virtual environment by running the following command:

    pip install pulumi-databricks
    

    Note

    Some installations of pip may require you to use pip3 instead of pip. If so, substitute pip3 for pip throughout this article.

  3. Preview the resources that Pulumi will create by running the following command:

    pulumi preview
    

    If there are any errors reported, fix them and then run the command again.

    To view a detailed report in your Pulumi account online of what Pulumi will do, copy the View Live link that appears and paste it into your web browser’s address bar.

  4. Create and deploy the resources to your Databricks workspace by running the following command:

    pulumi up
    
  5. When prompted to perform this update, press your up arrow key to navigate to yes and then press Enter. If there are any errors reported, fix them and then run the command again.

  6. To view a detailed report in your Pulumi account online of what Pulumi did, copy the View Live link that appears and paste it into your web browser’s address bar.
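
If pulumi preview or pulumi up reports that the provider package cannot be found, it may help to confirm that the virtual environment is active and that the package is importable. The following sketch uses only the Python standard library; the helper names are hypothetical.

```python
import importlib.util
import sys

def in_virtual_environment():
    """Return True if the interpreter is running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

def provider_installed(module_name="pulumi_databricks"):
    """Return True if the named top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None

print("Virtual environment active:", in_virtual_environment())
print("pulumi_databricks importable:", provider_installed())
```

Both values are expected to be True after you activate the virtual environment and run pip install pulumi-databricks.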

Step 4: Interact with the resources

In this step, you run the job in your Databricks workspace, which runs the specified notebook.

  1. To view the notebook that the job will run in your workspace, copy the Notebook URL link that appears and paste it into your web browser’s address bar.

  2. To view the job that runs the notebook in your workspace, copy the Job URL link that appears and paste it into your web browser’s address bar.

  3. To run the job, click the Run now button on the job page.

  4. After the job finishes running, to view the job run’s results, in the Completed runs (past 60 days) list on the job page, click the most recent time entry in the Start time column. The Output pane shows the result of running the notebook’s code, which prints the numbers 0 through 9.
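
As a rough plain-Python analogy for the notebook's output (Spark is not needed to run it), spark.range(10) produces a single column named id whose values start at 0 and stop before 10. The rows list below is a hypothetical stand-in for that DataFrame.

```python
# Plain-Python stand-in for the notebook cell display(spark.range(10)).
# spark.range(10) yields a single column named "id" holding 0 through 9,
# the same bounds that Python's built-in range(10) uses.
rows = [{"id": value} for value in range(10)]

for row in rows:
    print(row["id"])
```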

(Optional) Step 5: Make changes to a resource

In this optional step, you change the notebook’s code, redeploy the changed notebook, and then use the job to rerun the changed notebook.

If you do not want to make any changes to the notebook, skip ahead to Step 6: Clean up.

  1. Back in the __main__.py file, change this line of code:

    content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
    

    To this, and then save the file:

      content_base64 = b64encode(b'''
    data = [
             { "Category": 'A', "ID": 1, "Value": 121.44 },
             { "Category": 'B', "ID": 2, "Value": 300.01 },
             { "Category": 'C', "ID": 3, "Value": 10.99 },
             { "Category": 'E', "ID": 4, "Value": 33.87}
           ]
    
    df = spark.createDataFrame(data)
    
    display(df)
    ''').decode("UTF-8")
    

    This change instructs the notebook to print the contents of the specified DataFrame instead of the numbers 0 through 9.

    Note

    Make sure that the lines of code beginning with data and ending with ''').decode("UTF-8") are flush with the left edge of your code editor. Otherwise, the extra leading whitespace becomes part of the notebook's content, which may cause the new Python code to fail to run.

  2. Optionally, preview the resource that Pulumi will change by running the following command:

    pulumi preview
    

    If there are any errors reported, fix them and then run the command again.

    To view a detailed report in your Pulumi account online of what Pulumi will do, copy the View Live link that appears and paste it into your web browser’s address bar.

  3. Deploy the resource change to your Databricks workspace by running the following command:

    pulumi up
    
  4. When prompted to perform this update, press your up arrow key to navigate to yes and then press Enter. If there are any errors reported, fix them and then run the command again.

  5. To view a detailed report in your Pulumi account online of what Pulumi did, copy the View Live link that appears and paste it into your web browser’s address bar.

  6. To view the changed notebook in your workspace, copy the Notebook URL link that appears and paste it into your web browser’s address bar.

  7. To rerun the job with the changed notebook, copy the Job URL link that appears and paste it into your web browser’s address bar. Then click the Run now button on the job page.

  8. After the job finishes running, to view the job run’s results, in the Completed runs (past 60 days) list on the job page, click the most recent time entry in the Start time column. The Output pane shows the result of running the notebook’s code, which prints the contents of the specified DataFrame.
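
One way to avoid the flush-left requirement in this step, offered here as an alternative rather than as this article's method, is to build the notebook source with textwrap.dedent from the Python standard library and then encode it. The sketch below shortens the sample data and assumes that you would pass content_base64 to the Notebook resource exactly as before.

```python
from base64 import b64decode, b64encode
from textwrap import dedent

# Written with indentation for readability. dedent() strips the common
# leading whitespace so that the notebook source starts at column 0.
notebook_source = dedent('''\
    data = [
             { "Category": 'A', "ID": 1, "Value": 121.44 },
             { "Category": 'B', "ID": 2, "Value": 300.01 }
           ]

    df = spark.createDataFrame(data)

    display(df)
    ''')

# b64encode works on bytes, so encode the string before encoding to base64.
content_base64 = b64encode(notebook_source.encode("UTF-8")).decode("UTF-8")

# Decode again to inspect exactly what the notebook would contain.
decoded = b64decode(content_base64).decode("UTF-8")
print(decoded)
```

Decoding the value before you deploy is a quick way to check that no unintended indentation ended up in the notebook's content.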

Step 6: Clean up

In this step, you instruct Pulumi to remove the notebook and job from your Databricks workspace as well as remove the pulumi-demo project and its dev stack from your Pulumi account online.

  1. Remove the resources from your Databricks workspace by running the following command:

    pulumi destroy
    
  2. When prompted to perform this removal, press your up arrow key to navigate to yes and then press Enter.

  3. Remove the Pulumi pulumi-demo project and its dev stack from your Pulumi account online by running the following command:

    pulumi stack rm dev
    
  4. When prompted to perform this removal, type dev and then press Enter.

  5. To deactivate the venv Python virtual environment, run the following command:

    deactivate