Terraform CDK Databricks Provider

Note

This article covers the Cloud Development Kit for Terraform (CDKTF), which is neither provided nor supported by Databricks. To contact the provider, see the Terraform Community.

This article shows you how to use Python along with the Terraform CDK Databricks Provider and the Cloud Development Kit for Terraform (CDKTF). The CDKTF is a third-party, infrastructure as code (IaC) platform that enables you to create, deploy, and manage Databricks resources by using familiar programming languages, tools, and engineering practices. Although this article shows you how to use Python, the CDKTF supports additional languages such as TypeScript, Java, C#, and Go.

The Terraform CDK Databricks provider is based on the Databricks Terraform provider. For more information, see Terraform Cloud. The CDKTF is based on the AWS Cloud Development Kit (AWS CDK).

Requirements

You must have a Databricks workspace, as this article deploys resources into an existing workspace.

On your local development machine, you must have the following installed:

  • Terraform, version 1.1 or higher. To check whether you have Terraform installed, and to check the installed version, run the command terraform -v from your terminal or with PowerShell. Install Terraform, if you do not have it already installed.

    terraform -v
    
  • Node.js, version 16.13 or higher, and npm. To check whether you have Node.js and npm installed, and to check the installed versions, run the commands node -v and npm -v. The latest versions of Node.js already include npm. Install Node.js and npm by using Node Version Manager (nvm), if you do not have Node.js and npm already installed.

    node -v
    npm -v
    
  • The CDKTF CLI. To check whether you have the CDKTF CLI installed, and to check the installed version, run the command cdktf --version. Install the CDKTF CLI by using npm, if you do not have it already installed.

    cdktf --version
    

    Tip

    You can also install the CDKTF CLI on macOS with Homebrew. See Install CDKTF.

  • Python version 3.7 or higher and pipenv version 2021.5.29 or higher. To check whether you have Python and pipenv installed, and to check the installed versions, run the commands python --version and pipenv --version. Install Python and install pipenv, if they are not already installed.

    python --version
    pipenv --version
    
  • Databricks authentication configured for the supported authentication type that you want to use. See Authentication in the Databricks Terraform provider documentation.
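
The version checks in this list can also be scripted. The following is a minimal sketch and is not part of the CDKTF or Databricks tooling: the helper names `parse_version` and `meets_minimum` are hypothetical, and the minimum versions are copied from the requirements above. It extracts the version number from the text that a tool prints and compares it with the documented minimum.

```python
import re

# Minimum versions copied from the requirements list in this article.
MINIMUM_VERSIONS = {
    "terraform": (1, 1),
    "node": (16, 13),
    "python": (3, 7),
    "pipenv": (2021, 5, 29),
}

def parse_version(text):
    """Extract the first dotted version number from a tool's version output."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))

def meets_minimum(tool, version_output):
    """Return True if the reported version is at least the documented minimum."""
    return parse_version(version_output) >= MINIMUM_VERSIONS[tool]

# Example strings in the style that these tools print (illustrative values).
print(meets_minimum("terraform", "Terraform v1.5.7"))
print(meets_minimum("python", "Python 3.6.9"))
```

To check a real installation, you could pass in the captured output of a command, for example `subprocess.run(["terraform", "-v"], capture_output=True, text=True).stdout`.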

Step 1: Create a CDKTF project

In this step, on your local development machine you set up the necessary directory structure for a CDKTF project. You then create your CDKTF project within this directory structure.

  1. Create an empty directory for your CDKTF project, and then switch to it. Run the following commands in your terminal (Linux or macOS):

    mkdir cdktf-demo
    cd cdktf-demo
    
    Or run the following commands with PowerShell (Windows):

    md cdktf-demo
    cd cdktf-demo
    
  2. Create a CDKTF project by running the following command:

    cdktf init --template=python --local
    
  3. When prompted for a Project Name, accept the default project name of cdktf-demo by pressing Enter.

  4. When prompted for a Project Description, accept the default project description by pressing Enter.

  5. If prompted Do you want to start from an existing Terraform project, enter N and press Enter.

  6. If prompted Do you want to send crash reports to the CDKTF team, enter n and press Enter.

The CDKTF creates the following files and subdirectories in your cdktf-demo directory:

  • .gitignore, which is a list of files and directories that Git ignores if you want to push this project into a remote Git repository.

  • cdktf.json, which contains configuration settings for your CDKTF project. See Configuration File for more information on configuration settings.

  • help, which contains information about some next steps you can take to work with your CDKTF project.

  • main-test.py, which contains supporting unit tests that you can write for your CDKTF project. See Unit Tests for more information on unit testing.

  • main.py, which contains the Python code that you write for your CDKTF project.

  • Pipfile and Pipfile.lock, which manage code dependencies for your CDKTF project.
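
To see which settings the generated cdktf.json file holds, you can read it as ordinary JSON. The following is a minimal sketch: the sample content is hand-written for illustration, and the `language` and `app` keys are assumptions about what the Python template generates, so check them against your own file. The helper name `summarize_config` is hypothetical.

```python
import json

# Illustrative cdktf.json content; your generated file may contain
# different keys and values.
SAMPLE_CDKTF_JSON = """
{
  "language": "python",
  "app": "pipenv run python main.py"
}
"""

def summarize_config(raw_json):
    """Return the language and app command found in cdktf.json text."""
    config = json.loads(raw_json)
    return {
        "language": config.get("language"),
        "app": config.get("app"),
    }

summary = summarize_config(SAMPLE_CDKTF_JSON)
print(summary["language"])
print(summary["app"])
```

For a real project, you could pass in the file's text instead, for example `summarize_config(open("cdktf.json").read())`.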

Step 2: Define resources

In this step, you use the Terraform CDK Databricks provider to define a notebook and a job to run that notebook.

  1. Install the project dependencies. Use pipenv to install the Terraform CDK Databricks Provider into your CDKTF project so that you can define Databricks resources. To do this, run the following:

    pipenv install cdktf-cdktf-provider-databricks
    
  2. Replace the contents of the main.py file with the following code. This code authenticates the CDKTF with your Databricks workspace, then generates a notebook along with a job to run the notebook. To view syntax documentation for this code, see the Terraform CDK Databricks provider construct reference for Python.

    #!/usr/bin/env python
    from constructs import Construct
    from cdktf import (
      App, TerraformStack, TerraformOutput
    )
    from cdktf_cdktf_provider_databricks import (
      data_databricks_current_user,
      job, notebook, provider
    )
    import vars
    from base64 import b64encode
    
    class MyStack(TerraformStack):
      def __init__(self, scope: Construct, ns: str):
        super().__init__(scope, ns)
    
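        # Configure the Databricks provider. With no extra arguments, the
        # authentication settings come from your environment or configuration profile.
        provider.DatabricksProvider(
          scope = self,
          id    = "databricksAuth"
        )
    
        # Look up the current user to build the notebook path and the
        # email notification addresses.
        current_user = data_databricks_current_user.DataDatabricksCurrentUser(
          scope     = self,
          id_       = "currentUser"
        )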
        
        # Define the notebook.
        my_notebook = notebook.Notebook(
          scope          = self,
          id_            = "notebook",
          path           = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
          language       = "PYTHON",
          content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
        )
    
        # Define the job to run the notebook.
        my_job = job.Job(
          scope = self,
          id_ = "job",
          name = f"{vars.resource_prefix}-job",
          task = [ 
            job.JobTask(
              task_key = f"{vars.resource_prefix}-task",
              new_cluster = job.JobTaskNewCluster(
                num_workers   = vars.num_workers,
                spark_version = vars.spark_version,
                node_type_id  = vars.node_type_id
              ),
              notebook_task = job.JobTaskNotebookTask(
                notebook_path = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py"
              ),
              email_notifications = job.JobTaskEmailNotifications(
                on_success = [ current_user.user_name ],
                on_failure = [ current_user.user_name ]
              )
            )
          ]
        )
    
        # Output the notebook and job URLs.
        TerraformOutput(
          scope = self,
          id    = "Notebook URL",
          value = my_notebook.url
        )
    
        TerraformOutput(
          scope = self,
          id    = "Job URL",
          value = my_job.url
        )
    
    app = App()
    MyStack(app, "cdktf-demo")
    app.synth()
    
  3. Create a file named vars.py in the same directory as main.py. Replace the following values with your own values to specify a resource prefix and cluster settings such as the number of workers, Spark runtime version string, and node type.

    #!/usr/bin/env python
    resource_prefix = "cdktf-demo"
    num_workers     = 1
    spark_version   = "14.3.x-scala2.12"
    node_type_id    = "n1-standard-4"
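
If you prefer not to edit vars.py for each environment, one option is to let environment variables override the defaults. The following is a minimal sketch of that idea, not something the CDKTF requires: the function name `load_settings` and the `CDKTF_DEMO_*` variable names are hypothetical, and the defaults are the values shown in this step.

```python
import os

def load_settings(env=None):
    """Build the cluster settings, letting environment variables override defaults."""
    env = os.environ if env is None else env
    return {
        # The CDKTF_DEMO_* names are illustrative, not a CDKTF convention.
        "resource_prefix": env.get("CDKTF_DEMO_PREFIX", "cdktf-demo"),
        "num_workers": int(env.get("CDKTF_DEMO_NUM_WORKERS", "1")),
        "spark_version": env.get("CDKTF_DEMO_SPARK_VERSION", "14.3.x-scala2.12"),
        "node_type_id": env.get("CDKTF_DEMO_NODE_TYPE_ID", "n1-standard-4"),
    }

# Simulate an environment that overrides only the number of workers.
settings = load_settings({"CDKTF_DEMO_NUM_WORKERS": "2"})
print(settings)
```

Passing a plain dictionary, as in the last lines, makes the function easy to test without changing your real environment.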
    

Step 3: Deploy the resources

In this step, you use the CDKTF CLI to deploy, into your existing Databricks workspace, the defined notebook and the job to run that notebook.

  1. Generate the Terraform code equivalent for your CDKTF project. To do this, run the cdktf synth command.

    cdktf synth
    
  2. Before making changes, you can review the pending resource changes. Run the following:

    cdktf diff
    
  3. Deploy the notebook and job by running the cdktf deploy command.

    cdktf deploy
    
  4. When prompted to Approve, press Enter. Terraform creates and deploys the notebook and job into your workspace.
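
The output of cdktf synth is Terraform configuration in JSON format, so you can inspect it with a few lines of Python before you deploy. The following is a minimal sketch that assumes the synthesized file is written to cdktf.out/stacks/cdktf-demo/cdk.tf.json and that resources appear under a top-level `resource` key; the helper names and the sample dictionary are hypothetical stand-ins.

```python
import json
from pathlib import Path

def list_resources(terraform_json):
    """Return sorted (resource_type, resource_name) pairs from Terraform JSON."""
    resources = terraform_json.get("resource", {})
    return sorted(
        (resource_type, name)
        for resource_type, instances in resources.items()
        for name in instances
    )

def load_synthesized_stack(path):
    """Read a synthesized stack file, assumed to be JSON, from the given path."""
    return json.loads(Path(path).read_text())

# Hand-written stand-in for the synthesized output of this article's stack.
example = {
    "resource": {
        "databricks_notebook": {"notebook": {"language": "PYTHON"}},
        "databricks_job": {"job": {"name": "cdktf-demo-job"}},
    }
}
print(list_resources(example))
```

For a real project, you could call `list_resources(load_synthesized_stack("cdktf.out/stacks/cdktf-demo/cdk.tf.json"))` after running cdktf synth, adjusting the path if your output directory differs.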

Step 4: Interact with the resources

In this step, you run the job in your Databricks workspace, which runs the specified notebook.

  1. To view the notebook that the job will run in your workspace, copy the Notebook URL link that appears in the output of the cdktf deploy command and paste it into your web browser’s address bar.

  2. To view the job that runs the notebook in your workspace, copy the Job URL link that appears in the output of the cdktf deploy command and paste it into your web browser’s address bar.

  3. To run the job, click the Run now button on the job page.

(Optional) Step 5: Make changes to a resource

In this optional step, you change the notebook’s code, redeploy the changed notebook, and then use the job to rerun the changed notebook.

If you do not want to make any changes to the notebook, skip ahead to Step 6: Clean up.

  1. In the main.py file, change the notebook variable declaration from the following:

    my_notebook = notebook.Notebook(
      scope          = self,
      id_            = "notebook",
      path           = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
      language       = "PYTHON",
      content_base64 = b64encode(b"display(spark.range(10))").decode("UTF-8")
    )
    

    To the following:

    my_notebook = notebook.Notebook(
      scope          = self,
      id_            = "notebook",
      path           = f"{current_user.home}/CDKTF/{vars.resource_prefix}-notebook.py",
      language       = "PYTHON",
      content_base64 = b64encode(b'''
    data = [
       { "Category": 'A', "ID": 1, "Value": 121.44 },
       { "Category": 'B', "ID": 2, "Value": 300.01 },
       { "Category": 'C', "ID": 3, "Value": 10.99 },
       { "Category": 'E', "ID": 4, "Value": 33.87}
    ]
    
    df = spark.createDataFrame(data)
    
    display(df)
    ''').decode("UTF-8")
    )
    

    Note

    Make sure that the lines of code between the triple quotes (''') are aligned with the left edge of your code editor, as shown. Otherwise, the extra leading whitespace becomes part of the notebook’s content, which may cause the new Python code to fail to run.

  2. Regenerate the Terraform code equivalent for your CDKTF project. To do this, run the following:

    cdktf synth
    
  3. Before making changes, you can review the pending resource changes. Run the following:

    cdktf diff
    
  4. Deploy the notebook changes by running the cdktf deploy command.

    cdktf deploy
    
  5. When prompted to Approve, press Enter. Terraform changes the notebook’s contents.

  6. To view the changed notebook that the job will run in your workspace, refresh the notebook that you opened earlier, or copy the Notebook URL link that appears in the output of the cdktf deploy command and paste it into your web browser’s address bar.

  7. To view the job that runs the changed notebook in your workspace, refresh the job that you opened earlier, or copy the Job URL link that appears in the output of the cdktf deploy command and paste it into your web browser’s address bar.

  8. To run the job, click the Run now button on the job page.
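
The note in step 1 about aligning the notebook code with the left edge exists because a triple-quoted string keeps any indentation inside it. As an alternative, you can indent the string with the surrounding code and strip the common indentation before encoding. The following is a minimal sketch of that approach using the standard library's textwrap.dedent; the helper names `encode_notebook_source` and `build_content` are hypothetical, and the notebook code is shortened from this step's example.

```python
from base64 import b64encode, b64decode
from textwrap import dedent

def encode_notebook_source(source):
    """Dedent the source, trim surrounding blank lines, and Base64-encode it."""
    cleaned = dedent(source).strip() + "\n"
    return b64encode(cleaned.encode("UTF-8")).decode("UTF-8")

def build_content():
    # The string below is indented along with the code that contains it.
    return encode_notebook_source('''
        data = [
          { "Category": "A", "ID": 1, "Value": 121.44 },
          { "Category": "B", "ID": 2, "Value": 300.01 }
        ]

        df = spark.createDataFrame(data)

        display(df)
    ''')

# Decode again to confirm that the common indentation was removed.
decoded = b64decode(build_content()).decode("UTF-8")
print(decoded)
```

In main.py, the result of a helper like this could be passed as the notebook's content_base64 value in place of the inline b64encode call.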

Step 6: Clean up

In this step, you use the CDKTF CLI to remove the notebook and job from your Databricks workspace.

  1. Remove the resources from your workspace by running the cdktf destroy command:

    cdktf destroy
    
  2. When prompted to Approve, press Enter. Terraform removes the resources from your workspace.

Testing

You can test your CDKTF project before you deploy it. See Unit Tests in the CDKTF documentation.

For Python-based CDKTF projects, you can write and run tests by using the Python test framework pytest along with the cdktf package’s Testing class. The following example file named test_main.py tests the CDKTF code in this article’s preceding main.py file. The first test checks whether the project’s notebook will contain the expected Base64-encoded representation of the notebook’s content. The second test checks whether the project’s job will contain the expected job name. To run these tests, run the pytest command from the project’s root directory.

from cdktf import App, Testing
from cdktf_cdktf_provider_databricks import job, notebook
from main import MyStack

class TestMain:
  # Synthesize the stack once; both tests inspect the same output.
  app = App()
  stack = MyStack(app, "cdktf-demo")
  synthesized = Testing.synth(stack)

  def test_notebook_should_have_expected_base64_content(self):
    assert Testing.to_have_resource_with_properties(
      received = self.synthesized,
      resource_type = notebook.Notebook.TF_RESOURCE_TYPE,
      properties = {
        "content_base64": "ZGlzcGxheShzcGFyay5yYW5nZSgxMCkp"
      }
    )

  def test_job_should_have_expected_job_name(self):
    assert Testing.to_have_resource_with_properties(
      received = self.synthesized,
      resource_type = job.Job.TF_RESOURCE_TYPE,
      properties = {
        "name": "cdktf-demo-job"
      }
    )
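
The expected content_base64 value in the first test does not have to be typed by hand. The following is a minimal sketch that derives it from the same source bytes that main.py encodes; the helper name `expected_content_base64` is hypothetical.

```python
from base64 import b64encode

def expected_content_base64(source):
    """Encode notebook source bytes the same way that main.py does."""
    return b64encode(source).decode("UTF-8")

# The notebook source from the original main.py in Step 2.
expected = expected_content_base64(b"display(spark.range(10))")
print(expected)
```

If you changed the notebook's code in Step 5, the first test needs the encoding of the new content instead, which a helper like this can compute from the updated source.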