Use an IDE with Databricks

You can use third-party integrated development environments (IDEs) for software development with Databricks. Some of these IDEs include the following:

You use these IDEs to do software development in programming languages that Databricks supports, including the following languages:

To demonstrate how this can work, this article describes a Python-based code sample that you can work with in any Python-compatible IDE. Specifically, this article describes how to work with this code sample in Visual Studio Code, which provides the following developer productivity features:

This article uses dbx by Databricks Labs along with Visual Studio Code to submit the code sample to a remote Databricks workspace. dbx instructs Databricks to run the submitted code on a Databricks jobs cluster in that workspace. See Orchestrate data processing workflows on Databricks.

You can use popular third-party Git providers for version control and continuous integration and continuous delivery or continuous deployment (CI/CD) of your code. For version control, these Git providers include the following:

For CI/CD, dbx supports the following CI/CD platforms:

To demonstrate how version control and CI/CD can work, this article describes how to use Visual Studio Code, dbx, and this code sample, along with GitHub and GitHub Actions.

Code sample requirements

To use this code sample, you must have the following:

Additionally, on your local development machine, you must have the following:

  • Python version 3.8 or above.

    You should use a version of Python that matches the one that is installed on your target clusters. To get the version of Python that is installed on an existing cluster, you can use the cluster’s web terminal to run the python --version command. See also the “System environment” section in the Databricks runtime releases for the Databricks Runtime version for your target clusters. In any case, the version of Python must be 3.8 or above.

    To get the version of Python that is currently referenced on your local machine, run python --version from your local terminal. (Depending on how you set up Python on your local machine, you may need to run python3 instead of python throughout this article.) See also Select a Python interpreter.

  • pip. pip is automatically installed with newer versions of Python. To check whether pip is already installed, run pip --version from your local terminal. (Depending on how you set up Python or pip on your local machine, you may need to run pip3 instead of pip throughout this article.)

  • dbx version 0.7.0 or above. You can install the dbx package from the Python Package Index (PyPI) by running pip install dbx.

    Note

    You do not need to install dbx now. You can install it later in the code sample setup section.

  • A method to create Python virtual environments to ensure you are using the correct versions of Python and package dependencies in your dbx projects. This article covers pipenv.

  • The Databricks CLI, set up with authentication.

    Note

    You do not need to install the Databricks CLI now. You can install it later in the code sample setup section. If you install it later, remember to set up authentication at that time as well.

  • Visual Studio Code.

  • The Python extension for Visual Studio Code.

  • The GitHub Pull Requests and Issues extension for Visual Studio Code.

  • Git.
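
As a quick way to review the local requirements above, the following is a minimal sketch that checks the running Python version and looks for each command-line tool on your PATH. The list of command names is an assumption based on the commands used in this article (for example, code is the Visual Studio Code launcher). dbx and the Databricks CLI may legitimately be reported as not found if you plan to install them later.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the requirements above
# Command names assumed from the commands used in this article.
TOOLS = ["pip", "pipenv", "dbx", "databricks", "git", "code"]


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor) version meets the minimum."""
    version = sys.version_info[:2] if version is None else version
    return tuple(version) >= tuple(minimum)


def find_tools(names=TOOLS):
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    print("Python >= 3.8:", python_ok())
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found'}")
```

This check only confirms that a command can be found. It does not confirm that the command's version matches your target clusters.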

About the code sample

The Python code sample for this article, available in the databricks/ide-best-practices repo in GitHub, does the following:

  1. Gets data from the owid/covid-19-data repo in GitHub.

  2. Filters the data for a specific ISO country code.

  3. Creates a pivot table from the data.

  4. Performs data cleansing on the data.

  5. Modularizes the code logic into reusable functions.

  6. Unit tests the functions.

  7. Provides dbx project configurations and settings to enable the code to write the data to a Delta table in a remote Databricks workspace.
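
To make the filter, pivot, and cleansing steps above more concrete, here is a minimal sketch that uses only the Python standard library. It is not the code in the repo. The rows, column names, and function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical rows in a long format: one row per country, date, and indicator.
ROWS = [
    {"iso_code": "USA", "date": "2021-01-01", "indicator": "Daily ICU occupancy", "value": "100"},
    {"iso_code": "USA", "date": "2021-01-01", "indicator": "Daily hospital occupancy", "value": "500"},
    {"iso_code": "USA", "date": "2021-01-02", "indicator": "Daily hospital occupancy", "value": "510"},
    {"iso_code": "DZA", "date": "2021-01-01", "indicator": "Daily ICU occupancy", "value": "7"},
]


def filter_country(rows, iso_code):
    """Keep only the rows for one ISO country code."""
    return [row for row in rows if row["iso_code"] == iso_code]


def pivot_by_date(rows):
    """Pivot long rows into {date: {indicator: value}}."""
    table = defaultdict(dict)
    for row in rows:
        table[row["date"]][row["indicator"]] = float(row["value"])
    return dict(table)


def fill_missing(table, default=0.0):
    """Cleanse the pivot table so every date has every indicator."""
    indicators = sorted({name for cols in table.values() for name in cols})
    return {
        date: {name: cols.get(name, default) for name in indicators}
        for date, cols in table.items()
    }


if __name__ == "__main__":
    usa = fill_missing(pivot_by_date(filter_country(ROWS, "USA")))
    for date, cols in sorted(usa.items()):
        print(date, cols)
```

Keeping each step in its own function is what makes the logic reusable and unit testable, which is the point of steps 5 and 6 in the list above.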

Set up the code sample

After you have the requirements in place for this code sample, complete the following steps to begin using the code sample.

Note

These steps do not include setting up this code sample for CI/CD. You do not need to set up CI/CD to run this code sample. If you want to set up CI/CD later, see Run with GitHub Actions.

Step 1: Create a Python virtual environment

  1. From your terminal, create a blank folder to contain a virtual environment for this code sample. These instructions use a parent folder named ide-demo. You can give this folder any name you want. If you use a different name, replace the name throughout this article. After you create the folder, switch to it, and then start Visual Studio Code from that folder. Be sure to include the dot (.) after the code command.

    For Linux and macOS:

    mkdir ide-demo
    cd ide-demo
    code .
    

    Tip

    If you get the error command not found: code, see Launching from the command line on the Microsoft website.

    For Windows:

    md ide-demo
    cd ide-demo
    code .
    
  2. In Visual Studio Code, on the menu bar, click View > Terminal.

  3. From the root of the ide-demo folder, run the pipenv command with the following option, where <version> is the target version of Python that you already have installed locally (and, ideally, a version that matches your target clusters’ version of Python), for example 3.8.10.

    pipenv --python <version>
    

    Make a note of the Virtualenv location value in the output of the pipenv command, as you will need it in the next step.

  4. Select the target Python interpreter, and then activate the Python virtual environment:

    1. On the menu bar, click View > Command Palette, type Python: Select, and then click Python: Select Interpreter.

    2. Select the Python interpreter within the path to the Python virtual environment that you just created. (This path is listed as the Virtualenv location value in the output of the pipenv command.)

    3. On the menu bar, click View > Command Palette, type Terminal: Create, and then click Terminal: Create New Terminal.

    4. Make sure that the command prompt indicates that you are in the pipenv shell. To confirm, you should see something like (<your-username>) before your command prompt. If you do not see it, run the following command:

      pipenv shell
      

      To exit the pipenv shell, run the command exit, and the parentheses disappear.

    For more information, see Using Python environments in VS Code in the Visual Studio Code documentation.
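
If you are unsure whether the terminal is using the virtual environment from this step, the following minimal sketch is one way to check from Python itself. It relies on the standard sys.prefix and sys.base_prefix attributes. The function name is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the Python installation it came from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Virtual environment active:", in_virtual_environment())
```

When the pipenv shell is active, the printed interpreter path should sit under the Virtualenv location value that pipenv reported.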

Step 2: Clone the code sample from GitHub

  1. In Visual Studio Code, open the ide-demo folder (File > Open Folder), if it is not already open.

  2. Click View > Command Palette, type Git: Clone, and then click Git: Clone.

  3. For Provide repository URL or pick a repository source, enter https://github.com/databricks/ide-best-practices

  4. Browse to your ide-demo folder, and click Select Repository Location.

Step 3: Install the code sample’s dependencies

  1. Install a version of dbx and the Databricks CLI that is compatible with your version of Python. To do this, in Visual Studio Code from your terminal, from your ide-demo folder with a pipenv shell activated (pipenv shell), run the following command:

    pip install dbx
    
  2. Confirm that dbx is installed. To do this, run the following command:

    dbx --version
    

    If the version number is returned, dbx is installed.

    If the version number is below 0.7.0, upgrade dbx by running the following command, and then check the version number again:

    pip install dbx --upgrade
    dbx --version
    
    # Or ...
    python -m pip install dbx --upgrade
    dbx --version
    
  3. When you install dbx, the Databricks CLI is also automatically installed. To confirm that the Databricks CLI is installed, run the following command:

    databricks --version
    

    If the version number is returned, the Databricks CLI is installed.

  4. If you have not set up the Databricks CLI with authentication, you must do it now. To confirm that authentication is set up, run the following basic command to get some summary information about your Databricks workspace. Be sure to include the forward slash (/) after the ls subcommand:

    databricks workspace ls /
    

    If a list of root-level folder names for your workspace is returned, authentication is set up.

  5. Install the Python packages that this code sample depends on. To do this, run the following command from the ide-demo/ide-best-practices folder:

    pip install -r unit-requirements.txt
    
  6. Confirm that the code sample’s dependent packages are installed. To do this, run the following command:

    pip list
    

    If the packages that are listed in the requirements.txt and unit-requirements.txt files are somewhere in this list, the dependent packages are installed.

    Note

    The packages listed in requirements.txt are pinned to specific versions. For better compatibility, you can cross-reference these versions with the Databricks Runtime version that you want your Databricks workspace to use for running deployments later. See the “System environment” section for your cluster’s Databricks Runtime version in Databricks runtime releases.
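
The version and package checks in this step can also be scripted. The following is a minimal sketch, assuming requirement lines of the form name==version. The helper names are hypothetical, and version_tuple is a simplified comparison, not a full implementation of Python version parsing rules.

```python
import itertools
from importlib import metadata


def version_tuple(text):
    """Turn a version such as '0.7.0' into (0, 7, 0) for comparison."""
    parts = []
    for piece in text.strip().split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def parse_pins(lines):
    """Collect name==version pins, skipping comments and unpinned lines."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    dbx_version = installed_version("dbx")
    if dbx_version is None:
        print("dbx is not installed")
    else:
        print("dbx", dbx_version, "ok:", version_tuple(dbx_version) >= (0, 7, 0))
```

To compare against the repo's pins, you could pass the lines of requirements.txt to parse_pins and look up each name with installed_version.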

Step 4: Customize the code sample for your Databricks workspace

  1. Customize the repo’s dbx project settings. To do this, in the .dbx/project.json file, change the value of the profile object from DEFAULT to the name of the profile that matches the one that you set up for authentication with the Databricks CLI. If you did not set up any non-default profile, leave DEFAULT as is. For example:

    {
      "environments": {
        "default": {
          "profile": "DEFAULT",
          "storage_type": "mlflow",
          "properties": {
            "workspace_directory": "/Shared/dbx/covid_analysis",
            "artifact_location": "dbfs:/Shared/dbx/projects/covid_analysis"
          }
        }
      },
      "inplace_jinja_support": false
    }
    
  2. Customize the dbx project’s deployment settings. To do this, in the conf/deployment.yml file, change the value of the spark_version and node_type_id objects from 10.4.x-scala2.12 and m6gd.large to the Databricks runtime version string and cluster node type that you want your Databricks workspace to use for running deployments on.

    For example, to specify Databricks Runtime 10.4 LTS and an n1-highmem-4 node type:

    environments:
      default:
        workflows:
          - name: "covid_analysis_etl_integ"
            new_cluster:
              spark_version: "10.4.x-scala2.12"
              num_workers: 1
              node_type_id: "n1-highmem-4"
            spark_python_task:
              python_file: "file://jobs/covid_trends_job.py"
          - name: "covid_analysis_etl_prod"
            new_cluster:
              spark_version: "10.4.x-scala2.12"
              num_workers: 1
              node_type_id: "n1-highmem-4"
            spark_python_task:
              python_file: "file://jobs/covid_trends_job.py"
              parameters: ["--prod"]
          - name: "covid_analysis_etl_raw"
            new_cluster:
              spark_version: "10.4.x-scala2.12"
              num_workers: 1
              node_type_id: "n1-highmem-4"
            spark_python_task:
              python_file: "file://jobs/covid_trends_job_raw.py"
    

Tip

In this example, each of these three job definitions has the same spark_version and node_type_id value. You can use different values for different job definitions. You can also create shared values and reuse them across job definitions, to reduce typing errors and code maintenance. See the YAML example in the dbx documentation.
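
If you prefer to script the profile change from step 1 instead of editing .dbx/project.json by hand, the following minimal sketch shows one way to do it with the standard json module. The function names and the profile name my-profile are hypothetical. The sample text is a trimmed copy of the project file shown above.

```python
import json

# Trimmed copy of the .dbx/project.json content shown in step 1.
EXAMPLE = """
{
  "environments": {
    "default": {
      "profile": "DEFAULT",
      "storage_type": "mlflow"
    }
  },
  "inplace_jinja_support": false
}
"""


def profile_from_json(text, environment="default"):
    """Return the profile name configured for a dbx environment."""
    config = json.loads(text)
    return config["environments"][environment]["profile"]


def with_profile(text, profile, environment="default"):
    """Return new JSON text with the environment's profile replaced."""
    config = json.loads(text)
    config["environments"][environment]["profile"] = profile
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print("Current profile:", profile_from_json(EXAMPLE))
    print(with_profile(EXAMPLE, "my-profile"))
```

To apply this to the real file, you could read the text with pathlib.Path(".dbx/project.json").read_text() and write the result back with write_text().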

Explore the code sample

After you set up the code sample, use the following information to learn about how the various files in the ide-demo/ide-best-practices folder work.

Code modularization

Unmodularized code

The jobs/covid_trends_job_raw.py file is an unmodularized version of the code logic. You can run this file by itself.

Modularized code

The jobs/covid_trends_job.py file is a modularized version of the code logic. This file relies on the shared code in the covid_analysis/transforms.py file. The covid_analysis/__init__.py file treats the covid_analysis folder as a containing package.

Testing

Unit tests

The tests/testdata.csv file contains a small portion of the data in the covid-hospitalizations.csv file for testing purposes. The tests/transforms_test.py file contains the unit tests for the covid_analysis/transforms.py file.
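
The pattern of testing a transform against a small data fixture can be sketched as follows. This is not the content of tests/transforms_test.py. The fixture text, the clean_values function, and the test names are hypothetical stand-ins, written in the style that pytest discovers (functions whose names start with test_).

```python
import csv
import io

# Hypothetical fixture, standing in for a small file such as tests/testdata.csv.
FIXTURE = """iso_code,date,indicator,value
USA,2021-01-01,Daily ICU occupancy,100
USA,2021-01-02,Daily ICU occupancy,
DZA,2021-01-01,Daily ICU occupancy,7
"""


def clean_values(rows):
    """Stand-in for a function under test: convert values, empty becomes 0.0."""
    return [{**row, "value": float(row["value"] or 0)} for row in rows]


def load_fixture(text=FIXTURE):
    """Parse the fixture text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def test_clean_values_replaces_empty_with_zero():
    cleaned = clean_values(load_fixture())
    assert [row["value"] for row in cleaned] == [100.0, 0.0, 7.0]


def test_clean_values_keeps_row_count():
    assert len(clean_values(load_fixture())) == 3
```

A small fixture keeps the tests fast and independent of the network, because the full data set does not need to be downloaded to exercise the logic.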

Unit test runner

The pytest.ini file contains configuration options for running tests with pytest. See pytest.ini and Configuration Options in the pytest documentation.

The .coveragerc file contains configuration options for Python code coverage measurements with coverage.py. See Configuration reference in the coverage.py documentation.

The requirements.txt file, which is a subset of the unit-requirements.txt file that you ran earlier with pip, contains a list of packages that the unit tests also depend on.

Packaging

The setup.py file provides commands to be run at the console (console scripts), such as the pip command, for packaging Python projects with setuptools. See Entry Points in the setuptools documentation.

Other files

There are other files in this code sample that have not been previously described:

  • The .github/workflows folder contains three files, databricks_pull_request_tests.yml, onpush.yml, and onrelease.yaml, that represent the GitHub Actions, which are covered later in the GitHub Actions section.

  • The .gitignore file contains a list of local folders and files that Git ignores for your repo.

Run the code sample

You can use dbx on your local machine to instruct Databricks to run the code sample in your remote workspace on-demand, as described in the next subsection. Or you can use GitHub Actions to have GitHub run the code sample every time you push code changes to your GitHub repo.

Run with dbx

  1. Install the contents of the covid_analysis folder as a package in Python setuptools development mode by running the following command from the root of your dbx project (for example, the ide-demo/ide-best-practices folder). Be sure to include the dot (.) at the end of this command:

    pip install -e .
    

    This command creates a covid_analysis.egg-info folder, which contains information about the compiled version of the covid_analysis/__init__.py and covid_analysis/transforms.py files.

  2. Run the tests by running the following command:

    pytest tests/
    

    The tests’ results are displayed in the terminal. All four tests should show as passing.

  3. Optionally, get test coverage metrics for your tests by running the following command:

    coverage run -m pytest tests/
    

    Note

    If a message displays that coverage cannot be found, run pip install coverage, and try again.

    To view test coverage results, run the following command:

    coverage report -m
    
  4. If all four tests pass, send the dbx project’s contents to your Databricks workspace, by running the following command:

    dbx deploy --environment=default
    

    Information about the project and its runs is sent to the location specified in the workspace_directory object in the .dbx/project.json file.

    The project’s contents are sent to the location specified in the artifact_location object in the .dbx/project.json file.

  5. Run the pre-production version of the code in your workspace, by running the following command:

    dbx launch covid_analysis_etl_integ
    

    A link to the run’s results is displayed in the terminal. It should look something like this:

    https://<your-workspace-instance-id>/?o=1234567890123456#job/123456789012345/run/12345
    

    Follow this link in your web browser to see the run’s results in your workspace.

  6. Run the production version of the code in your workspace, by running the following command:

    dbx launch covid_analysis_etl_prod
    

    A link to the run’s results is displayed in the terminal. It should look something like this:

    https://<your-workspace-instance-id>/?o=1234567890123456#job/123456789012345/run/23456
    

    Follow this link in your web browser to see the run’s results in your workspace.
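
The test, deploy, and launch sequence above can be chained so that a failing step stops the rest. The following is a minimal sketch that uses subprocess with the same commands shown in the steps above. The run_steps function and its runner parameter are hypothetical, and the sketch assumes that you run it from the root of your dbx project with the pipenv shell active.

```python
import subprocess

# Commands taken from the steps above, in order.
STEPS = [
    ["pytest", "tests/"],
    ["dbx", "deploy", "--environment=default"],
    ["dbx", "launch", "covid_analysis_etl_integ"],
]


def run_steps(steps=STEPS, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code.

    Returns the list of commands that completed successfully.
    """
    completed = []
    for command in steps:
        result = runner(command)
        if result.returncode != 0:
            break
        completed.append(command)
    return completed
```

Calling run_steps() runs the real commands. If a command is not installed, subprocess.run raises FileNotFoundError instead of returning an exit code. Passing a different runner makes the function testable without touching your workspace.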

Run with GitHub Actions

In the project’s .github/workflows folder, the onpush.yml and onrelease.yml GitHub Actions files do the following:

  • On each push to a tag that begins with v, uses dbx to deploy the covid_analysis_etl_prod job.

  • On each push that is not to a tag that begins with v:

    1. Uses pytest to run the unit tests.

    2. Uses dbx to deploy the file specified in the covid_analysis_etl_integ job to the remote workspace.

    3. Uses dbx to launch the already-deployed file specified in the covid_analysis_etl_integ job on the remote workspace, tracing this run until it finishes.
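
The branching described above can be modeled as a small function. This is only an illustration of the decision logic, not how GitHub Actions evaluates workflow triggers. The function name and the action labels are hypothetical. Pushed refs are assumed to use the full Git ref form, such as refs/tags/v1.0.0 or refs/heads/main.

```python
def actions_for_ref(git_ref):
    """Return the ordered actions that apply to a pushed Git ref."""
    if git_ref.startswith("refs/tags/v"):
        # Release path: a push to a tag that begins with v.
        return ["deploy covid_analysis_etl_prod"]
    # Every other push: test first, then deploy and launch the integ job.
    return [
        "run unit tests with pytest",
        "deploy covid_analysis_etl_integ",
        "launch covid_analysis_etl_integ",
    ]


if __name__ == "__main__":
    for ref in ("refs/heads/my-branch", "refs/tags/v1.0.0"):
        print(ref, "->", actions_for_ref(ref))
```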

Note

An additional GitHub Actions file, databricks_pull_request_tests.yml, is provided for you as a template to experiment with, without impacting the onpush.yml and onrelease.yml GitHub Actions files. You can run this code sample without the databricks_pull_request_tests.yml GitHub Actions file. Its usage is not covered in this article.

The following subsections describe how to set up and run the onpush.yml and onrelease.yml GitHub Actions files.

Set up to use GitHub Actions

Set up your Databricks workspace by adding a user to your workspace that will be used only for authenticating with your GitHub repo. After you add the user, generate a personal access token for the new user.

Note

When you create a new Databricks workspace user, you cannot associate it with the email address for your own Databricks user. Instead, see your organization’s email administrator about getting a separate email address that you can associate with this new Databricks workspace user.

See your organization’s account administrator about managing the separate email address and its associated personal access token within your organization.

As a security best practice, Databricks recommends that you use a Databricks personal access token for a unique Databricks workspace user, instead of the Databricks personal access token for your workspace user, for enabling GitHub to authenticate with your Databricks workspace.

After you create the unique Databricks workspace user and its Databricks personal access token, stop and make a note of the Databricks personal access token value, which you will use in the next section.

Run GitHub Actions

Step 1: Publish your cloned repo

  1. In Visual Studio Code, in the sidebar, click the GitHub icon. If the icon is not visible, enable the GitHub Pull Requests and Issues extension through the Extensions view (View > Extensions) first.

  2. If the Sign In button is visible, click it, and follow the on-screen instructions to sign in to your GitHub account.

  3. On the menu bar, click View > Command Palette, type Publish to GitHub, and then click Publish to GitHub.

  4. Select an option to publish your cloned repo to your GitHub account.

Step 2: Add encrypted secrets to your repo

In the GitHub website for your published repo, follow the instructions in Creating encrypted secrets for a repository, for the following encrypted secrets:

  • Create an encrypted secret named DATABRICKS_HOST, set to the value of your workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com.

  • Create an encrypted secret named DATABRICKS_TOKEN, set to the value of the Databricks personal access token for the unique Databricks workspace user.
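
Before relying on these two values in an automated run, it can help to validate them. The following minimal sketch assumes that the workflow exposes the secrets to its steps as environment variables with the same names, DATABRICKS_HOST and DATABRICKS_TOKEN. The function name is hypothetical.

```python
import os

REQUIRED = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def load_databricks_env(env=None):
    """Return (host, token) from the environment, or raise if unusable."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    host = env["DATABRICKS_HOST"].rstrip("/")
    if not host.startswith("https://"):
        raise ValueError("DATABRICKS_HOST should start with https://")
    return host, env["DATABRICKS_TOKEN"]
```

Never print the token value in logs. Checking only that it is present, as this sketch does, is usually enough to catch a misnamed secret.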

Step 3: Create and publish a branch to your repo

  1. In Visual Studio Code, in Source Control view (View > Source Control), click the (Views and More Actions) icon.

  2. Click Branch > Create Branch From.

  3. Enter a name for the branch, for example my-branch.

  4. Select the branch to create the branch from, for example main.

  5. Make a minor change to one of the files in your local repo, and then save the file. For example, make a minor change to a code comment in the tests/transforms_test.py file.

  6. In Source Control view, click the (Views and More Actions) icon again.

  7. Click Changes > Stage All Changes.

  8. Click the (Views and More Actions) icon again.

  9. Click Commit > Commit Staged.

  10. Enter a message for the commit.

  11. Click the (Views and More Actions) icon again.

  12. Click Branch > Publish Branch.

Step 4: Create a pull request and merge

  1. Go to the GitHub website for your published repo, https://github.com/<your-GitHub-username>/ide-best-practices.

  2. On the Pull requests tab, next to my-branch had recent pushes, click Compare & pull request.

  3. Click Create pull request.

  4. On the pull request page, wait for the icon next to CI pipeline / ci-pipeline (push) to display a green check mark. (It may take several minutes for the icon to appear.) If there is a red X instead of a green check mark, click Details to find out why. If the icon or Details are no longer showing, click Show all checks.

  5. If the green check mark appears, merge the pull request into the main branch by clicking Merge pull request.