Databricks Connect for Python

Note

This article covers Databricks Connect for Databricks Runtime 13.0 and above.

This article demonstrates how to quickly get started with Databricks Connect by using Python and PyCharm.

Databricks Connect enables you to connect popular IDEs such as PyCharm, notebook servers, and other custom applications to Databricks clusters. See What is Databricks Connect?.

Tutorial

To skip this tutorial and use a different IDE instead, see Next steps.

Requirements

To complete this tutorial, you must meet the following requirements:

  • Your target Databricks workspace and cluster must meet the requirements for Cluster configuration for Databricks Connect.

  • You must have your cluster ID available. To get your cluster ID, in your workspace, click Compute on the sidebar, and then click your cluster’s name. In your web browser’s address bar, copy the string of characters between clusters and configuration in the URL.

  • You have PyCharm installed.

  • You have Python 3 installed on your development machine, and the minor version of your client Python installation is the same as the minor Python version of your Databricks cluster. The following table shows the Python version installed with each Databricks Runtime.

    Databricks Runtime version        Python version
    13.0 ML - 14.3 ML, 13.0 - 14.3    3.10
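
As a quick check of two of the requirements above, the following minimal sketch pulls a cluster ID out of a cluster page URL and compares the local Python minor version with the cluster's. The helper names `extract_cluster_id` and `python_minor_matches`, the example URL, and the exact URL shape are illustrative assumptions. The 3.10 value comes from the table above.

```python
import re
import sys

# Python version per the table above (Databricks Runtime 13.0 through 14.3).
CLUSTER_PYTHON = (3, 10)


def extract_cluster_id(url):
    """Return the text between 'clusters/' and '/configuration', or None."""
    match = re.search(r"clusters/([^/?#]+)/configuration", url)
    return match.group(1) if match else None


def python_minor_matches(local=None, cluster=CLUSTER_PYTHON):
    """True when the local (major, minor) pair equals the cluster's pair."""
    local = tuple(local or sys.version_info[:2])
    return local[:2] == tuple(cluster)[:2]


if __name__ == "__main__":
    # Hypothetical URL; copy the real one from your browser's address bar.
    example = "https://<workspace-url>/#setting/clusters/0123-456789-abcdefgh/configuration"
    print("Cluster ID:", extract_cluster_id(example))
    print("Python minor version matches:", python_minor_matches())
```

Run the script with the interpreter that you plan to select as the base interpreter in PyCharm, so that the comparison reflects the environment that Databricks Connect will actually use.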

Step 1: Configure Databricks authentication

This tutorial uses Databricks OAuth user-to-machine (U2M) authentication and a Databricks configuration profile for authenticating with your Databricks workspace. To use a different authentication type instead, see Configure connection properties.

Configuring OAuth U2M authentication requires the Databricks CLI, as follows:

  1. If it is not already installed, install the Databricks CLI as follows:

    On Linux or macOS, use Homebrew to install the Databricks CLI by running the following two commands:

    brew tap databricks/tap
    brew install databricks
    

    On Windows, you can use winget, Chocolatey, or Windows Subsystem for Linux (WSL) to install the Databricks CLI. If you cannot use winget, Chocolatey, or WSL, skip this procedure and use the Command Prompt or PowerShell to install the Databricks CLI from source instead.

    Note

    Installing the Databricks CLI with Chocolatey is Experimental.

    To use winget to install the Databricks CLI, run the following two commands, and then restart your Command Prompt:

    winget search databricks
    winget install Databricks.DatabricksCLI
    

    To use Chocolatey to install the Databricks CLI, run the following command:

    choco install databricks-cli
    

    To use WSL to install the Databricks CLI:

    1. Install curl and zip through WSL. For more information, see your operating system’s documentation.

    2. Use WSL to install the Databricks CLI by running the following command:

      curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      
  2. Confirm that the Databricks CLI is installed by running the following command, which displays the current version of the installed Databricks CLI. This version should be 0.205.0 or above:

    databricks -v
    

    Note

    If you run databricks but get an error such as command not found: databricks, or if you run databricks -v and a version number of 0.18 or below is listed, this means that your machine cannot find the correct version of the Databricks CLI executable. To fix this, see Verify your CLI installation.
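
The version check in step 2 can also be scripted. The following is a minimal sketch, not part of the Databricks CLI, that extracts a version number from the output of `databricks -v` and compares it with the 0.205.0 minimum. The helper names are hypothetical, and the sample output strings in the usage note are assumptions about what the CLI prints; the parser only looks for the first `MAJOR.MINOR.PATCH` pattern.

```python
import re
import subprocess

# Minimum Databricks CLI version named in step 2.
MIN_VERSION = (0, 205, 0)


def parse_cli_version(output):
    """Extract the first MAJOR.MINOR.PATCH triple from the version output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def cli_is_new_enough(output, minimum=MIN_VERSION):
    """True when a version was found and it is at least the minimum."""
    version = parse_cli_version(output)
    return version is not None and version >= minimum


def installed_cli_output():
    """Run `databricks -v`; return its output, or None if no executable is found."""
    try:
        result = subprocess.run(
            ["databricks", "-v"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    return result.stdout + result.stderr
```

For example, `cli_is_new_enough("Databricks CLI v0.205.0")` returns `True`, while `cli_is_new_enough("Version 0.18.0")` returns `False`. Comparing integer tuples rather than strings keeps the check correct for versions such as 0.9.0, which sorts after 0.205.0 as text.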

Initiate OAuth U2M authentication, as follows:

  1. Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.

    In the following command, replace <workspace-url> with your Databricks workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com.

    databricks auth login --configure-cluster --host <workspace-url>
    
  2. The Databricks CLI prompts you to save the information that you entered as a Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.

    To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile’s existing settings, run the command databricks auth env --profile <profile-name>.

  3. In your web browser, complete the on-screen instructions to log in to your Databricks workspace.

  4. In the list of available clusters that appears in your terminal or command prompt, use your up arrow and down arrow keys to select the target Databricks cluster in your workspace, and then press Enter. You can also type any part of the cluster’s display name to filter the list of available clusters.

  5. To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:

    • databricks auth token --host <workspace-url>

    • databricks auth token -p <profile-name>

    • databricks auth token --host <workspace-url> -p <profile-name>

    If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.
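
Configuration profiles are typically stored as INI-style sections in a `.databrickscfg` file in your home directory. The following minimal sketch, which is not part of the Databricks CLI, lists the profiles found in such a file's text so that you can confirm which profile name and cluster the later steps will use. The helper name `read_profiles`, the sample host, the cluster ID, the `dev` profile, and the field names shown in the sample are illustrative assumptions.

```python
import configparser

# Hypothetical file content; a real file is written by `databricks auth login`.
SAMPLE_CFG = """
[DEFAULT]
host = https://1234567890123456.7.gcp.databricks.com
auth_type = databricks-cli
cluster_id = 0123-456789-abcdefgh

[dev]
host = https://6543210987654321.0.gcp.databricks.com
auth_type = databricks-cli
"""


def read_profiles(cfg_text):
    """Return {profile_name: {field: value}} for every section in the text."""
    # configparser normally treats [DEFAULT] as fallback values for all other
    # sections. Renaming the default section makes [DEFAULT] an ordinary profile.
    parser = configparser.ConfigParser(
        default_section="__unused__", interpolation=None
    )
    parser.read_string(cfg_text)
    return {name: dict(parser[name]) for name in parser.sections()}


if __name__ == "__main__":
    for name, fields in read_profiles(SAMPLE_CFG).items():
        print(name, fields.get("host"), fields.get("cluster_id"))
```

To inspect a real file, pass its text to `read_profiles`, for example `read_profiles(pathlib.Path.home().joinpath(".databrickscfg").read_text())` after importing `pathlib`. The commands `databricks auth profiles` and `databricks auth env --profile <profile-name>` described above remain the supported way to view profiles.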

Step 2: Create the project

  1. Start PyCharm.

  2. On the main menu, click File > New Project.

  3. For Location, click the folder icon, and complete the on-screen directions to specify the path to your new Python project.

  4. Expand Python interpreter: New environment.

  5. Click the New environment using option.

  6. In the drop-down list, select Virtualenv.

  7. Leave Location with the suggested path to the venv folder.

  8. For Base interpreter, use the drop-down list or click the ellipses to specify the path to the Python interpreter from the preceding requirements.

  9. Click Create.


Step 3: Add the Databricks Connect package

  1. On PyCharm’s main menu, click View > Tool Windows > Python Packages.

  2. In the search box, enter databricks-connect.

  3. In the PyPI repository list, click databricks-connect.

  4. In the result pane’s latest drop-down list, select the version that matches your cluster’s Databricks Runtime version. For example, if your cluster has Databricks Runtime 13.3 LTS installed, select 13.3.0.

  5. Click Install.

  6. After the package installs, you can close the Python Packages window.
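
The version-matching rule in step 4 also applies if you install the package outside PyCharm, for example with pip. The following minimal sketch turns a Databricks Runtime version label into a pip requirement string. The helper name `connect_requirement` is hypothetical, and the `.*` suffix is a standard pip version specifier that accepts any patch release of the matching minor version, which is an assumption about what you want rather than a documented recommendation.

```python
import re


def connect_requirement(runtime_version):
    """Map a runtime label such as '13.3 LTS' to a pip requirement string."""
    match = re.match(r"\s*(\d+)\.(\d+)", runtime_version)
    if match is None:
        raise ValueError(f"Unrecognized runtime version: {runtime_version!r}")
    major, minor = match.groups()
    # Pin the major and minor version; let pip choose the patch release.
    return f"databricks-connect=={major}.{minor}.*"


if __name__ == "__main__":
    print(connect_requirement("13.3 LTS"))
```

You could then pass the resulting string to pip inside the project's virtual environment, quoting it so that your shell does not expand the asterisk.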


Step 4: Add code

  1. In the Project tool window, right-click the project’s root folder, and click New > Python File.

  2. Enter main.py and double-click Python file.

  3. Enter one of the following code samples into the file, depending on the name of your configuration profile, and then save the file.

    If your configuration profile from Step 1 is named DEFAULT, enter the following code into the file, and then save the file:

    from databricks.connect import DatabricksSession
    
    spark = DatabricksSession.builder.getOrCreate()
    
    df = spark.read.table("samples.nyctaxi.trips")
    df.show(5)
    

    If your configuration profile from Step 1 is not named DEFAULT, enter the following code into the file instead. Replace the placeholder <profile-name> with the name of your configuration profile from Step 1, and then save the file:

    from databricks.connect import DatabricksSession
    
    spark = DatabricksSession.builder.profile("<profile-name>").getOrCreate()
    
    df = spark.read.table("samples.nyctaxi.trips")
    df.show(5)
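
If you switch between profiles often, the two variants above can be folded into one script. The following is a minimal sketch under stated assumptions: the helper `make_session` and the environment variable name `DEMO_DATABRICKS_PROFILE` are hypothetical and exist only in this example. The sketch relies only on the `profile` and `getOrCreate` builder calls already used above.

```python
import os


def make_session(builder, profile_name=None):
    """Call builder.profile(...) only for a non-DEFAULT profile, then getOrCreate()."""
    if profile_name and profile_name != "DEFAULT":
        builder = builder.profile(profile_name)
    return builder.getOrCreate()


def main():
    # Imported here so that make_session can be exercised without the package.
    from databricks.connect import DatabricksSession

    # DEMO_DATABRICKS_PROFILE is a hypothetical variable name for this sketch.
    profile_name = os.environ.get("DEMO_DATABRICKS_PROFILE")
    spark = make_session(DatabricksSession.builder, profile_name)

    df = spark.read.table("samples.nyctaxi.trips")
    df.show(5)
```

To use the sketch as main.py, add a call to `main()` at the end of the file. When the variable is unset or set to DEFAULT, the script behaves like the first variant above; otherwise it behaves like the second.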
    

Step 5: Run the code

  1. Start the target cluster in your remote Databricks workspace.

  2. After the cluster has started, on the main menu, click Run > Run ‘main’.

  3. In the Run tool window (View > Tool Windows > Run), in the Run tab’s main pane, the first 5 rows of the samples.nyctaxi.trips table appear.

Step 6: Debug the code

  1. With the cluster still running, in the preceding code, click the gutter next to df.show(5) to set a breakpoint.

  2. On the main menu, click Run > Debug ‘main’.

  3. In the Debug tool window (View > Tool Windows > Debug), in the Debugger tab’s Variables pane, expand the df and spark variable nodes to browse information about the code’s df and spark variables.

  4. In the Debug tool window’s sidebar, click the green arrow (Resume Program) icon.

  5. In the Debugger tab’s Console pane, the first 5 rows of the samples.nyctaxi.trips table appear.


Next steps

To learn more about Databricks Connect, see articles such as the following: