Install Databricks Connect for Python
Note
This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.
This article describes how to install Databricks Connect for Python. See What is Databricks Connect?. For the Scala version of this article, see Install Databricks Connect for Scala.
Requirements
To install Databricks Connect for Python, the following requirements must be met:
If you are connecting to serverless compute, your workspace must meet the requirements for serverless compute.
Note
Serverless compute is supported in Databricks Connect version 15.1 and above. In addition, Databricks Connect versions at or lower than the Databricks Runtime release on serverless are fully compatible. See Release notes. To verify if the Databricks Connect version is compatible with serverless compute, see Validate the connection to Databricks.
If you are connecting to a cluster, your target cluster must meet the cluster configuration requirements, which include Databricks Runtime version requirements.
You must have Python 3 installed on your development machine, and the minor version of Python installed on your development machine must meet the version requirements in the table below.
Compute type | Databricks Connect version | Compatible Python version
Cluster | 13.3 LTS to 14.3 LTS | 3.10
Cluster | 15.1 and above | 3.11
Serverless | 15.1 and above | 3.10
If you want to use PySpark UDFs, your development machine’s installed minor version of Python must match the minor version of Python that is included with the Databricks Runtime installed on the cluster or serverless compute. To find the minor Python version of your cluster, refer to the System environment section of the Databricks Runtime release notes for your cluster or serverless compute. See Databricks Runtime release notes versions and compatibility and Serverless compute release notes.
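As a rough illustration of the version requirements above, the following sketch compares the local interpreter with the Python version listed in the compatibility table. The function names and the (major, minor) tuple convention are hypothetical helpers, not part of Databricks Connect, and the table values are copied from this article and may change between releases.

```python
import sys

# Python versions copied from the compatibility table in this article.
# Keys are (compute type, Databricks Connect version range).
REQUIRED_PYTHON = {
    ("cluster", "13.3 LTS to 14.3 LTS"): (3, 10),
    ("cluster", "15.1 and above"): (3, 11),
    ("serverless", "15.1 and above"): (3, 10),
}


def required_python(compute_type, connect_version):
    """Return the (major, minor) Python version the table lists.

    connect_version is a (major, minor) tuple such as (14, 3).
    """
    if connect_version >= (15, 1):
        key = (compute_type, "15.1 and above")
    elif (13, 3) <= connect_version <= (14, 3):
        key = (compute_type, "13.3 LTS to 14.3 LTS")
    else:
        raise ValueError(f"Version {connect_version} is not in the table")
    if key not in REQUIRED_PYTHON:
        raise ValueError(f"No table entry for {key}")
    return REQUIRED_PYTHON[key]


def local_python_matches(compute_type, connect_version, local=None):
    """Check the local (or supplied) Python version against the table."""
    if local is None:
        local = tuple(sys.version_info[:2])
    return local == required_python(compute_type, connect_version)
```

For example, local_python_matches("cluster", (15, 1)) returns True only when the script runs under Python 3.11.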
Activate a Python virtual environment
Databricks strongly recommends that you have a Python virtual environment activated for each Python version that you use with Databricks Connect. Python virtual environments help to make sure that you are using the correct versions of Python and Databricks Connect together. For more information about these tools and how to activate them, see venv or Poetry.
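One way to confirm from Python that a virtual environment is active is to compare sys.prefix with sys.base_prefix. This is a minimal sketch using only the standard library. The function name is hypothetical, and the check applies to environments created with venv or virtualenv (which Poetry typically uses), not to other environment managers such as conda.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a virtual environment.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix
```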
Install the Databricks Connect client
This section describes how to install the Databricks Connect client with venv or Poetry.
Note
If you already have the Databricks extension for Visual Studio Code installed, you do not need to follow these setup instructions, because the Databricks extension for Visual Studio Code already has built-in support for Databricks Connect for Databricks Runtime 13.3 LTS and above. Skip to Debug code using Databricks Connect for the Databricks extension for Visual Studio Code.
Install the Databricks Connect client with venv
With your virtual environment activated, uninstall PySpark, if it is already installed, by running the uninstall command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.
# Is PySpark already installed?
pip3 show pyspark
# Uninstall PySpark
pip3 uninstall pyspark
With your virtual environment still activated, install the Databricks Connect client by running the install command. Use the --upgrade option to upgrade any existing client installation to the specified version.
pip3 install --upgrade "databricks-connect==14.3.*" # Or X.Y.* to match your cluster version.
Note
Databricks recommends that you append the “dot-asterisk” notation to specify databricks-connect==X.Y.* instead of databricks-connect==X.Y, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
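The PySpark conflict check can also be done from Python with the standard library's importlib.metadata, which reports the same installed distributions that pip3 show does. This is only a sketch. The function names are hypothetical, and it assumes the packages are installed under the distribution names pyspark and databricks-connect.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pyspark_conflict(lookup=installed_version):
    """Return True when both pyspark and databricks-connect are installed."""
    return lookup("pyspark") is not None and lookup("databricks-connect") is not None
```

If pyspark_conflict() returns True in the active environment, uninstall PySpark as described in the steps above.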
Skip ahead to Configure connection properties.
Install the Databricks Connect client with Poetry
With your virtual environment activated, uninstall PySpark, if it is already installed, by running the remove command. This is required because the databricks-connect package conflicts with PySpark. For details, see Conflicting PySpark installations. To check whether PySpark is already installed, run the show command.
# Is PySpark already installed?
poetry show pyspark
# Uninstall PySpark
poetry remove pyspark
With your virtual environment still activated, install the Databricks Connect client by running the add command.
poetry add databricks-connect@~14.3 # Or X.Y to match your cluster version.
Note
Databricks recommends that you use the “at-tilde” notation to specify databricks-connect@~14.3 instead of databricks-connect==14.3, to make sure that the most recent package is installed. While this is not a requirement, it helps make sure that you can use the latest supported features for that cluster.
Configure connection properties
In this section, you configure properties to establish a connection between Databricks Connect and your Databricks cluster or serverless compute, which includes the following:
The Databricks workspace instance name. This is the Server Hostname value for your compute. See Get connection details for a Databricks compute resource.
Any other properties that are necessary for the Databricks authentication type that you want to use.
Note
OAuth user-to-machine (U2M) authentication and OAuth machine-to-machine (M2M) authentication are supported on Databricks SDK for Python 0.19.0 and above. You might need to update your code project’s installed version of the Databricks SDK for Python to 0.19.0 or above to use OAuth U2M or M2M authentication. See Get started with the Databricks SDK for Python.
For OAuth U2M authentication, you must use the Databricks CLI to authenticate before you run your Python code. See Tutorial.
Google Cloud credentials authentication and Google Cloud ID authentication are supported on Databricks SDK for Python 0.14.0 and above. You might need to update your code project’s installed version of the Databricks SDK for Python to 0.14.0 or above to use Google Cloud credentials authentication or ID authentication. See Get started with the Databricks SDK for Python.
Configure a connection to a cluster
To configure a connection to a cluster, you will need the ID of your cluster. You can obtain the cluster ID from the URL. See Cluster URL and ID.
You can configure the connection to your cluster in one of the following ways. Databricks Connect searches for configuration properties in the following order, and uses the first configuration it finds. For advanced configuration information, see Advanced usage of Databricks Connect for Python.
The DatabricksSession class’s remote() method
For this option, which applies to Databricks personal access token authentication only, specify the workspace instance name, the Databricks personal access token, and the ID of the cluster.
You can initialize the DatabricksSession class in several ways, as follows:
Set the host, token, and cluster_id fields in DatabricksSession.builder.remote().
Use the Databricks SDK’s Config class.
Specify a Databricks configuration profile along with the cluster_id field.
Set the Spark Connect connection string in DatabricksSession.builder.remote().
Instead of specifying these connection properties in your code, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide some implementation of the proposed retrieve_* functions to get the necessary properties from the user or from some other configuration store, such as Google Cloud Secret Manager.
The code for each of these approaches is as follows:
# Set the host, token, and cluster_id fields in DatabricksSession.builder.remote.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host = f"https://{retrieve_workspace_instance_name()}",
    token = retrieve_token(),
    cluster_id = retrieve_cluster_id()
).getOrCreate()

# Use the Databricks SDK's Config class.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
    host = f"https://{retrieve_workspace_instance_name()}",
    token = retrieve_token(),
    cluster_id = retrieve_cluster_id()
)
spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()

# Specify a Databricks configuration profile along with the `cluster_id` field.
# If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
# cluster's ID, you do not also need to set the cluster_id field here.
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
    profile = "<profile-name>",
    cluster_id = retrieve_cluster_id()
)
spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
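The fourth approach, setting the Spark Connect connection string, has no example above. The following sketch shows one way it might look. The connection string layout (sc://<workspace-instance-name>:443/;token=...;x-databricks-cluster-id=...) is an assumption based on the Spark Connect string format described in the advanced usage article, so verify it against that article before relying on it. The helper names are hypothetical.

```python
def build_connection_string(workspace_instance_name, token, cluster_id):
    """Assemble a Spark Connect connection string (assumed format)."""
    return (
        f"sc://{workspace_instance_name}:443/"
        f";token={token}"
        f";x-databricks-cluster-id={cluster_id}"
    )


def create_session(connection_string):
    """Create a session from a connection string.

    The import is inside the function so that build_connection_string()
    can be used and tested without the Databricks Connect client installed.
    """
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.remote(connection_string).getOrCreate()
```

As with the other examples, the workspace instance name, token, and cluster ID would come from your own retrieve_* functions rather than from literals in your code.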
A Databricks configuration profile
For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.
The required configuration profile fields for each authentication type are as follows:
For Databricks personal access token authentication: host and token.
For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication (where supported): host.
For Google Cloud credentials authentication (where supported): host and google_credentials.
For Google Cloud ID authentication (where supported): host and google_service_account.
Then set the name of this configuration profile through the Config class.
You can specify cluster_id in a few ways, as follows:
Include the cluster_id field in your configuration profile, and then just specify the configuration profile’s name.
Specify the configuration profile name along with the cluster_id field.
If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.
The code for each of these approaches is as follows:
# Include the cluster_id field in your configuration profile, and then
# just specify the configuration profile's name:
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("<profile-name>").getOrCreate()

# Specify the configuration profile name along with the cluster_id field.
# In this example, retrieve_cluster_id() assumes some custom implementation that
# you provide to get the cluster ID from the user or from some other
# configuration store:
from databricks.connect import DatabricksSession
from databricks.sdk.core import Config

config = Config(
    profile = "<profile-name>",
    cluster_id = retrieve_cluster_id()
)
spark = DatabricksSession.builder.sdkConfig(config).getOrCreate()
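Before connecting, it can help to confirm that a profile contains the fields listed above for your authentication type. The following is a minimal sketch that parses profile text with the standard library's configparser. The authentication type labels and the function name are hypothetical, the field lists are copied from this article, and the check leaves out cluster_id because it can also come from DATABRICKS_CLUSTER_ID.

```python
import configparser

# Required profile fields per authentication type, as listed in this article.
# The dictionary keys are hypothetical labels chosen for this example.
REQUIRED_FIELDS = {
    "pat": ["host", "token"],
    "oauth-m2m": ["host", "client_id", "client_secret"],
    "oauth-u2m": ["host"],
    "google-credentials": ["host", "google_credentials"],
    "google-id": ["host", "google_service_account"],
}


def missing_profile_fields(config_text, profile, auth_type):
    """Return the required fields that the named profile does not define."""
    parser = configparser.ConfigParser(
        interpolation=None,            # Keep "%" characters in tokens intact.
        default_section="__unused__",  # Treat [DEFAULT] as an ordinary profile.
    )
    parser.read_string(config_text)
    fields = parser[profile] if parser.has_section(profile) else {}
    return [name for name in REQUIRED_FIELDS[auth_type] if name not in fields]
```

To check a real file, read the contents of your configuration profile file and pass the text as config_text.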
The DATABRICKS_CONFIG_PROFILE environment variable
For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.
If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.
The required configuration profile fields for each authentication type are as follows:
For Databricks personal access token authentication: host and token.
For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication (where supported): host.
For Google Cloud credentials authentication (where supported): host and google_credentials.
For Google Cloud ID authentication (where supported): host and google_service_account.
Set the DATABRICKS_CONFIG_PROFILE environment variable to the name of this configuration profile. Then initialize the DatabricksSession class as follows:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
An environment variable for each configuration property
For this option, set the DATABRICKS_CLUSTER_ID environment variable and any other environment variables that are necessary for the Databricks authentication type that you want to use.
The required environment variables for each authentication type are as follows:
For Databricks personal access token authentication: DATABRICKS_HOST and DATABRICKS_TOKEN.
For OAuth machine-to-machine (M2M) authentication (where supported): DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET.
For OAuth user-to-machine (U2M) authentication (where supported): DATABRICKS_HOST.
For Google Cloud credentials authentication (where supported): DATABRICKS_HOST and GOOGLE_CREDENTIALS.
For Google Cloud ID authentication (where supported): DATABRICKS_HOST and GOOGLE_SERVICE_ACCOUNT.
Then initialize the DatabricksSession class as follows:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
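When you rely on environment variables alone, a quick pre-flight check can report which of the variables listed above are missing before you create the session. This is a sketch only. The authentication type labels and the function name are hypothetical, and the variable names are copied from this article.

```python
import os

# Required environment variables per authentication type, as listed in this
# article. The dictionary keys are hypothetical labels for this example.
REQUIRED_ENV_VARS = {
    "pat": ["DATABRICKS_HOST", "DATABRICKS_TOKEN"],
    "oauth-m2m": ["DATABRICKS_HOST", "DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET"],
    "oauth-u2m": ["DATABRICKS_HOST"],
    "google-credentials": ["DATABRICKS_HOST", "GOOGLE_CREDENTIALS"],
    "google-id": ["DATABRICKS_HOST", "GOOGLE_SERVICE_ACCOUNT"],
}


def missing_env_vars(auth_type, env=None):
    """Return the required variables that are unset or empty.

    DATABRICKS_CLUSTER_ID is always checked because this option relies on it.
    Pass a dict as env to check something other than the process environment.
    """
    env = os.environ if env is None else env
    required = ["DATABRICKS_CLUSTER_ID"] + REQUIRED_ENV_VARS[auth_type]
    return [name for name in required if not env.get(name)]
```

An empty list from missing_env_vars("pat") means that every variable this option needs for personal access token authentication is set.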
A Databricks configuration profile named DEFAULT
For this option, create or identify a Databricks configuration profile containing the field cluster_id and any other fields that are necessary for the Databricks authentication type that you want to use.
If you have already set the DATABRICKS_CLUSTER_ID environment variable with the cluster’s ID, you do not also need to specify cluster_id.
The required configuration profile fields for each authentication type are as follows:
For Databricks personal access token authentication: host and token.
For OAuth machine-to-machine (M2M) authentication (where supported): host, client_id, and client_secret.
For OAuth user-to-machine (U2M) authentication (where supported): host.
For Google Cloud credentials authentication (where supported): host and google_credentials.
For Google Cloud ID authentication (where supported): host and google_service_account.
Name this configuration profile DEFAULT.
Then initialize the DatabricksSession class as follows:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
Configure a connection to serverless compute
Preview
This feature is in Public Preview.
Databricks Connect supports connecting to serverless compute. To use this feature, requirements for connecting to serverless must be met. See Requirements.
Important
This feature has the following limitations:
All of the Databricks Connect for Python limitations
All of the serverless compute limitations
Only Python dependencies that are included as part of the serverless compute environment can be used for UDFs. See System environment. Additional dependencies cannot be installed.
UDFs with custom modules are not supported.
You can configure a connection to serverless compute in one of the following ways:
Set the local environment variable DATABRICKS_SERVERLESS_COMPUTE_ID to auto. If this environment variable is set, Databricks Connect ignores the cluster_id.
In a local Databricks configuration profile, set serverless_compute_id = auto, then reference that profile from your Databricks Connect Python code.
[DEFAULT]
host = https://my-workspace.cloud.databricks.com/
serverless_compute_id = auto
token = dapi123...
Alternatively, update your Databricks Connect Python code in either of the following ways:
# Use the builder's serverless() method.
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.serverless(True).getOrCreate()

# Or pass serverless=True to the builder's remote() method.
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.remote(serverless=True).getOrCreate()
Note
The serverless compute session times out after 10 minutes of inactivity. After this, the Python process needs to be restarted on the client side to create a new Spark session to connect to serverless compute.
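The environment variable option described earlier in this section can also be applied from within Python, provided the variable is set before the session is created. The following sketch assumes that Databricks Connect reads DATABRICKS_SERVERLESS_COMPUTE_ID when getOrCreate() runs. The function names are hypothetical.

```python
import os


def enable_serverless(env=None):
    """Set DATABRICKS_SERVERLESS_COMPUTE_ID to auto in the given environment."""
    env = os.environ if env is None else env
    env["DATABRICKS_SERVERLESS_COMPUTE_ID"] = "auto"
    return env


def create_serverless_session():
    """Create a session after enabling serverless for this process.

    The import is inside the function so that enable_serverless() can be
    used without the Databricks Connect client installed.
    """
    from databricks.connect import DatabricksSession

    enable_serverless()
    return DatabricksSession.builder.getOrCreate()
```

Setting the variable in your shell or IDE run configuration has the same effect and keeps the setting out of your code.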
Validate the connection to Databricks
To validate that your environment, default credentials, and connection to compute are correctly set up for Databricks Connect, run the databricks-connect test command, which fails with a non-zero exit code and a corresponding error message when it detects any incompatibility in the setup.
databricks-connect test
You can also validate your environment in your Python code using validateSession():
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.validateSession(True).getOrCreate()
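Because databricks-connect test signals failure through its exit code, it can be wrapped in a script, for example as a pre-flight step in automation. The following is a minimal sketch using the standard library's subprocess module. The function names are hypothetical, and the sketch assumes that the databricks-connect command is on your PATH in the active virtual environment.

```python
import subprocess
import sys


def run_check(command=("databricks-connect", "test")):
    """Run the command and return (succeeded, combined output)."""
    try:
        result = subprocess.run(list(command), capture_output=True, text=True)
    except FileNotFoundError:
        return False, f"Command not found: {command[0]}"
    return result.returncode == 0, result.stdout + result.stderr


def exit_on_failure(command=("databricks-connect", "test")):
    """Print the output to stderr and stop the script if the check fails."""
    succeeded, output = run_check(command)
    if not succeeded:
        print(output, file=sys.stderr)
        sys.exit(1)
```

Calling exit_on_failure() at the top of a script stops it early with the error message from the test command when the setup is incompatible.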