Install Databricks Connect for Scala
Note
This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.
This article describes how to install Databricks Connect for Scala. See What is Databricks Connect?. For the Python version of this article, see Install Databricks Connect for Python.
Requirements
- Your target Databricks workspace and cluster must meet the requirements for Cluster configuration for Databricks Connect.
- The Java Development Kit (JDK) installed on your development machine. Databricks recommends that the version of your JDK installation matches the JDK version on your Databricks cluster. To find the JDK version on your cluster, refer to the "System environment" section of the Databricks Runtime release notes for your cluster. For instance, `Zulu 8.70.0.23-CA-linux64` corresponds to JDK 8. See Databricks Runtime release notes versions and compatibility.
- Scala installed on your development machine. Databricks recommends that the version of your Scala installation matches the Scala version on your Databricks cluster. To find the Scala version on your cluster, refer to the "System environment" section of the Databricks Runtime release notes for your cluster. See Databricks Runtime release notes versions and compatibility.
- A Scala build tool on your development machine, such as `sbt`.
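To confirm your local toolchain before you continue, you can print the installed JDK and Scala versions. This is a minimal sketch; it assumes `java` and `scala` are on your `PATH` and reports when they are not:

```shell
# Print local JDK and Scala versions, if the tools are installed.
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "java not found on PATH"
fi
if command -v scala >/dev/null 2>&1; then
  scala -version 2>&1 | head -n 1
else
  echo "scala not found on PATH"
fi
```

Compare the printed versions against the "System environment" section of your cluster's release notes.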
Set up the client
After you meet the requirements for Databricks Connect, complete the following steps to set up the Databricks Connect client.
Step 1: Add a reference to the Databricks Connect client
In your Scala project's build file, such as `build.sbt` for sbt, `pom.xml` for Maven, or `build.gradle` for Gradle, add the following reference to the Databricks Connect client.

For sbt:

```scala
libraryDependencies += "com.databricks" % "databricks-connect" % "14.0.0"
```

For Maven:

```xml
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-connect</artifactId>
  <version>14.0.0</version>
</dependency>
```

For Gradle:

```groovy
implementation 'com.databricks:databricks-connect:14.0.0'
```
Replace `14.0.0` with the version of the Databricks Connect library that matches the Databricks Runtime version on your cluster. You can find the Databricks Connect library version numbers in the Maven Central Repository.
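For sbt, the dependency line belongs in `build.sbt` alongside your Scala version setting. The following is a minimal sketch of a complete `build.sbt`; the project name is hypothetical, and the `2.12.15` Scala version is an assumption, so match `scalaVersion` to your cluster's Scala version and `14.0.0` to your cluster's Databricks Runtime:

```scala
// Minimal build.sbt sketch (hypothetical project name; versions are assumptions).
ThisBuild / scalaVersion := "2.12.15"  // match your cluster's Scala version

lazy val root = (project in file("."))
  .settings(
    name := "databricks-connect-demo",  // hypothetical project name
    libraryDependencies += "com.databricks" % "databricks-connect" % "14.0.0"
  )
```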
Step 2: Configure connection properties
In this section, you configure properties to establish a connection between Databricks Connect and your remote Databricks cluster. These properties include settings to authenticate Databricks Connect with your cluster.
For Databricks Connect for Databricks Runtime 13.3 LTS and above, for Scala, Databricks Connect includes the Databricks SDK for Java. This SDK implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach makes setting up and automating authentication with Databricks more centralized and predictable. It enables you to configure Databricks authentication once and then use that configuration across multiple Databricks tools and SDKs without further authentication configuration changes.
Note
OAuth user-to-machine (U2M) authentication is supported on Databricks SDK for Java 0.18.0 and above. You might need to update your code project’s installed version of the Databricks SDK for Java to 0.18.0 or above to use OAuth U2M authentication. See Get started with the Databricks SDK for Java.
For OAuth U2M authentication, you must use the Databricks CLI to authenticate before you run your Scala code. See the Tutorial.
OAuth machine-to-machine (M2M) authentication is supported on Databricks SDK for Java 0.17.0 and above. You might need to update your code project’s installed version of the Databricks SDK for Java to 0.17.0 or above to use OAuth M2M authentication. See Get started with the Databricks SDK for Java.
Google Cloud credentials authentication and Google Cloud ID authentication are supported on Databricks SDK for Java 0.14.0 and above. You might need to update your code project’s installed version of the Databricks SDK for Java to 0.14.0 or above to use Google Cloud credentials authentication or ID authentication. See Get started with the Databricks SDK for Java.
Collect the following configuration properties:

- The Databricks workspace instance name. This is the same as the Server Hostname value for your cluster; see Get connection details for a Databricks compute resource.
- The ID of your cluster. You can obtain the cluster ID from the cluster's URL. See Cluster URL and ID.
- Any other properties that are necessary for the Databricks authentication type that you want to use. These properties are described throughout this section.
Configure the connection within your code. Databricks Connect searches for configuration properties in the following order until it finds them. Once it finds them, it stops searching through the remaining options. The details for each option appear after the following table:
| Configuration properties option | Applies to |
| --- | --- |
| The `DatabricksSession` class's `remote()` method | Databricks personal access token authentication only |
| A Databricks configuration profile | All Databricks authentication types |
| The `SPARK_REMOTE` environment variable | Databricks personal access token authentication only |
| The `DATABRICKS_CONFIG_PROFILE` environment variable | All Databricks authentication types |
| An environment variable for each configuration property | All Databricks authentication types |
| A Databricks configuration profile named `DEFAULT` | All Databricks authentication types |
The `DatabricksSession` class's `remote()` method

For this option, which applies to Databricks personal access token authentication only, specify the workspace instance name, the Databricks personal access token, and the ID of the cluster.

You can initialize the `DatabricksSession` class in several ways, as follows:

- Set the `host`, `token`, and `clusterId` fields in `DatabricksSession.builder`.
- Use the Databricks SDK's `Config` class.
- Specify a Databricks configuration profile along with the `clusterId` field.

Databricks does not recommend that you directly specify these connection properties in your code. Instead, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide some implementation of the proposed `retrieve*` functions yourself to get the necessary properties from the user or from some other configuration store, such as Google Cloud Secret Manager.

The code for each of these approaches is as follows:
```scala
// Set the host, token, and clusterId fields in DatabricksSession.builder.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder()
  .host(retrieveWorkspaceInstanceName())
  .token(retrieveToken())
  .clusterId(retrieveClusterId())
  .getOrCreate()
```

```scala
// Use the Databricks SDK's Config class.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setHost(retrieveWorkspaceInstanceName())
  .setToken(retrieveToken())
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
```

```scala
// Specify a Databricks configuration profile along with the clusterId field.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
```
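Once any of these initializations succeeds, you can sanity-check the connection with a trivial query. This sketch assumes a running cluster that your configured credentials can reach, so it only works against a live workspace:

```scala
// Run a small query to confirm that the session reaches the cluster.
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
val n = spark.range(10).count()  // executes remotely on the cluster
println(s"Connected; counted $n rows")
```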
A Databricks configuration profile

For this option, create or identify a Databricks configuration profile containing the field `cluster_id` and any other fields that are necessary for the Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:

- For Databricks personal access token authentication: `host` and `token`.
- For OAuth machine-to-machine (M2M) authentication (where supported): `host`, `client_id`, and `client_secret`.
- For OAuth user-to-machine (U2M) authentication (where supported): `host`.
- For Google Cloud credentials authentication (where supported): `host` and `google_credentials`.
- For Google Cloud ID authentication (where supported): `host` and `google_service_account`.
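As an illustration, a configuration profile for Databricks personal access token authentication in your `.databrickscfg` file might look like the following; the profile name is hypothetical, and the angle-bracket values are placeholders for your own settings:

```ini
[my-profile]
host       = https://<workspace-instance-name>
token      = <access-token-value>
cluster_id = <cluster-id>
```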
Then set the name of this configuration profile through the `DatabricksConfig` class.

You can specify `cluster_id` in a few ways, as follows:

- Include the `cluster_id` field in your configuration profile, and then just specify the configuration profile's name.
- Specify the configuration profile name along with the `clusterId` field.

If you have already set the `DATABRICKS_CLUSTER_ID` environment variable with the cluster's ID, you do not also need to specify the `cluster_id` or `clusterId` fields.

The code for each of these approaches is as follows:
```scala
// Include the cluster_id field in your configuration profile, and then
// just specify the configuration profile's name:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .getOrCreate()
```

```scala
// Specify the configuration profile name along with the clusterId field.
// In this example, retrieveClusterId() assumes some custom implementation that
// you provide to get the cluster ID from the user or from some other
// configuration store:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
```
The `SPARK_REMOTE` environment variable

For this option, which applies to Databricks personal access token authentication only, set the `SPARK_REMOTE` environment variable to the following string, replacing the placeholders with the appropriate values:

```
sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>
```
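In a POSIX shell, for example, the export might look like the following; the workspace name, token, and cluster ID shown are hypothetical stand-ins for the placeholders above:

```shell
# Hypothetical values; substitute your own workspace name, token, and cluster ID.
export SPARK_REMOTE="sc://my-workspace.gcp.databricks.com:443/;token=dapi0123456789abcdef;x-databricks-cluster-id=0123-456789-abcdefgh"
echo "$SPARK_REMOTE" | cut -d';' -f1  # prints the sc:// endpoint portion
```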
Then initialize the `DatabricksSession` class as follows:

```scala
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```
To set environment variables, see your operating system’s documentation.
The `DATABRICKS_CONFIG_PROFILE` environment variable

For this option, create or identify a Databricks configuration profile containing the field `cluster_id` and any other fields that are necessary for the Databricks authentication type that you want to use.

If you have already set the `DATABRICKS_CLUSTER_ID` environment variable with the cluster's ID, you do not also need to specify `cluster_id`.

The required configuration profile fields for each authentication type are as follows:

- For Databricks personal access token authentication: `host` and `token`.
- For OAuth machine-to-machine (M2M) authentication (where supported): `host`, `client_id`, and `client_secret`.
- For OAuth user-to-machine (U2M) authentication (where supported): `host`.
- For Google Cloud credentials authentication (where supported): `host` and `google_credentials`.
- For Google Cloud ID authentication (where supported): `host` and `google_service_account`.
Set the `DATABRICKS_CONFIG_PROFILE` environment variable to the name of this configuration profile. Then initialize the `DatabricksSession` class as follows:

```scala
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```
To set environment variables, see your operating system’s documentation.
An environment variable for each configuration property

For this option, set the `DATABRICKS_CLUSTER_ID` environment variable and any other environment variables that are necessary for the Databricks authentication type that you want to use.

The required environment variables for each authentication type are as follows:

- For Databricks personal access token authentication: `DATABRICKS_HOST` and `DATABRICKS_TOKEN`.
- For OAuth machine-to-machine (M2M) authentication (where supported): `DATABRICKS_HOST`, `DATABRICKS_CLIENT_ID`, and `DATABRICKS_CLIENT_SECRET`.
- For OAuth user-to-machine (U2M) authentication (where supported): `DATABRICKS_HOST`.
- For Google Cloud credentials authentication (where supported): `DATABRICKS_HOST` and `GOOGLE_CREDENTIALS`.
- For Google Cloud ID authentication (where supported): `DATABRICKS_HOST` and `GOOGLE_SERVICE_ACCOUNT`.
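For example, for Databricks personal access token authentication in a POSIX shell, the exports might look like the following; all values shown are hypothetical placeholders for your own workspace settings:

```shell
# Hypothetical placeholder values; substitute your own.
export DATABRICKS_HOST="https://my-workspace.gcp.databricks.com"
export DATABRICKS_TOKEN="dapi0123456789abcdef"
export DATABRICKS_CLUSTER_ID="0123-456789-abcdefgh"
```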
Then initialize the `DatabricksSession` class as follows:

```scala
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```
To set environment variables, see your operating system’s documentation.
A Databricks configuration profile named `DEFAULT`

For this option, create or identify a Databricks configuration profile containing the field `cluster_id` and any other fields that are necessary for the Databricks authentication type that you want to use.

If you have already set the `DATABRICKS_CLUSTER_ID` environment variable with the cluster's ID, you do not also need to specify `cluster_id`.

The required configuration profile fields for each authentication type are as follows:

- For Databricks personal access token authentication: `host` and `token`.
- For OAuth machine-to-machine (M2M) authentication (where supported): `host`, `client_id`, and `client_secret`.
- For OAuth user-to-machine (U2M) authentication (where supported): `host`.
- For Google Cloud credentials authentication (where supported): `host` and `google_credentials`.
- For Google Cloud ID authentication (where supported): `host` and `google_service_account`.

Name this configuration profile `DEFAULT`.

Then initialize the `DatabricksSession` class as follows:

```scala
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```