Advanced usage of Databricks Connect for Python
Note
This article covers Databricks Connect for Databricks Runtime 14.0 and above.
This article describes topics that go beyond the basic setup of Databricks Connect.
Configure the Spark Connect connection string
In addition to connecting to your cluster using the options outlined in Configure a connection to a cluster, a more advanced option is connecting using the Spark Connect connection string. You can pass the string in the remote function or set the SPARK_REMOTE environment variable.
Note
You can only use Databricks personal access token authentication to connect using the Spark Connect connection string.
To set the connection string using the remote function:
# Set the Spark Connect connection string in DatabricksSession.builder.remote.
from databricks.connect import DatabricksSession

# The retrieve_* functions are placeholders; supply your own values here.
workspace_instance_name = retrieve_workspace_instance_name()
token = retrieve_token()
cluster_id = retrieve_cluster_id()

spark = DatabricksSession.builder.remote(
    f"sc://{workspace_instance_name}:443/;token={token};x-databricks-cluster-id={cluster_id}"
).getOrCreate()
Alternatively, set the SPARK_REMOTE environment variable:
sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>
Then initialize the DatabricksSession class as follows:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
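Because the connection string has a fixed shape, it can be assembled and exported from Python before the session is created. The following is a minimal sketch: build_spark_remote is a hypothetical helper name, not part of Databricks Connect, and the values passed to it are placeholders.

```python
import os


def build_spark_remote(workspace_instance_name: str, token: str, cluster_id: str) -> str:
    """Assemble a Spark Connect connection string in the format shown above."""
    # Hypothetical helper: reject empty parts so a malformed string fails early.
    parts = {
        "workspace_instance_name": workspace_instance_name,
        "token": token,
        "cluster_id": cluster_id,
    }
    for name, value in parts.items():
        if not value:
            raise ValueError(f"{name} must not be empty")
    return (
        f"sc://{workspace_instance_name}:443/"
        f";token={token};x-databricks-cluster-id={cluster_id}"
    )


# Placeholder values for illustration only; replace them with real ones.
connection_string = build_spark_remote(
    "<workspace-instance-name>", "<access-token-value>", "<cluster-id>"
)
os.environ["SPARK_REMOTE"] = connection_string
```

With SPARK_REMOTE set this way, DatabricksSession.builder.getOrCreate() can be called without arguments, as in the snippet above.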
PySpark shell
Databricks Connect for Python ships with a pyspark binary, which is a PySpark REPL (a Spark shell) configured to use Databricks Connect.
When started with no additional parameters, the shell picks up default credentials from the environment (for example, the DATABRICKS_ environment variables or the DEFAULT configuration profile) to connect to the Databricks cluster. For information about configuring a connection, see Compute configuration for Databricks Connect.
To start the Spark shell and connect it to your running cluster, run the following command from your activated Python virtual environment:
pyspark
The Spark shell appears, for example:
Python 3.10 ... [Clang ...] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 13.x.dev0
      /_/

Using Python version 3.10 ...
Client connected to the Spark Connect server at sc://...:.../;token=...;x-databricks-cluster-id=...
SparkSession available as 'spark'.
>>>
Once the shell starts up, the spark object is available to run Apache Spark commands on the Databricks cluster. Run a simple PySpark command, such as spark.range(1,10).show(). If there are no errors, you have successfully connected.
Refer to Interactive Analysis with the Spark Shell for information about how to use the Spark shell with Python to run commands on your compute.
Use the built-in spark variable to represent the SparkSession on your running cluster, for example:
>>> df = spark.read.table("samples.nyctaxi.trips")
>>> df.show(5)
+--------------------+---------------------+-------------+-----------+----------+-----------+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|pickup_zip|dropoff_zip|
+--------------------+---------------------+-------------+-----------+----------+-----------+
| 2016-02-14 16:52:13|  2016-02-14 17:16:04|         4.94|       19.0|     10282|      10171|
| 2016-02-04 18:44:19|  2016-02-04 18:46:00|         0.28|        3.5|     10110|      10110|
| 2016-02-17 17:13:57|  2016-02-17 17:17:55|          0.7|        5.0|     10103|      10023|
| 2016-02-18 10:36:07|  2016-02-18 10:41:45|          0.8|        6.0|     10022|      10017|
| 2016-02-22 14:14:41|  2016-02-22 14:31:52|         4.51|       17.0|     10110|      10282|
+--------------------+---------------------+-------------+-----------+----------+-----------+
only showing top 5 rows
All Python code runs locally, while all PySpark code involving DataFrame operations runs on the cluster in the remote Databricks workspace, and the responses are sent back to the local caller.
To stop the Spark shell, press Ctrl + d or Ctrl + z, or run the command quit() or exit().
Additional HTTP headers
Databricks Connect communicates with Databricks clusters via gRPC over HTTP/2.
Some advanced users may choose to install a proxy service between the client and the Databricks cluster, to have better control over the requests coming from their clients.
The proxies, in some cases, may require custom headers in the HTTP requests.
The header() method can be used to add custom headers to the HTTP requests.
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.header('x-custom-header', 'value').getOrCreate()
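When a proxy requires several headers, the same call can be repeated once per header. The following is a minimal sketch, assuming only that each header(name, value) call returns a builder that can be chained, as in the one-line example above. apply_headers is a hypothetical helper name, and the header names in the usage comment are illustrative.

```python
from typing import Any, Mapping


def apply_headers(builder: Any, headers: Mapping[str, str]) -> Any:
    """Call builder.header(name, value) once per entry and return the result."""
    # Hypothetical helper: relies only on the chained header() call shown above.
    for name, value in headers.items():
        builder = builder.header(name, value)
    return builder


# Usage (requires Databricks Connect; header names are illustrative):
# from databricks.connect import DatabricksSession
# spark = apply_headers(
#     DatabricksSession.builder,
#     {"x-custom-header": "value", "x-another-header": "other-value"},
# ).getOrCreate()
```

Keeping the headers in a mapping makes it easier to load them from configuration rather than hard-coding them at the call site.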
Certificates
If your cluster relies on a custom SSL/TLS certificate to resolve a Databricks workspace fully qualified domain name (FQDN), you must set the environment variable GRPC_DEFAULT_SSL_ROOTS_FILE_PATH on your local development machine. This environment variable must be set to the full path to the installed certificate on the cluster.
For example, you can set this environment variable in Python code as follows:
import os
os.environ["GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"] = "/etc/ssl/certs/ca-bundle.crt"
For other ways to set environment variables, see your operating system’s documentation.
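A wrong certificate path tends to surface later as a connection error, so it can help to check the file before exporting the variable. The following is a minimal sketch: set_grpc_root_certs is a hypothetical helper name, and it assumes the variable is set before the Databricks Connect session is created.

```python
import os
from pathlib import Path


def set_grpc_root_certs(cert_path: str) -> str:
    """Export GRPC_DEFAULT_SSL_ROOTS_FILE_PATH after checking the file exists."""
    # Hypothetical helper: resolve to a full path, as the variable expects one.
    path = Path(cert_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Certificate file not found: {path}")
    os.environ["GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"] = str(path)
    return str(path)


# Usage, with the example path from above (assumed to exist on your machine):
# set_grpc_root_certs("/etc/ssl/certs/ca-bundle.crt")
```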
Logging and debug logs
Databricks Connect for Python produces logs using standard Python logging.
Logs are emitted to the standard error stream (stderr), and by default only logs at WARN level and higher are emitted.
Setting the environment variable SPARK_CONNECT_LOG_LEVEL=debug modifies this default and prints all log messages at the DEBUG level and higher.
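The variable can also be set from Python, in the same way as the certificate path above. The following is a minimal sketch. It assumes the value is read when the Databricks Connect modules are loaded, so the assignment is placed before any databricks.connect import.

```python
import os

# Assumption: set this before importing databricks.connect so the log level
# is picked up when the client libraries configure their logging.
os.environ["SPARK_CONNECT_LOG_LEVEL"] = "debug"

# Then create the session as usual (requires Databricks Connect):
# from databricks.connect import DatabricksSession
# spark = DatabricksSession.builder.getOrCreate()
```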