Use the Spark shell with Databricks Connect for Python
Note
This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.
This article covers how to use Databricks Connect for Python and the Spark shell. Databricks Connect enables you to connect popular applications to Databricks compute. See What is Databricks Connect?.
Note
Before you begin to use Databricks Connect, you must set up the Databricks Connect client.
The Spark shell works with Databricks personal access token authentication only.
To use Databricks Connect with the Spark shell and Python, follow these instructions.
To start the Spark shell and to connect it to your running cluster, run one of the following commands from your activated Python virtual environment:
If you have set the SPARK_REMOTE environment variable, run the following command:
pyspark
If you have not set the SPARK_REMOTE environment variable, run the following command instead:
pyspark --remote "sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"
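The choice between the two commands above can be sketched in Python. This is a minimal illustration, not part of Databricks Connect: the helper names `build_remote_url` and `pyspark_command` are hypothetical, and the URL layout is taken from the command shown above.

```python
def build_remote_url(workspace_instance_name, access_token, cluster_id):
    """Assemble a Spark Connect URL in the format used by the command above.

    Hypothetical helper for illustration; it only concatenates strings.
    """
    return (
        f"sc://{workspace_instance_name}:443/"
        f";token={access_token}"
        f";x-databricks-cluster-id={cluster_id}"
    )


def pyspark_command(env, workspace_instance_name=None, access_token=None,
                    cluster_id=None):
    """Return the argument list for starting the Spark shell.

    `env` is a mapping such as os.environ. If SPARK_REMOTE is set,
    plain `pyspark` is enough; otherwise the URL goes in --remote.
    """
    if env.get("SPARK_REMOTE"):
        return ["pyspark"]
    url = build_remote_url(workspace_instance_name, access_token, cluster_id)
    return ["pyspark", "--remote", url]
```

Passing `os.environ` as `env` would reflect the current shell. The returned list could then be handed to `subprocess.run` from the activated virtual environment.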
The Spark shell appears, for example:
Python 3.10 ... [Clang ...] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 13.x.dev0
      /_/

Using Python version 3.10 ...
Client connected to the Spark Connect server at sc://...:.../;token=...;x-databricks-cluster-id=...
SparkSession available as 'spark'.
>>>
Now run a simple PySpark command, such as spark.range(1,10).show(). If there are no errors, you have successfully connected.
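The same check can be wrapped in a small function if you prefer a yes/no answer to reading the printed output. This is a sketch only: `check_connection` is a hypothetical name, and it assumes the `spark` session provided by the shell is passed in.

```python
def check_connection(spark):
    """Return True if a trivial remote job yields the expected row count.

    spark.range(1, 10) produces the integers 1 through 9, so a working
    connection should report 9 rows.
    """
    return spark.range(1, 10).count() == 9
```

In the Spark shell you would call `check_connection(spark)`. An exception raised here, rather than a `False` result, usually points to a connection or authentication problem.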
Refer to Interactive Analysis with the Spark Shell for information about how to use the Spark shell with Python to run commands on your compute.
Use the built-in spark variable to represent the SparkSession on your running cluster, for example:
>>> df = spark.read.table("samples.nyctaxi.trips")
>>> df.show(5)
+--------------------+---------------------+-------------+-----------+----------+-----------+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|pickup_zip|dropoff_zip|
+--------------------+---------------------+-------------+-----------+----------+-----------+
| 2016-02-14 16:52:13|  2016-02-14 17:16:04|         4.94|       19.0|     10282|      10171|
| 2016-02-04 18:44:19|  2016-02-04 18:46:00|         0.28|        3.5|     10110|      10110|
| 2016-02-17 17:13:57|  2016-02-17 17:17:55|          0.7|        5.0|     10103|      10023|
| 2016-02-18 10:36:07|  2016-02-18 10:41:45|          0.8|        6.0|     10022|      10017|
| 2016-02-22 14:14:41|  2016-02-22 14:31:52|         4.51|       17.0|     10110|      10282|
+--------------------+---------------------+-------------+-----------+----------+-----------+
only showing top 5 rows
All Python code runs locally, whereas PySpark code that involves DataFrame operations runs on the cluster in the remote Databricks workspace, and the results are sent back to the local caller.
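The split between local and remote execution can be illustrated with a short sketch. The function name `average_fare` and the `sample_size` parameter are hypothetical. The table and column names come from the example above, and the sketch assumes the shell's `spark` session is passed in.

```python
def average_fare(spark, table_name="samples.nyctaxi.trips", sample_size=100):
    """Average the fare over a small sample, marking where each step runs."""
    # DataFrame operations: these describe work for the remote cluster.
    df = spark.read.table(table_name).select("fare_amount").limit(sample_size)

    # collect() runs the query on the cluster and sends the rows back
    # to the local caller.
    rows = df.collect()

    # Plain Python from here on: this runs locally on the returned rows.
    fares = [row["fare_amount"] for row in rows
             if row["fare_amount"] is not None]
    return sum(fares) / len(fares) if fares else None
```

Keeping `sample_size` small limits how much data travels back to the local machine. For larger aggregations, it is generally better to express the computation as DataFrame operations so that it stays on the cluster.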
To stop the Spark shell, press Ctrl + d or Ctrl + z, or run the command quit() or exit().