Use the Spark shell with Databricks Connect for Python

Note

This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.

This article covers how to use Databricks Connect for Python and the Spark shell. Databricks Connect enables you to connect popular applications to Databricks compute. See What is Databricks Connect?.

Note

Before you begin to use Databricks Connect, you must set up the Databricks Connect client.

The Spark shell works with Databricks personal access token authentication only.

To use Databricks Connect with the Spark shell and Python, follow these instructions.

  1. To start the Spark shell and to connect it to your running cluster, run one of the following commands from your activated Python virtual environment:

    If you have set the SPARK_REMOTE environment variable, run the following command:

    pyspark
    

    If you have not set the SPARK_REMOTE environment variable, run the following command instead:

    pyspark --remote "sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"
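Alternatively, you can set the SPARK_REMOTE environment variable yourself so that a plain pyspark invocation connects without extra flags. A minimal sketch, using the same placeholder values as the command above:

```shell
# Set SPARK_REMOTE before launching the shell. Replace each <placeholder>
# with your workspace instance name, access token, and cluster ID.
export SPARK_REMOTE="sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"
pyspark
```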
    

    The Spark shell appears, for example:

    Python 3.10 ...
    [Clang ...] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 13.x.dev0
         /_/
    
    Using Python version 3.10 ...
    Client connected to the Spark Connect server at sc://...:.../;token=...;x-databricks-cluster-id=...
    SparkSession available as 'spark'.
    >>>
    

    Now run a simple PySpark command, such as spark.range(1,10).show(). If no errors appear, you have successfully connected.

  2. Refer to Interactive Analysis with the Spark Shell for information about how to use the Spark shell with Python to run commands on your compute.

    Use the built-in spark variable to represent the SparkSession on your running cluster, for example:

    >>> df = spark.read.table("samples.nyctaxi.trips")
    >>> df.show(5)
    +--------------------+---------------------+-------------+-----------+----------+-----------+
    |tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|pickup_zip|dropoff_zip|
    +--------------------+---------------------+-------------+-----------+----------+-----------+
    | 2016-02-14 16:52:13|  2016-02-14 17:16:04|         4.94|       19.0|     10282|      10171|
    | 2016-02-04 18:44:19|  2016-02-04 18:46:00|         0.28|        3.5|     10110|      10110|
    | 2016-02-17 17:13:57|  2016-02-17 17:17:55|          0.7|        5.0|     10103|      10023|
    | 2016-02-18 10:36:07|  2016-02-18 10:41:45|          0.8|        6.0|     10022|      10017|
    | 2016-02-22 14:14:41|  2016-02-22 14:31:52|         4.51|       17.0|     10110|      10282|
    +--------------------+---------------------+-------------+-----------+----------+-----------+
    only showing top 5 rows
    

    All Python code runs locally, while all PySpark code involving DataFrame operations runs on the cluster in the remote Databricks workspace, and the results are sent back to the local caller.

  3. To stop the Spark shell, press Ctrl + d or Ctrl + z, or run the command quit() or exit().