Use the Spark shell with Databricks Connect for Python

Note

This article covers Databricks Connect for Databricks Runtime 13.0 and above.

This article covers how to use Databricks Connect for Python and the Spark shell. Databricks Connect enables you to connect popular applications to Databricks clusters. See What is Databricks Connect?.

Note

Before you begin to use Databricks Connect, you must set up the Databricks Connect client.

The Spark shell works with Databricks personal access token authentication authentication only.

To use Databricks Connect with the Spark shell and Python, follow these instructions.

  1. To start the Spark shell and to connect it to your running cluster, run one of the following commands from your activated Python virtual environment:

    If you set the SPARK_REMOTE environment variable earlier, run the following command:

    pyspark
    

    If you did not set the SPARK_REMOTE environment variable earlier, run the following command instead:

    pyspark --remote "sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>"
    

    The Spark shell appears, for example:

    Python 3.10 ...
    [Clang ...] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 13.x.dev0
         /_/
    
    Using Python version 3.10 ...
    Client connected to the Spark Connect server at sc://...:.../;token=...;x-databricks-cluster-id=...
    SparkSession available as 'spark'.
    >>>
    
  2. Refer to Interactive Analysis with the Spark Shell for information about how to use the Spark shell with Python to run commands on your cluster.

    Use the built-in spark variable to represent the SparkSession on your running cluster, for example:

    >>> df = spark.read.table("samples.nyctaxi.trips")
    >>> df.show(5)
    +--------------------+---------------------+-------------+-----------+----------+-----------+
    |tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|pickup_zip|dropoff_zip|
    +--------------------+---------------------+-------------+-----------+----------+-----------+
    | 2016-02-14 16:52:13|  2016-02-14 17:16:04|         4.94|       19.0|     10282|      10171|
    | 2016-02-04 18:44:19|  2016-02-04 18:46:00|         0.28|        3.5|     10110|      10110|
    | 2016-02-17 17:13:57|  2016-02-17 17:17:55|          0.7|        5.0|     10103|      10023|
    | 2016-02-18 10:36:07|  2016-02-18 10:41:45|          0.8|        6.0|     10022|      10017|
    | 2016-02-22 14:14:41|  2016-02-22 14:31:52|         4.51|       17.0|     10110|      10282|
    +--------------------+---------------------+-------------+-----------+----------+-----------+
    only showing top 5 rows
    

    All Python code runs locally, while all PySpark code involving DataFrame operations runs on the cluster in the remote Databricks workspace and run responses are sent back to the local caller.

  3. To stop the Spark shell, press Ctrl + d or Ctrl + z, or run the command quit() or exit().