VSCode extension for Databricks tutorial: Run Python on a cluster and as a job

This tutorial demonstrates how to get started with the Databricks extension for Visual Studio Code by running a basic Python code file on a Databricks cluster and as a Databricks job run in your remote workspace. See What is the Databricks extension for Visual Studio Code?.

What will you do in this tutorial?

In this hands-on tutorial, you do the following:

  • Create a Databricks cluster to run your local Python code on.

  • Install Visual Studio Code and the Databricks extension for Visual Studio Code.

  • Set up Databricks authentication and configure the Databricks extension for Visual Studio Code with this information.

  • Configure the Databricks extension for Visual Studio Code with information about your remote cluster, and have the extension start the cluster.

  • Configure the Databricks extension for Visual Studio Code with the location in your remote Databricks workspace to upload your local Python code to, and have the extension start listening for code upload events.

  • Write and save some Python code, which triggers a code upload event.

  • Use the Databricks extension for Visual Studio Code to run the uploaded code on your remote cluster, and then to run it as a remote job run on that cluster.

This tutorial covers only how to run a Python code file and only how to set up OAuth user-to-machine (U2M) authentication. To learn how to debug Python code files, run and debug notebooks, and set up other authentication types, see Next steps.

Note

The following tutorial uses the Databricks extension for Visual Studio Code, version 1. To complete this tutorial for the Databricks extension for Visual Studio Code, version 2, currently in Private Preview, skip ahead to VSCode extension for Databricks, version 2 tutorial: Run Python on a cluster and as a job.

Step 1: Create a cluster

If you already have a remote Databricks cluster that you want to use, make a note of the cluster’s name, and skip ahead to Step 2 to install Visual Studio Code. To view your available clusters, in your workspace’s sidebar, click Compute.

Databricks recommends that you create a Personal Compute cluster to get started quickly. To create this cluster, do the following:

  1. In your Databricks workspace, on the sidebar, click Compute.

  2. Click Create with Personal Compute.

  3. Click Create compute.

  4. Make a note of your cluster’s name, as you will need it later in Step 5 when you add cluster information to the extension.

Step 2: Install Visual Studio Code

To install Visual Studio Code, follow the instructions for macOS, Linux, or Windows.

If you already have Visual Studio Code installed, check whether it is version 1.69.1 or above. To do this, in Visual Studio Code, on the main menu, click Code > About Visual Studio Code for macOS or Help > About for Linux or Windows.

To update Visual Studio Code, on the main menu, click Code > Check for Updates for macOS or Help > Check for Updates for Linux or Windows.

Step 3: Install the Databricks extension

Install the Visual Studio Code extension
  1. In the Visual Studio Code sidebar, click the Extensions icon.

  2. In Search Extensions in Marketplace, enter Databricks.

  3. In the entry labeled Databricks with the subtitle IDE support for Databricks by Databricks, click Install.

Step 4: Set up Databricks authentication

In this step, you enable authentication between the Databricks extension for Visual Studio Code and your remote Databricks workspace, as follows:

  1. From Visual Studio Code, open an empty folder on your local development machine that you will use to contain the Python code that you will create and run later in Step 7. To do this, on the main menu, click File > Open Folder and follow the on-screen directions.

  2. On the Visual Studio Code sidebar, click the Databricks logo icon.

  3. In the Configuration pane, click Configure Databricks.

  4. In the Command Palette, for Databricks Host, enter your workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com. Then press Enter.

  5. Select OAuth (user to machine).

  6. Complete the on-screen instructions in your web browser to finish authenticating with Databricks. If prompted, allow all-apis access.
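The Databricks Host value in step 4 should be just the https://<instance-name> origin of your workspace URL, with no trailing path or query string. If you are copying the URL from a browser tab, a small sketch like the following (a hypothetical helper, not part of the extension) shows how to strip the extras before pasting:

```python
from urllib.parse import urlparse

def workspace_host(raw: str) -> str:
    """Normalize a pasted workspace URL to the bare https://<instance>
    form that the Databricks Host prompt expects.
    (Hypothetical helper, not part of the extension.)"""
    if "://" not in raw:
        raw = "https://" + raw  # allow a bare hostname paste
    return "https://" + urlparse(raw).netloc

print(workspace_host("https://1234567890123456.7.gcp.databricks.com/?o=1234567890123456"))
# -> https://1234567890123456.7.gcp.databricks.com
```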

Step 5: Add cluster information to the Databricks extension and start the cluster

  1. With the Configuration pane already open from the previous step, where you set up authentication, next to Cluster, click the gear (Configure cluster) icon.

  2. In the Command Palette, select the name of the cluster that you created in Step 1.

  3. Start the cluster, if it is not already started: next to Cluster, if the play (Start Cluster) icon is visible, click it.

Start the cluster

Step 6: Add the code upload location to the Databricks extension and start the upload listener

  1. With the Configuration pane already open from the previous step, where you added cluster information, next to Sync Destination, click the gear (Configure sync destination) icon.

  2. In the Command Palette, select Create New Sync Destination.

  3. Press Enter to confirm the generated remote upload directory name.

  4. Start the upload listener, if it is not already started: next to Sync Destination, if the arrowed circle (Start synchronization) icon is visible, click it.

Start the upload listener

Step 7: Create and run Python code

  1. Create a local Python code file: on the sidebar, click the folder (Explorer) icon.

  2. On the main menu, click File > New File. Name the file demo.py and save it to the project’s root.

  3. Add the following code to the file and then save it. This code creates and displays the contents of a basic PySpark DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    spark = SparkSession.builder.getOrCreate()
    
    schema = StructType([
       StructField('CustomerID', IntegerType(), False),
       StructField('FirstName',  StringType(),  False),
       StructField('LastName',   StringType(),  False)
    ])
    
    data = [
       [ 1000, 'Mathijs', 'Oosterhout-Rijntjes' ],
       [ 1001, 'Joost',   'van Brunswijk' ],
       [ 1002, 'Stan',    'Bokenkamp' ]
    ]
    
    customers = spark.createDataFrame(data, schema)
    customers.show()
    
    # Output:
    #
    # +----------+---------+-------------------+
    # |CustomerID|FirstName|           LastName|
    # +----------+---------+-------------------+
    # |      1000|  Mathijs|Oosterhout-Rijntjes|
    # |      1001|    Joost|      van Brunswijk|
    # |      1002|     Stan|          Bokenkamp|
    # +----------+---------+-------------------+
    
  4. In the Explorer view, right-click the demo.py file, and then click Upload and Run File on Databricks. The output appears in the Debug Console pane.

Upload and Run File on Databricks
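If you want to sanity-check the sample data before uploading, you can do so with plain Python on your local machine; no Spark installation is needed. This is an optional side check, not a tutorial step, and the assertions below simply mirror the StructType schema defined in demo.py:

```python
# Quick local sanity check of the sample data from demo.py, using plain
# Python only (no Spark needed). Optional side exercise, not a tutorial step.
columns = ("CustomerID", "FirstName", "LastName")
data = [
    (1000, "Mathijs", "Oosterhout-Rijntjes"),
    (1001, "Joost", "van Brunswijk"),
    (1002, "Stan", "Bokenkamp"),
]

for row in data:
    # Mirror the schema: a non-null int ID and two non-null strings.
    assert isinstance(row[0], int)
    assert all(isinstance(v, str) and v for v in row[1:])

print(f"{len(data)} rows match the {len(columns)}-column schema")
```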

Step 8: Run the code as a job

In the previous step, you ran your Python code directly on the remote cluster. In this step, you initiate a workflow that uses the cluster to run the code as a Databricks job instead. See What is Databricks Jobs?

To run this code as a job, in the Explorer view, right-click the demo.py file, and then click Run File as Workflow on Databricks. The output appears in a separate editor tab next to the demo.py file editor.

Run File as Workflow on Databricks
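Run File as Workflow on Databricks starts a one-time job run in your workspace. For context, a conceptually similar one-time run could be described with a Jobs API runs/submit payload such as the following sketch; the cluster ID and workspace path here are placeholders, not values from this tutorial:

```json
{
  "run_name": "demo.py workflow run",
  "tasks": [
    {
      "task_key": "demo",
      "existing_cluster_id": "<your-cluster-id>",
      "spark_python_task": {
        "python_file": "/Workspace/<your-sync-destination>/demo.py"
      }
    }
  ]
}
```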

You have reached the end of this tutorial.

Next steps

Now that you have successfully used the Databricks extension for Visual Studio Code to upload a local Python file and run it remotely, learn more about how to use the extension:

VSCode extension for Databricks, version 2 tutorial: Run Python on a cluster and as a job

Note

This tutorial uses the Databricks extension for Visual Studio Code, version 2, which is in Private Preview.

Step 1: Create a cluster

If you already have a remote Databricks cluster that you want to use, make a note of the cluster’s name, and skip ahead to Step 2 where you will install Visual Studio Code. To view your available clusters, in your workspace’s sidebar, click Compute.

Databricks recommends that you create a Personal Compute cluster to get started quickly. To create this cluster, do the following:

  1. In your Databricks workspace, on the sidebar, click Compute.

  2. Click Create with Personal Compute.

  3. Click Create compute.

  4. Make a note of your cluster’s name, as you will need it later in Step 5 where you add cluster information to the extension.

Step 2: Install Visual Studio Code

To install Visual Studio Code, follow the instructions for macOS, Linux, or Windows.

If you already have Visual Studio Code installed, check whether it is version 1.69.1 or above. To do this, in Visual Studio Code, on the main menu, click Code > About Visual Studio Code for macOS or Help > About for Linux or Windows.

To update Visual Studio Code, on the main menu, click Code > Check for Updates for macOS or Help > Check for Updates for Linux or Windows.

Step 3: Install the Databricks extension

Install the Visual Studio Code extension
  1. In the Visual Studio Code sidebar, click the Extensions icon.

  2. In Search Extensions in Marketplace, enter Databricks.

  3. In the entry labeled Databricks with the subtitle IDE support for Databricks by Databricks, click the down arrow next to Install, and then click Install Pre-Release Version.

  4. Do one of the following:

    • If you have not yet accepted the terms and conditions regarding this preview, click Contact us and follow the on-screen instructions to send a request to Databricks. You may not use this preview until you have accepted the Databricks terms and conditions.

    • If you have already accepted the terms and conditions regarding this preview, click Continue if you are already enrolled, and then click Reload Required or restart Visual Studio Code.

Step 4: Set up Databricks authentication

In this step, you enable authentication between the Databricks extension for Visual Studio Code and your remote Databricks workspace, as follows:

  1. From Visual Studio Code, click File > Open Folder and open an empty folder on your local development machine.

  2. On the sidebar, click the Databricks logo icon.

  3. In the Configuration pane, click Initialize Project.

    Initialize New Project
  4. In the Command Palette, for Databricks Host, enter your workspace instance URL, for example https://1234567890123456.7.gcp.databricks.com. Then press Enter.

  5. If this list already contains an authentication configuration profile labeled Authenticate using OAuth (User to Machine) that you know corresponds to the target Databricks host, select it from the list, and then do the following:

    1. If prompted, complete any on-screen instructions in your web browser to finish authenticating with Databricks.

    2. If also prompted, allow all-apis access.

    3. Skip ahead to Step 5 where you will add cluster information to the extension.

  6. For Select authentication method, select OAuth (user to machine).

    Note

    Databricks recommends that you select OAuth (user to machine) to get started quickly. To use other authentication types, see Authentication setup for the Databricks extension for Visual Studio Code.

  7. Enter a name for the associated Databricks authentication profile.

  8. In the Configuration pane, click Login to Databricks.

    Login to Databricks
  9. In the Command Palette, for Select authentication method, select the name of the authentication configuration profile that you just created.

  10. If prompted, complete any on-screen instructions in your web browser to finish authenticating with Databricks. If also prompted, allow all-apis access.

  11. After you have successfully logged in, return to Visual Studio Code.
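The Databricks extension and the Databricks CLI share authentication configuration profiles stored in the .databrickscfg file in your home directory. As a rough sketch (the profile name and host below are examples, and the exact fields the tooling writes can vary), an OAuth U2M profile entry looks something like:

```ini
[my-oauth-profile]
host      = https://1234567890123456.7.gcp.databricks.com
auth_type = databricks-cli
```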

Step 5: Add cluster information to the Databricks extension and start the cluster

  1. With the Configuration pane already open from Step 4 where you set up authentication, click Select a cluster and then click the gear (Configure cluster) icon.

    Configure cluster
  2. In the Command Palette, select the name of the cluster that you created previously in Step 1.

  3. Start the cluster, if it is not already started: click Cluster and, if the play (Start Cluster) icon is visible, click it.

Step 6: Create and run Python code

  1. Create a local Python code file: on the sidebar, click the folder (Explorer) icon.

  2. On the main menu, click File > New File. Name the file demo.py and save it to the project’s root.

  3. Add the following code to the file and then save it. This code creates and displays the contents of a basic PySpark DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    spark = SparkSession.builder.getOrCreate()
    
    schema = StructType([
       StructField('CustomerID', IntegerType(), False),
       StructField('FirstName',  StringType(),  False),
       StructField('LastName',   StringType(),  False)
    ])
    
    data = [
       [ 1000, 'Mathijs', 'Oosterhout-Rijntjes' ],
       [ 1001, 'Joost',   'van Brunswijk' ],
       [ 1002, 'Stan',    'Bokenkamp' ]
    ]
    
    customers = spark.createDataFrame(data, schema)
    customers.show()
    
    # Output:
    #
    # +----------+---------+-------------------+
    # |CustomerID|FirstName|           LastName|
    # +----------+---------+-------------------+
    # |      1000|  Mathijs|Oosterhout-Rijntjes|
    # |      1001|    Joost|      van Brunswijk|
    # |      1002|     Stan|          Bokenkamp|
    # +----------+---------+-------------------+
    
  4. In the Explorer view, right-click the demo.py file, and then click Run on Databricks > Upload and Run File. The output appears in the Debug Console pane.

    Upload and run file from context menu

    Tip

    Another way to do this is to click the Run on Databricks icon next to the list of editor tabs, and then click Upload and Run File.

    Upload and run file from icon

Step 7: Run the code as a job

In the previous step, you ran your Python code directly on the remote cluster. In this step, you initiate a workflow that uses the cluster to run the code as a Databricks job instead. See What is Databricks Jobs?

To run this code as a job, in the Explorer view, right-click the demo.py file, and then click Run on Databricks > Run File as Workflow. The output appears in a separate editor tab next to the demo.py file editor.

Run file as workflow from context menu

Tip

Another way to do this is to click the Run on Databricks icon next to the list of editor tabs, and then click Run File as Workflow.

Run file as workflow from icon

You have reached the end of this tutorial.

Next steps

Now that you have successfully used the Databricks extension for Visual Studio Code to upload a local Python file and run it remotely, learn how to enable PySpark and Databricks Utilities code completion, run or debug Python code with Databricks Connect, use Databricks Asset Bundles, run a file or a notebook as a Databricks job, run tests with pytest, use environment variable definitions files, create custom run configurations, and more. See Development tasks for the Databricks extension for Visual Studio Code.