What is Databricks Connect?

Note

This article covers Databricks Connect for Databricks Runtime 13.0 and above.

For information about the legacy version of Databricks Connect, see Databricks Connect for Databricks Runtime 12.2 LTS and below.

Overview

Databricks Connect allows you to connect popular IDEs such as Visual Studio Code, PyCharm, RStudio Desktop, and IntelliJ IDEA, as well as notebook servers and other custom applications, to Databricks clusters. This article explains how Databricks Connect works.

Databricks Connect is a client library for the Databricks Runtime. It allows you to write code using Spark APIs and run it remotely on a Databricks cluster instead of in a local Spark session.

For example, when you run the DataFrame command spark.read.format(...).load(...).groupBy(...).agg(...).show() using Databricks Connect, the logical representation of the command is sent to the Spark server running in Databricks for execution on the remote cluster.
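A minimal sketch of such a command in Python follows. It assumes the `databricks-connect` package is installed and that connection settings for the workspace are already configured. The table name `samples.nyctaxi.trips`, its `pickup_zip` column, and the helper names `get_session` and `trips_per_zip` are assumptions made for illustration.

```python
def get_session():
    """Create a Spark session backed by a remote Databricks cluster."""
    # Imported lazily so this module still loads where databricks-connect
    # is not installed.
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.getOrCreate()


def trips_per_zip(spark, table_name="samples.nyctaxi.trips"):
    """Describe a query that counts trips per pickup ZIP code.

    Nothing is executed here: the chained calls only build up the
    logical representation of the command on the client.
    """
    return (
        spark.read.table(table_name)  # assumed sample table
        .groupBy("pickup_zip")        # assumed column name
        .count()
    )
```

With a configured workspace, calling `trips_per_zip(get_session()).show(5)` would send the command to the cluster for execution and print the first five result rows locally.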

With Databricks Connect, you can:

  • Run large-scale Spark code from any Python, R, or Scala application. Anywhere you can import pyspark for Python, library(sparklyr) for R, or import org.apache.spark for Scala, you can now run Spark code directly from your application, without needing to install any IDE plugins or use Spark submission scripts.

    Note

    Databricks Connect for Databricks Runtime 13.0 and above supports running Python applications. R and Scala are supported only in Databricks Connect for Databricks Runtime 13.3 LTS and above.

  • Step through and debug code in your IDE even when working with a remote cluster.

  • Iterate quickly when developing libraries. You do not need to restart the cluster after changing Python or Scala library dependencies in Databricks Connect, because each client session is isolated from the others in the cluster.

  • Shut down idle clusters without losing work. Because the client application is decoupled from the cluster, it is unaffected by cluster restarts or upgrades, which would normally cause you to lose all the variables, RDDs, and DataFrame objects defined in a notebook.

For Databricks Runtime 13.0 and above, Databricks Connect is now built on open-source Spark Connect. Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. With this “V2” architecture based on Spark Connect, Databricks Connect becomes a thin client that is simple and easy to use. Spark Connect can be embedded everywhere to connect to Databricks: in IDEs, notebooks, and applications, allowing individual users and partners alike to build new (interactive) user experiences based on the Databricks platform. For more information about Spark Connect, see Introducing Spark Connect.
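To make the idea of a thin client that ships a plan to a server more concrete, here is a toy model in plain Python. It is not Spark Connect's actual protocol or API. The `ToyServer` and `ToyDataFrame` classes and the `read_table` helper are hypothetical names invented for this illustration.

```python
class ToyServer:
    """Stands in for the remote cluster: receives a plan and executes it."""

    def __init__(self, tables):
        self.tables = tables      # table name -> list of row dicts
        self.received_plans = []  # every plan a client has sent

    def execute(self, plan):
        self.received_plans.append(plan)
        rows = []
        for op, arg in plan:
            if op == "read":
                rows = list(self.tables[arg])
            elif op == "filter":
                column, value = arg
                rows = [r for r in rows if r[column] == value]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows


class ToyDataFrame:
    """Client-side handle: holds a plan (a tuple of steps), never the data."""

    def __init__(self, server, plan=()):
        self.server = server
        self.plan = plan

    def _with(self, step):
        # Each transformation returns a new handle with a longer plan.
        return ToyDataFrame(self.server, self.plan + (step,))

    def filter(self, column, value):
        return self._with(("filter", (column, value)))

    def select(self, *columns):
        return self._with(("select", columns))

    def collect(self):
        # Only an action sends the accumulated plan to the server.
        return self.server.execute(self.plan)


def read_table(server, name):
    return ToyDataFrame(server, (("read", name),))


server = ToyServer({"trips": [
    {"zip": "10001", "fare": 12.5},
    {"zip": "10002", "fare": 8.0},
    {"zip": "10001", "fare": 20.0},
]})

df = read_table(server, "trips").filter("zip", "10001").select("fare")
print(len(server.received_plans))  # 0: nothing has been sent yet

rows = df.collect()
print(rows)  # [{'fare': 12.5}, {'fare': 20.0}]
```

The client in this sketch never touches table data until `collect()` is called; it only accumulates a description of the work, which is the property that lets a client stay thin.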

Databricks Connect determines where your code runs and where it is debugged, as shown in the following figure.

Figure showing where Databricks Connect code runs and debugs

For running code: All code runs locally, except code involving DataFrame operations, which runs on the cluster in the remote Databricks workspace. Run responses are sent back to the local caller.

For debugging code: All code is debugged locally, while all Spark code continues to run on the cluster in the remote Databricks workspace. The core Spark engine code cannot be debugged directly from the client.
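The split between local and remote execution can be sketched as follows. This is an illustrative example under stated assumptions: the function name `even_id_total` is hypothetical, and `spark` is assumed to be a session created with Databricks Connect.

```python
def even_id_total(spark, n=10):
    # Local: ordinary Python that runs, and can hit breakpoints, on your machine.
    limit = int(n)

    # Remote: these DataFrame calls only describe the work. When collect()
    # is called, the cluster executes it and returns the resulting rows.
    rows = spark.range(limit).filter("id % 2 = 0").collect()

    # Local again: the returned rows are plain objects in the local process.
    return sum(row["id"] for row in rows)
```

A breakpoint on the `return` line lets you inspect `rows` in your IDE. The filtering itself ran on the cluster, so it cannot be stepped into from the client.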

Next steps