Databricks architecture overview

This article provides a high-level overview of the Databricks architecture on Google Cloud, including its enterprise architecture.

Control plane and compute plane

Databricks is structured to enable secure collaboration across cross-functional teams. Databricks manages many of the backend services, so you can stay focused on your data science, data analytics, and data engineering tasks.

Databricks operates across two planes: a control plane and a compute plane.

  • The control plane includes the backend services that Databricks manages in your Databricks account. Notebook commands and many other workspace configurations are stored in the control plane and encrypted at rest.

  • The compute plane is where your data is processed by compute resources such as clusters. The compute plane’s network (VPC) and its compute resources are part of your organization’s Google Cloud resources. Use Databricks connectors to connect clusters to data sources outside of your Google Cloud resources, either to ingest data or to use them for storage. You can also ingest data from external streaming sources, such as event data and IoT data. See Connect to data sources.

    Note

    Previously, Databricks referred to the compute plane as the data plane.

To configure the networks for your compute plane, see Compute plane networking.

Your data lake is stored at rest in your Google Cloud resources and in your own data sources, so you maintain control and ownership of your data.

Job results reside in storage in your Google Cloud resources. Interactive notebook results are stored in a combination of the control plane (partial results for presentation in the UI) and your Google Cloud storage. If you want interactive notebook results stored only in your Google Cloud resources, you can configure the storage location for interactive notebook results. See Configure the storage location for interactive notebook results. Note that some metadata about results, such as chart column names, continues to be stored in the control plane.
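The storage rules above can be summarized as a small decision function. The following Python sketch is purely illustrative: it is not a Databricks API, and the names Location, result_locations, customer_storage_only, and interactive_metadata_location are hypothetical, introduced here only to restate the paragraph as code.

```python
from enum import Enum
from typing import FrozenSet


class Location(Enum):
    """Where result data can reside, per the description above (hypothetical names)."""
    CONTROL_PLANE = "Databricks control plane"
    CUSTOMER_STORAGE = "your Google Cloud storage"


def result_locations(kind: str, customer_storage_only: bool = False) -> FrozenSet[Location]:
    """Return where result data is stored for a given kind of workload.

    kind is "job" or "interactive". customer_storage_only models the optional
    setting that keeps interactive notebook results only in your Google Cloud
    resources; it has no effect on job results.
    """
    if kind == "job":
        # Job results reside in storage in your Google Cloud resources.
        return frozenset({Location.CUSTOMER_STORAGE})
    if kind == "interactive":
        if customer_storage_only:
            # Configured so results are stored only in your Google Cloud resources.
            return frozenset({Location.CUSTOMER_STORAGE})
        # Default: partial results for the UI in the control plane, plus your storage.
        return frozenset({Location.CONTROL_PLANE, Location.CUSTOMER_STORAGE})
    raise ValueError(f"unknown result kind: {kind!r}")


def interactive_metadata_location() -> Location:
    """Some result metadata (such as chart column names) stays in the control plane."""
    return Location.CONTROL_PLANE


if __name__ == "__main__":
    for kind, only in [("job", False), ("interactive", False), ("interactive", True)]:
        places = sorted(loc.value for loc in result_locations(kind, only))
        print(f"{kind} (customer_storage_only={only}): {places}")
```

The sketch deliberately keeps metadata separate from result data, because the storage-location setting moves the results but not the metadata.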

High-level architecture

The following diagram represents the flow of data for Databricks on Google Cloud:

Diagram: Databricks on GCP architecture