Databricks architecture overview

The Databricks Unified Data Analytics Platform, from the original creators of Apache Spark, enables data teams to collaborate to solve some of the world’s toughest problems.

High-level architecture

Databricks is structured to enable secure cross-functional team collaboration. Databricks manages many of the backend services so you can stay focused on your data science, data analytics, and data engineering tasks.

Databricks operates across two planes: a control plane and a data plane.

  • The control plane includes the backend services that Databricks manages in its own Google Cloud account. Notebook commands and many other workspace configurations are stored in the control plane and encrypted at rest.
  • The data plane is managed in your Google Cloud account and is where your data resides and is processed. You can use Databricks connectors so that your clusters can connect to external data sources outside of your Google Cloud account, either to ingest data or for storage. You can also ingest data from external streaming sources, such as event data and IoT data.

The following diagram represents the flow of data for Databricks on Google Cloud:

[Figure: Databricks architecture]

Your data always resides in your Google Cloud account in the data plane and in your own data sources, not in the control plane, so you maintain control and ownership of your data.

Job results reside in storage in your account. Interactive notebook results are stored in a combination of the control plane (partial results for presentation in the UI) and your Google Cloud storage.

Note

If you want interactive notebook results stored only in your cloud account storage, you can ask your Databricks representative to enable interactive notebook results in the customer account for your workspace. Note that some metadata about results, such as chart column names, continues to be stored in the control plane. This feature is in Public Preview.
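The storage rules described above can be summarized in a small lookup, which may help when reasoning about where a given artifact ends up. The following Python sketch is illustrative only: the function `storage_locations`, the artifact names, and the `results_in_customer_account` flag are hypothetical names invented for this example and are not part of any Databricks API or workspace setting. It simply encodes the statements in this section, including the optional behavior described in the note.

```python
from enum import Enum
from typing import Set


class Location(Enum):
    # Backend services that Databricks manages in its own Google Cloud account.
    CONTROL_PLANE = "control plane"
    # Storage in your own Google Cloud account (the data plane).
    CUSTOMER_STORAGE = "customer account storage"


def storage_locations(artifact: str,
                      results_in_customer_account: bool = False) -> Set[Location]:
    """Return where an artifact is stored, per the description in this section.

    All artifact names and the flag are hypothetical labels for illustration.
    """
    # Notebook commands, many workspace configurations, and some result
    # metadata (such as chart column names) live in the control plane.
    if artifact in ("notebook_commands", "workspace_configuration", "result_metadata"):
        return {Location.CONTROL_PLANE}

    # Your data and job results reside in storage in your own account.
    if artifact in ("customer_data", "job_results"):
        return {Location.CUSTOMER_STORAGE}

    # Interactive notebook results are split by default: partial results for
    # the UI in the control plane, the rest in your Google Cloud storage.
    if artifact == "interactive_notebook_results":
        if results_in_customer_account:
            # Assumes the Public Preview feature from the note is enabled.
            return {Location.CUSTOMER_STORAGE}
        return {Location.CONTROL_PLANE, Location.CUSTOMER_STORAGE}

    raise ValueError(f"Unknown artifact: {artifact!r}")


if __name__ == "__main__":
    for name in ("job_results", "interactive_notebook_results", "notebook_commands"):
        places = sorted(loc.value for loc in storage_locations(name))
        print(f"{name}: {', '.join(places)}")
```

Calling `storage_locations("interactive_notebook_results", results_in_customer_account=True)` models a workspace where the preview feature is enabled. Even then, `result_metadata` still maps to the control plane, matching the note's caveat about chart column names.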