Data modeling

This article introduces considerations, caveats, and recommendations for data modeling on Databricks, along with an introduction to the foundational concepts of database design on the platform. It is intended for users who are setting up new tables or authoring ETL workloads, with an emphasis on the Databricks behaviors that influence how raw data is transformed into a new data model. Data modeling decisions depend on how your organization and workloads use tables, and the data model you choose affects query performance, compute costs, and storage costs.

Important

This article applies exclusively to tables backed by Delta Lake, which include all Unity Catalog managed tables.

You can use Databricks to query other external data sources, including tables registered with Lakehouse Federation. Each external data source has different limitations, semantics, and transactional guarantees. See Query data.

Database management concepts

A lakehouse built with Databricks shares many components and concepts with other enterprise data warehousing systems. Consider the following concepts and features while designing your data model.

Transactions on Databricks

Databricks scopes transactions to individual tables. This means that Databricks does not support multi-table transactions (also called multi-statement transactions).

For data modeling workloads, this means that ingesting a source record that requires inserting or updating rows in two or more tables must be performed as multiple independent transactions. Each of these transactions can succeed or fail independently of the others, and downstream queries must tolerate state mismatches caused by failed or delayed transactions.
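As a minimal sketch of this pattern, suppose a batch of staged order events must update both an orders table and a customer profiles table. The catalog, table, and column names below are hypothetical, and the sketch assumes a Databricks or other Delta Lake-enabled Spark environment. Because each MERGE is its own transaction, the two commits succeed or fail independently:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical staging data that carries both order and customer attributes.
staged = spark.read.table("staging.order_events")

# Transaction 1: upsert order rows.
orders = DeltaTable.forName(spark, "sales.orders")
(
    orders.alias("t")
    .merge(staged.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdate(set={"amount": "s.amount", "updated_at": "s.updated_at"})
    .whenNotMatchedInsert(values={
        "order_id": "s.order_id",
        "customer_id": "s.customer_id",
        "amount": "s.amount",
        "updated_at": "s.updated_at",
    })
    .execute()
)

# Transaction 2: upsert customer rows. This commit is independent of the one
# above; if it fails, sales.orders still reflects the first transaction, and
# downstream readers may briefly see the two tables out of sync.
customer_updates = (
    staged.select("customer_id", "email", "updated_at")
    .dropDuplicates(["customer_id"])
)
customers = DeltaTable.forName(spark, "sales.customer_profiles")
(
    customers.alias("t")
    .merge(customer_updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(set={"email": "s.email", "updated_at": "s.updated_at"})
    .whenNotMatchedInsert(values={
        "customer_id": "s.customer_id",
        "email": "s.email",
        "updated_at": "s.updated_at",
    })
    .execute()
)
```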

Primary and foreign keys on Databricks

Primary and foreign keys are informational and not enforced. This model is common in many enterprise cloud-based database systems, but differs from many traditional relational database systems. See Constraints on Databricks.
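As an illustrative sketch, you can declare primary and foreign key constraints so they appear in the catalog and are available to tools and the optimizer, but Databricks does not reject rows that violate them. The example assumes Unity Catalog managed tables, and the catalog, schema, and table names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dimension and fact tables with informational key constraints.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.customers (
        customer_id BIGINT NOT NULL,
        email STRING,
        CONSTRAINT customers_pk PRIMARY KEY (customer_id)
    )
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id BIGINT NOT NULL,
        customer_id BIGINT,
        amount DECIMAL(18, 2),
        CONSTRAINT orders_pk PRIMARY KEY (order_id),
        CONSTRAINT orders_customers_fk FOREIGN KEY (customer_id)
            REFERENCES main.sales.customers (customer_id)
    )
""")

# The constraints are informational only: this insert succeeds even though
# customer_id 999 has no matching row in main.sales.customers.
spark.sql("INSERT INTO main.sales.orders VALUES (1, 999, 25.00)")
```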

Joins on Databricks

Joins can introduce processing bottlenecks in any database design. When processing data on Databricks, the query optimizer seeks to produce an efficient plan for joins, but it can struggle when an individual query must join results from many tables. The optimizer can also fail to skip records in a table when the filter predicates reference a field in another table, which can result in a full table scan.

See Work with joins on Databricks.
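The following sketch illustrates the filtering behavior described above, using hypothetical fact and dimension tables. When the only filter targets a dimension column, file-level statistics on the fact table cannot help prune files; denormalizing the filtered attribute onto the fact table lets the same predicate apply directly to a fact column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.table("main.sales.orders")        # hypothetical fact table
customers = spark.read.table("main.sales.customers")  # hypothetical dimension

# The filter is on a dimension column, so statistics on the fact table's files
# cannot be used to skip order data; the join may scan the full fact table.
high_value_region = (
    orders.join(customers, "customer_id")
    .where(F.col("region") == "EMEA")
)

# If the region is denormalized onto the fact table, the same filter applies
# directly to a fact column, which lets the optimizer use file-level
# statistics to prune files before the join runs.
pruned = (
    orders.where(F.col("order_region") == "EMEA")
    .join(customers, "customer_id")
)
```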

Working with nested and complex data types

Databricks supports working with semi-structured data sources, including JSON, Avro, and Protobuf, and storing complex data as structs, JSON strings, maps, and arrays. See Model semi-structured data.
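As a minimal sketch of these types, the following example parses a JSON string column into a struct with a declared schema, keeps an array field, and adds a map column. The event data and field names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw events that arrive with a JSON payload column.
raw = spark.createDataFrame(
    [("evt-1", '{"device": {"os": "ios", "version": "17.2"}, "tags": ["beta", "eu"]}')],
    ["event_id", "payload"],
)

payload_schema = StructType([
    StructField("device", StructType([
        StructField("os", StringType()),
        StructField("version", StringType()),
    ])),
    StructField("tags", ArrayType(StringType())),
])

events = (
    raw
    # Parse the JSON string into a struct so nested fields can be queried directly.
    .withColumn("parsed", F.from_json("payload", payload_schema))
    # Store additional key-value attributes as a map column.
    .withColumn("attributes", F.create_map(F.lit("source"), F.lit("mobile")))
)

events.select(
    "event_id",
    F.col("parsed.device.os").alias("device_os"),
    F.col("parsed.tags").alias("tags"),
    "attributes",
).show(truncate=False)
```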

Normalized data models

Databricks can work well with any data model. If you have an existing data model that you need to query from or migrate to Databricks, you should evaluate performance before rearchitecting your data.

If you are architecting a new lakehouse or adding datasets to an existing environment, Databricks recommends against using a heavily normalized model such as third normal form (3NF).

Models like the star schema or snowflake schema perform well on Databricks, as there are fewer joins present in standard queries and fewer keys to keep in sync. In addition, having more data fields in a single table allows the query optimizer to skip large amounts of data using file-level statistics. For more on data skipping, see Data skipping for Delta Lake.
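The following sketch shows a typical star schema query under these assumptions, using hypothetical fact and dimension tables. Because the date filter targets a column stored on the wide fact table itself, Delta Lake file-level statistics can prune files before the single dimension join runs:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical star schema: one wide fact table plus a small dimension table.
fact_sales = spark.read.table("main.sales.fact_sales")
dim_store = spark.read.table("main.sales.dim_store")

monthly_by_region = (
    fact_sales
    # Filter on a fact-table column so data skipping can prune files.
    .where(F.col("order_date").between("2024-06-01", "2024-06-30"))
    # A single join to the dimension supplies the reporting attribute.
    .join(dim_store, "store_key")
    .groupBy("region")
    .agg(F.sum("net_amount").alias("net_sales"))
)
```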