Delta Lake is an open source storage layer that brings reliability to data lakes. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Databricks lets you configure Delta Lake based on your workload patterns.
Databricks adds optimized layouts and indexes to Delta Lake for fast interactive queries.
This guide provides an introductory overview, quickstarts, and guidance for using Delta Lake on Databricks.
Specifically, Delta Lake offers:
ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
Scalable metadata handling: Leverages Spark's distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.
Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box.
Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
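A minimal sketch of two of these features, ACID writes and time travel, using the Python API. This assumes a Spark session with Delta Lake enabled (for example via the `delta-spark` package); the table path is hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes Delta Lake is available (e.g. `pip install delta-spark`).
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

path = "/tmp/delta/events"  # hypothetical table location

# ACID writes: each append commits as one atomic transaction.
spark.range(0, 5).write.format("delta").mode("append").save(path)
spark.range(5, 10).write.format("delta").mode("append").save(path)

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # rows from the first commit only
```

Each append above creates a new table version, which is what makes the `versionAsOf` read and rollbacks possible.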
Many Databricks Runtime optimizations require Delta Lake. Delta Lake operations are highly performant, supporting a variety of workloads ranging from large-scale ETL processing to ad-hoc, interactive queries. For information on optimizations on Databricks, see Optimizations and performance recommendations on Databricks.
The following list provides links to documentation for common Delta Lake tasks.
Load and write data into a Delta Lake table:
Merge data updates and insertions (upserts) as part of table updates:
Work with Delta table columns:
Work with Delta table versions:
Work with Delta table metadata:
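As one example from the tasks above, an upsert with the Python `DeltaTable` merge API might look like the following sketch. It assumes an existing Delta table at a hypothetical path and a Spark session with Delta Lake configured:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Assumes a Spark session with Delta Lake already configured.
spark = SparkSession.builder.getOrCreate()

# Hypothetical target table and updates DataFrame with matching schemas.
target = DeltaTable.forPath(spark, "/tmp/delta/people")
updates = spark.createDataFrame([(1, "Ada"), (4, "Grace")], ["id", "name"])

# Upsert: update matched rows, insert the rest, in one ACID transaction.
(
    target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

The same upsert can be expressed in SQL with `MERGE INTO`; both paths produce a single new table version.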
For Delta Lake-specific SQL statements, see Delta Lake statements.
Databricks ensures binary compatibility with Delta Lake APIs in Databricks Runtime. To view the Delta Lake API version packaged in each Databricks Runtime version, see the Delta Lake API compatibility matrix. Delta Lake APIs exist for Python, Scala, and Java:
For answers to frequently asked questions, see Frequently asked questions (FAQ).
For reference information on Delta Lake SQL commands, see Delta Lake statements.
The Delta Lake transaction log has a well-defined open protocol that can be used by any system to read the log. See Delta Transaction Log Protocol.
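Because the protocol is open, the log can be inspected with plain JSON tooling and no Spark at all. The sketch below writes a minimal, hypothetical commit file and reads its actions back, illustrating the `_delta_log` layout of zero-padded commit files containing newline-delimited JSON actions:

```python
import json
import os
import tempfile

# The transaction log lives under <table>/_delta_log/ as zero-padded,
# newline-delimited JSON commit files: 00000000000000000000.json, ...
table = tempfile.mkdtemp()
log_dir = os.path.join(table, "_delta_log")
os.makedirs(log_dir)

# A minimal hypothetical commit: protocol + metaData + one add action.
actions = [
    {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
    {"metaData": {"id": "demo", "format": {"provider": "parquet"}}},
    {"add": {"path": "part-00000.parquet", "size": 1024, "dataChange": True}},
]
with open(os.path.join(log_dir, f"{0:020d}.json"), "w") as f:
    for action in actions:
        f.write(json.dumps(action) + "\n")

# Any system can replay the log by reading commits in version order.
with open(os.path.join(log_dir, "00000000000000000000.json")) as f:
    replayed = [json.loads(line) for line in f]

data_files = [a["add"]["path"] for a in replayed if "add" in a]
print(data_files)  # ['part-00000.parquet']
```

Real commits carry more fields (partition values, stats, commit info), but this replay-in-version-order loop is the core of how any reader reconstructs table state from the log.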