Introduction
Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. Delta Lake provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing on top of existing data lakes.
For a quick overview of Delta Lake and its benefits, watch this YouTube video (3 minutes).
Specifically, Delta Lake offers:
ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
Scalable metadata handling: Leverages Spark's distributed processing power to easily handle all the metadata for petabyte-scale tables with billions of files.
Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box.
Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion (see the first sketch after this list).
Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments (see the second sketch after this list).
Upserts and deletes: Supports merge, update, and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on (also shown in the second sketch after this list).
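The schema enforcement behavior can be seen in a small PySpark sketch. This is a minimal illustration, assuming a SparkSession named spark that is already configured for Delta Lake (as in a Databricks notebook or with the delta-spark package installed); the table path and column names are placeholders, not names used elsewhere in these docs.

```python
from pyspark.sql.utils import AnalysisException

path = "/tmp/delta/events"  # placeholder table location

# Create a Delta table with a fixed schema: (id: long, eventType: string).
spark.range(5).selectExpr("id", "'click' AS eventType") \
    .write.format("delta").mode("overwrite").save(path)

# An append whose DataFrame carries an extra column does not match the
# table schema, so Delta Lake rejects the write and the bad records
# never reach the table.
bad = spark.range(5).selectExpr("id", "'click' AS eventType", "'web' AS source")
try:
    bad.write.format("delta").mode("append").save(path)
except AnalysisException as err:
    print("Write rejected by schema enforcement:", err)
```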
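Time travel and upserts can be sketched in the same way. The version number, join key, and path below are assumptions for illustration; DeltaTable comes from the delta-spark Python package.

```python
from delta.tables import DeltaTable

path = "/tmp/delta/events"  # placeholder table from the previous sketch

# Time travel: read the table as of an earlier version (or a timestamp).
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()

# Upsert: merge new and changed rows into the table by key.
updates = spark.range(3, 8).selectExpr("id", "'view' AS eventType")
target = DeltaTable.forPath(spark, path)
(
    target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()     # rows whose id already exists are updated
    .whenNotMatchedInsertAll()  # new ids are inserted
    .execute()
)
```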
For a general introduction and demonstration of Delta Lake, watch this YouTube video (51 minutes).
Delta Engine optimizations make Delta Lake operations highly performant, supporting a variety of workloads ranging from large-scale ETL processing to ad-hoc, interactive queries. For information on Delta Engine, see Optimizations.
Quickstart
The Delta Lake quickstart provides an overview of the basics of working with Delta Lake. The quickstart shows how to load data into a Delta table, modify the table, read the table, display table history, and optimize the table.
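The same steps can be outlined in PySpark. This is a minimal sketch, assuming a Delta-enabled SparkSession named spark (as in a Databricks notebook); the table path and generated data are placeholders rather than the quickstart's own dataset.

```python
from delta.tables import DeltaTable

path = "/tmp/delta/quickstart"  # placeholder table location

# Load data into a Delta table.
spark.range(0, 100).write.format("delta").mode("overwrite").save(path)

# Modify the table (here by overwriting; updates and deletes also work).
spark.range(100, 200).write.format("delta").mode("overwrite").save(path)

# Read the table.
spark.read.format("delta").load(path).show(5)

# Display table history (one row per commit).
table = DeltaTable.forPath(spark, path)
table.history().select("version", "timestamp", "operation").show()

# Optimize the table by compacting small files; OPTIMIZE is available on
# Databricks and in recent open source Delta Lake releases.
spark.sql(f"OPTIMIZE delta.`{path}`")
```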
For Databricks notebooks that demonstrate these features, see Introductory notebooks.
To try out Delta Lake, see Sign up for Databricks.
Key tasks
The following list provides links to documentation for common Delta Lake tasks. Code sketches for several of these tasks follow the list.
Create a Delta table: quick start, as part of batch data tasks
Load and write data into a Delta Lake table:
With COPY INTO
With Auto Loader
With the Create Table UI in Databricks SQL
With streaming: quick start, as part of streaming
With third-party solutions: with partners, with third-party providers
Merge data updates and insertions (upserts): quick start, as part of table updates
Read data from a Delta table: quick start, as part of batch data tasks, as part of streaming
Optimize a Delta table: quick start, as part of bin packing, as part of Z-ordering, as part of file size tuning
Display the history of a Delta table: quick start, as part of data utilities
Clean up Delta table snapshots (vacuum): quick start, as part of data utilities
Work with Delta table columns:
Work with Delta table versions:
Query an earlier version of a Delta table (time travel): quick start, as part of batch data tasks
Work with Delta table metadata:
Learn about Delta Lake concurrency control (ACID transactions)
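As a companion to the loading links above, the following PySpark sketch shows the general shape of COPY INTO, Auto Loader, and a streaming read of a Delta table. The source, schema, and checkpoint paths and the file format are placeholders, and COPY INTO and Auto Loader (cloudFiles) are Databricks features; treat this as an outline rather than a complete pipeline.

```python
# Batch ingestion with COPY INTO (run as SQL); assumes the target Delta
# table at /tmp/delta/events already exists.
spark.sql("""
    COPY INTO delta.`/tmp/delta/events`
    FROM '/tmp/raw/events'
    FILEFORMAT = JSON
""")

# Incremental ingestion with Auto Loader, written out as a stream.
ingest = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/delta/_schemas/events")
    .load("/tmp/raw/events")
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
    .start("/tmp/delta/events")
)

# The same Delta table also serves as a streaming source for downstream jobs.
changes = spark.readStream.format("delta").load("/tmp/delta/events")
```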
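For the maintenance tasks (optimize, display history, vacuum), a sketch along these lines applies; the path, the Z-order column, and the retention window are placeholders.

```python
from delta.tables import DeltaTable

path = "/tmp/delta/events"  # placeholder table location

# Compact small files and co-locate data by a frequently filtered column
# (Z-ordering); available on Databricks and in recent Delta Lake releases.
spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (eventType)")

table = DeltaTable.forPath(spark, path)

# Display the history of the table (one row per commit).
table.history().show(truncate=False)

# Clean up snapshots: remove files no longer needed by versions within the
# retention window (168 hours = 7 days, the default).
table.vacuum(168)
```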
Resources
For answers to frequently asked questions, see Frequently asked questions (FAQ).
For reference information on Delta Lake SQL commands, see Delta Lake statements.
For further resources, including blog posts, talks, and examples, see Delta Lake resources.
For deep-dive training on Delta Lake, watch this YouTube video (2 hours, 42 minutes).