Recommended workloads for Delta Live Tables
This article describes the recommended workloads for Delta Live Tables on Databricks.
Data ingestion
Delta Live Tables can ingest data from append-only sources and sources that contain changes, such as change data capture (CDC) feeds. Streaming tables support functionality for both types of source data.
Ingest data from append-only data sources
Streaming tables are recommended for ingesting append-only data. Append-only means that only new data is added to the source data, and existing data is never updated or deleted. Examples of append-only data include:
- A Delta table with the table property `delta.appendOnly = true`.
- A cloud storage location that receives new files periodically.
- A Kafka topic with events.
To learn more about ingesting data with Delta Live Tables and streaming tables, including examples, see Load data with Delta Live Tables.
Ingest data from a single append-only source
A streaming table can ingest data from any append-only data source.
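For example, the following Python sketch defines a streaming table that ingests new files incrementally with Auto Loader. The landing path is a hypothetical placeholder:

```python
import dlt

@dlt.table(comment="Raw events ingested incrementally from cloud storage.")
def raw_events():
    # A streaming read makes this a streaming table: each pipeline update
    # processes only files that have not been ingested yet.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/landing/events/")  # hypothetical path
    )
```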
Ingest data from multiple append-only sources
You can also ingest data from multiple append-only data sources into a streaming table. For example, you can write events from multiple Kafka topics into a single streaming table. To do this, define the query for the streaming table to read from one source, and for the other sources, use append flows.
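One way to structure this is to create the target streaming table explicitly and attach one append flow per source. The following sketch assumes hypothetical topic names and a hypothetical broker address:

```python
import dlt

# Create the target streaming table once; each flow appends to it independently.
dlt.create_streaming_table("kafka_events")

@dlt.append_flow(target="kafka_events")
def events_topic_a():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")  # hypothetical broker
        .option("subscribe", "topic_a")                      # hypothetical topic
        .load()
    )

@dlt.append_flow(target="kafka_events")
def events_topic_b():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")
        .option("subscribe", "topic_b")
        .load()
    )
```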
Ingest historical data from an append-only source
Use a backfill when you have an existing dataset that ingests from an append-only source and you want to append historical data to it exactly once. To do this, load the historical data using an append flow query. See backfill.
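As a sketch, assuming a hypothetical archive path and a runtime recent enough to support the `once` argument on append flows, a one-time backfill can look like this:

```python
import dlt

# Ongoing ingestion defines the streaming table.
@dlt.table
def events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/landing/events/")  # hypothetical path
    )

# One-time backfill of historical data into the same table. With once=True,
# the flow runs a single time instead of on every pipeline update.
@dlt.append_flow(target="events", once=True)
def events_backfill():
    return (
        spark.read.format("json")
        .load("/Volumes/main/archive/events_2023/")  # hypothetical path
    )
```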
Process change data feeds and database snapshots
Databricks recommends using Delta Live Tables to process change data feeds (CDF) that contain sequences of potentially out-of-order changes from one or more tables. Change data feeds are produced by Delta tables and by systems such as Debezium, Qlik Replicate, and AWS DMS. You can use either Python or SQL to process a change data feed with Delta Live Tables.
Databricks also recommends using Delta Live Tables when, instead of a change data feed, you need to process database snapshots, such as snapshots generated from an Oracle database, a MySQL database, or a data warehouse. Processing database snapshots is supported by the Delta Live Tables Python interface.
To process a CDF, use the APPLY CHANGES API. See How is CDC implemented with the APPLY CHANGES API?.
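In Python, a minimal sketch, assuming a hypothetical CDC source dataset keyed by customer_id with a sequence_num ordering column, looks like this:

```python
import dlt
from pyspark.sql.functions import col

# The target streaming table must exist before changes can be applied to it.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_feed",      # hypothetical CDC feed in the pipeline
    keys=["customer_id"],             # hypothetical primary key
    sequence_by=col("sequence_num"),  # orders out-of-order change events
    stored_as_scd_type=1,             # keep only the latest row per key
)
```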
To process database snapshots, use the APPLY CHANGES FROM SNAPSHOT API. See How is CDC implemented with the APPLY CHANGES FROM SNAPSHOT API?.
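A corresponding Python sketch, assuming a hypothetical table that holds a complete snapshot at each pipeline update:

```python
import dlt

dlt.create_streaming_table("customers_history")

dlt.apply_changes_from_snapshot(
    target="customers_history",
    source="customers_snapshot",  # hypothetical table containing a full snapshot
    keys=["customer_id"],         # hypothetical primary key
    stored_as_scd_type=2,         # track how each key changes across snapshots
)
```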
Transform data
Delta Live Tables offers two solutions for transforming data. Materialized views are a good default choice because they always provide the correct result and automatically reprocess source data if needed. Streaming tables are recommended for low-complexity transformations over very large streams and for advanced use cases.
Transform data with materialized views
Materialized views are the recommended default for transformations in Delta Live Tables. They are simple and accurate. However, their downside is higher latency because materialized views might process all input data to ensure queries against the materialized view return the correct result.
Transform a single table with a materialized view
A materialized view can read from a Delta table or streaming table and perform arbitrary transformations on the input data. Materialized views can read all Delta tables, including those produced by systems other than Databricks, making them useful for migrations and hybrid pipelines.
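For example, here is a minimal sketch with a hypothetical raw_orders source table. Because the query uses a batch read, the result is a materialized view:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Orders with invalid rows filtered out.")
def orders_clean():
    # A batch read makes this a materialized view: results are kept correct
    # on every update, incrementally where possible.
    return (
        spark.read.table("raw_orders")  # hypothetical upstream Delta table
        .where(col("order_id").isNotNull())
        .select("order_id", "customer_id", "status", "amount")
    )
```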
Join a fact table with a dimension table (stream-snapshot join) with a materialized view
Materialized views can perform efficient, incremental joins between a base Delta table or streaming table and a “lookup” Delta table. These joins will be processed incrementally whenever possible. You do not need to use watermarks with materialized views and stream-snapshot joins.
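A sketch with hypothetical sales (fact) and customers (dimension) tables:

```python
import dlt

@dlt.table(comment="Sales enriched with customer attributes.")
def sales_enriched():
    sales = spark.read.table("sales")          # hypothetical fact table
    customers = spark.read.table("customers")  # hypothetical dimension table
    # No watermark is needed; the materialized view keeps the result correct.
    return sales.join(customers, on="customer_id", how="left")
```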
Join two fact tables (stream-stream join)
Materialized views can perform efficient, incremental joins between two streaming tables or Delta tables. This is known as a stream-stream join, and materialized views will perform it incrementally whenever possible. You do not need to use watermarks with materialized views and stream-stream joins.
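A sketch with hypothetical orders and shipments fact tables:

```python
import dlt

@dlt.table(comment="Orders matched with their shipments.")
def orders_with_shipments():
    orders = spark.read.table("orders")        # hypothetical fact table
    shipments = spark.read.table("shipments")  # hypothetical fact table
    return orders.join(shipments, on="order_id", how="inner")
```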
Transform data with streaming tables
Streaming tables are recommended when you need to transform high-volume streaming data with low latency.
Transform a single table with a streaming table
Streaming tables can be used to transform data from any Delta table or another streaming table.
The following caveat applies to this use case:
- When you update the streaming table’s definition, existing data in the streaming table won’t be updated to reflect the change unless you fully refresh it.
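A minimal sketch, assuming a hypothetical raw_events source table:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Low-latency projection of click events.")
def click_events():
    # A streaming read makes this a streaming table: each update processes
    # only rows appended to the source since the last update.
    return (
        spark.readStream.table("raw_events")  # hypothetical source table
        .where(col("event_type") == "click")
        .select("user_id", "event_time", "url")
    )
```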
Join a fact table with a dimension table (stream-snapshot join) using a streaming table
Streaming tables can join a fact table with a dimension table.
The following caveats apply to this use case:
- When you update the streaming table’s definition, existing data in the streaming table won’t be updated to reflect the change unless you fully refresh it.
- When you update the lookup table, existing data in the streaming table won’t be updated to reflect the change unless you fully refresh it.
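A sketch with a hypothetical streaming transactions fact table and a static accounts dimension table:

```python
import dlt

@dlt.table(comment="Streaming transactions enriched with account details.")
def transactions_enriched():
    transactions = spark.readStream.table("transactions")  # hypothetical fact stream
    accounts = spark.read.table("accounts")                # hypothetical dimension table
    # Each transaction is joined against the dimension as of processing time;
    # later changes to accounts are not reflected without a full refresh.
    return transactions.join(accounts, on="account_id", how="left")
```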
Join two fact tables (stream-stream join) using a streaming table
Streaming tables can join two or more fact tables. This is known as a stream-stream join.
The following caveats apply to this use case:
- When you update the streaming table’s definition, existing data in the streaming table won’t be updated to reflect the change unless you fully refresh it.
- To avoid out-of-memory errors, you must use watermarks on both sides of the join and in aggregations.
- Out-of-order and late-arriving data is not handled automatically and can lead to inaccurate results, so you must handle it yourself.
See Use watermarks with stream-stream joins.
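As a sketch, assuming hypothetical impressions and clicks streams with event-time columns, watermarks on both sides plus a time-bounded join condition let Spark discard old state:

```python
import dlt
from pyspark.sql.functions import expr

@dlt.table(comment="Impressions joined with clicks inside a bounded window.")
def impressions_with_clicks():
    impressions = (
        spark.readStream.table("impressions")         # hypothetical fact stream
        .withWatermark("impression_time", "2 hours")  # bound state on this side
        .alias("i")
    )
    clicks = (
        spark.readStream.table("clicks")              # hypothetical fact stream
        .withWatermark("click_time", "3 hours")       # bound state on this side
        .alias("c")
    )
    # Constraining the join by key and event time lets old state be dropped,
    # which avoids out-of-memory errors on long-running streams.
    return impressions.join(
        clicks,
        expr("""
            i.ad_id = c.ad_id AND
            c.click_time >= i.impression_time AND
            c.click_time <= i.impression_time + interval 1 hour
        """),
        "inner",
    )
```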