Load and transform data with Delta Live Tables
The articles in this section provide common patterns, recommendations, and examples of data ingestion and transformation in Delta Live Tables pipelines. The initial datasets in a pipeline, created by ingesting source data, are commonly called bronze tables and often perform only simple transformations. By contrast, the final tables in a pipeline, commonly referred to as gold tables, often require complicated aggregations or read from sources that are the targets of an APPLY CHANGES INTO operation.
Load data
You can load data from any data source supported by Apache Spark on Databricks using Delta Live Tables. For examples of patterns for loading data from different sources, including cloud object storage, message buses like Kafka, and external systems like PostgreSQL, see Load data with Delta Live Tables. These examples feature recommendations like using streaming tables with Auto Loader in Delta Live Tables for an optimized ingestion experience.
Data flows
In Delta Live Tables, a flow is a streaming query that processes source data incrementally to update a target streaming table. Many streaming queries needed to implement a Delta Live Tables pipeline create an implicit flow as part of the query definition. Delta Live Tables also supports explicitly declaring flows when more specialized processing is required. To learn more about Delta Live Tables flows and see examples of using flows to implement data processing tasks, see Load and process data incrementally with Delta Live Tables flows.
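The idea of incremental processing can be illustrated with a minimal plain-Python sketch. This is not the Delta Live Tables API: the `Flow` class, its `offset` checkpoint, and the uppercase transformation are all hypothetical and exist only to show that each update touches only the records that arrived since the previous one.

```python
from dataclasses import dataclass, field


@dataclass
class Flow:
    """Hypothetical stand-in for a flow: appends unseen source rows to a target."""
    source: list                                  # append-only source rows
    target: list = field(default_factory=list)    # stands in for the streaming table
    offset: int = 0                               # checkpoint: rows already processed

    def update(self) -> int:
        """Process only rows added since the last update; return how many."""
        new_rows = self.source[self.offset:]
        self.target.extend(row.upper() for row in new_rows)  # trivial transformation
        self.offset = len(self.source)
        return len(new_rows)


source = ["a", "b"]
flow = Flow(source)
first = flow.update()    # processes "a" and "b"
source.append("c")
second = flow.update()   # processes only "c"; earlier rows are not reread
```

In a real pipeline the checkpointing is managed by the streaming engine rather than by user code.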
Change data capture (CDC)
Change data capture (CDC) is a data integration pattern that captures changes made to data in a source system, such as inserts, updates, and deletes. CDC is commonly used to efficiently replicate tables from a source system into Databricks. Delta Live Tables simplifies CDC with the APPLY CHANGES API. By automatically handling out-of-sequence records, the APPLY CHANGES API ensures that CDC records are processed correctly and removes the need to write that logic yourself. See What is change data capture (CDC)? and The APPLY CHANGES APIs: Simplify change data capture with Delta Live Tables.
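To make the out-of-sequence problem concrete, the following plain-Python sketch shows the kind of logic that would otherwise have to be written by hand. It is an illustration only, not the APPLY CHANGES API: the `apply_cdc_events` helper, the `op` and `seq` column names, and the sample events are all assumptions. It keeps only the latest version of each row and uses a sequencing column to ignore changes that arrive after a newer change for the same key.

```python
def apply_cdc_events(target, events, key, sequence_by):
    """Upsert CDC events into `target` (a dict keyed by `key`).

    Hypothetical helper: tolerates out-of-order arrival by comparing each
    event's sequence value with the highest one already applied for its key.
    """
    latest_seq = {}  # highest sequence value applied so far, per key
    for event in events:
        k, seq = event[key], event[sequence_by]
        if k in latest_seq and seq <= latest_seq[k]:
            continue  # stale event: a newer change for this key was already applied
        latest_seq[k] = seq  # also remembered for deletes, so old rows are not revived
        if event["op"] == "DELETE":
            target.pop(k, None)
        else:  # INSERT or UPDATE
            target[k] = {c: v for c, v in event.items() if c not in ("op", sequence_by)}
    return target


events = [
    {"id": 1, "name": "Ana", "op": "INSERT", "seq": 1},
    {"id": 1, "name": "Ana Maria", "op": "UPDATE", "seq": 3},
    {"id": 1, "name": "Anna", "op": "UPDATE", "seq": 2},  # arrives late, ignored
    {"id": 2, "name": "Bo", "op": "DELETE", "seq": 5},
    {"id": 2, "name": "Bo", "op": "INSERT", "seq": 4},    # arrives after its delete, ignored
]
table = apply_cdc_events({}, events, key="id", sequence_by="seq")
```

After the run, `table` holds one row for `id` 1 with the name from the event with the highest sequence value, and no row for `id` 2.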
Transform data
With Delta Live Tables, you can declare transformations on datasets and specify how records are processed through query logic. For examples of common transformation patterns when building out Delta Live Tables pipelines, including usage of streaming tables, materialized views, stream-static joins, and MLflow models in pipelines, see Transform data with Delta Live Tables.
Optimize stateful processing in Delta Live Tables with watermarks
To effectively manage data kept in state, you can use watermarks when performing stateful stream processing in Delta Live Tables, including aggregations, joins, and deduplication. A watermark is an Apache Spark feature that defines a time-based threshold for how late data can arrive and still be included in a stateful operation. Arriving data is processed until the threshold is reached, at which point the time window defined by the threshold is closed. Watermarks bound the amount of state a query must retain, which helps avoid problems such as growing memory use and latency, mainly when processing larger datasets or running long-lived queries.
For examples and recommendations, see Optimize stateful processing in Delta Live Tables with watermarks.
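The watermark behavior described above can be sketched in plain Python. This is a simplified model, not Spark or Delta Live Tables code: the `windowed_counts` function and its `window` and `delay` parameters are hypothetical, and it advances the watermark after every record, whereas Spark updates it per micro-batch. It counts events in tumbling windows, closes a window once the watermark (the largest event time seen minus the delay threshold) passes the end of that window, and drops events that arrive for a window that is already closed.

```python
def windowed_counts(event_times, window=10, delay=5):
    """Count events per tumbling window, bounding state with a watermark.

    Returns (closed, open_state, dropped). All names are hypothetical.
    """
    open_state = {}  # window start -> running count (the state being bounded)
    closed = {}      # finalized windows, removed from state
    dropped = []     # events that arrived after their window closed
    max_seen = float("-inf")
    for t in event_times:
        start = (t // window) * window
        watermark = max_seen - delay
        if start + window <= watermark:
            dropped.append(t)  # too late: its window is already closed
            continue
        open_state[start] = open_state.get(start, 0) + 1
        max_seen = max(max_seen, t)
        watermark = max_seen - delay
        # Close every window that ends at or before the watermark.
        for s in [s for s in open_state if s + window <= watermark]:
            closed[s] = open_state.pop(s)
    return closed, open_state, dropped


# Event time 8 is late but within the threshold; event time 3 is too late.
closed, open_state, dropped = windowed_counts([1, 4, 12, 8, 16, 3, 27])
```

Only the window starting at 20 remains in state at the end, which is the point of the technique: state for closed windows is released instead of being retained for the life of the query.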