Incremental refresh for materialized views
/shared/preview.md
This article outlines the semantics and requirements for incremental refreshes on materialized views, and identifies the SQL operations, keywords, and clauses that support incremental refresh. It includes discussion of the differences between incremental and full refreshes, and includes recommendations for choosing between materialized views and streaming tables.
When running updates on materialized views using serverless pipelines, many queries can be incrementally refreshed. Incremental refreshes save compute costs by detecting changes in the data sources used to define the materialized view and incrementally computing the result.
Serverless pipelines are required for incremental refresh
Incremental refresh for materialized views requires serverless pipelines.
Refresh operations for materialized views defined in Databricks SQL always run using serverless pipelines.
For materialized views defined using Delta Live Tables pipelines, you must configure the pipeline to use serverless. See Configure a serverless Delta Live Tables pipeline.
What are the refresh semantics for materialized views?
Materialized views guarantee equivalent results to batch queries. For example, consider the following aggregate query:
SELECT account_id,
COUNT(txn_id) txn_count,
SUM(txn_amount) account_revenue
FROM transactions_table
GROUP BY account_id
When you run this query using any Databricks product, the result is computed using batch semantics to aggregate all records in the source transactions_table
, meaning that all source data is scanned and aggregated in one operation.
Note
Some Databricks products cache results automatically within or across sessions if data sources have not changed after the last query has run. Automatic caching behaviors differ from materialized views.
The following example turns this batch query into a materialized view:
CREATE OR REFRESH MATERIALIZED VIEW transation_summary AS
SELECT account_id,
COUNT(txn_id) txn_count,
SUM(txn_amount) account_revenue
FROM transactions_table
GROUP BY account_id
When you refresh a materialized view, the computed result is identical to the batch query semantics. This query is an example of a materialized view that can be incrementally refreshed, meaning that the refresh operation makes a best-effort attempt to only process new or changed data in the source transactions_table
to compute the results.
Data source considerations for materialized views
While you can define a materialized view against any data source, not all data sources are well-suited to materialized views. Consider the following caveats and recommendations:
Important
Materialized views make a best-effort attempt to incrementally refresh results for supported operations. Some changes in data sources require a full refresh.
All data sources for materialized views should be robust to full refresh semantics, even if the query that defines the materialized view supports incremental refresh.
For queries where a full refresh would be cost-prohibitive, use streaming tables to guarantee exactly-once processing. Examples include very large tables.
Do not define a materialized view against a data source if records should only be processed once. Instead, use streaming tables. Examples include the following:
Data sources that don’t retain data history, such as Kafka.
Ingest operations, such as queries that use Auto Loader to ingest data from cloud object storage.
Any data source where you plan to delete or archive data after processing but need to retain information in downstream tables. For example, a date-partitioned table where you plan to delete records older than a certain threshold.
Not all data sources support incremental refreshes. The following data sources support incremental refresh:
Delta tables, including Unity Catalog managed tables and external tables backed by Delta Lake.
Materialized views.
Streaming tables, including the targets of
APPLY CHANGES INTO
operations.
Some incremental refresh operations require row-tracking to be enabled on the queried data sources. Row-tracking is a Delta Lake feature only supported by Delta tables, which include materialized views, streaming tables, and Unity Catalog managed tables. See Use row tracking for Delta tables.
Optimize materialized views
To get the best performance, Databricks recommends enabling the following features on all materialized view source tables:
Refresh types for materialized views
Refreshes to materialized views are either full or incremental. For all operations, the results of an incremental refresh and full refresh are the same. Databricks runs a cost analysis to identify if changes to data sources require a full refresh.
To determine which refresh type an update used, see Determine the refresh type of an update.
Full refresh
A full refresh overwrites the results in the materialized view by reprocessing all data available in the source. All materialized views might be fully refreshed on any given update, depending on how the data sources have changed.
You can optionally force a full refresh. For materialized views defined using Databricks SQL, use the following syntax:
REFRESH MATERIALIZED VIEW mv_name FULL
For materialized views defined in a Delta Live Tables pipeline, you can choose to run a full refresh on selected datasets or on all datasets in a pipeline. See How Delta Live Tables updates tables and views.
Important
When a full refresh runs against a data source where records have been removed due to data retention threshold or manual deletion, removed records are not reflected in computed results. You may be unable to recover old data if the data is no longer available in the source.
Note
You can optionally disable full refreshes on a table by setting the table property pipelines.reset.allowed
to false
.
Incremental refresh
An incremental refresh processes changes in the underlying data after the last refresh and then appends that data to the table. Depending on the base tables and included operations, only certain types of materialized views can be incrementally refreshed.
Only materialized views updated using serverless pipelines can use incremental refresh. Materialized views that do not use serverless pipelines are always fully refreshed.
When materialized views are created using a SQL warehouse or serverless Delta Live Tables pipeline, they are automatically incrementally refreshed if their queries are supported. If a query includes unsupported expressions for an incremental refresh, a full refresh is performed, potentially resulting in additional costs.
Support for materialized view incremental refresh
The following table lists support for incremental refresh by SQL keyword or clause.
Important
Some keywords and clauses require row-tracking to be enabled on the queried data sources. See Use row tracking for Delta tables.
These keywords and clauses are marked with a star (*) in the following table.
SQL keyword or clause |
Support for incremental refresh |
---|---|
|
Yes, expressions including deterministic built-in functions and immutable user-defined functions (UDFs) are supported. |
|
Yes |
|
Yes, common table expressions are supported. |
|
Yes |
|
Supported base tables include Delta tables, materialized views, and streaming tables. |
|
Filter clauses such as |
|
Yes |
|
Yes |
|
Yes |
|
Yes |
|
Yes. |
|
Yes |
|
No. Materialized views that use expectations are always fully refreshed. |
Note
Non-deterministic functions, for example, CURRENT_TIMESTAMP
, are not supported.
Determine the refresh type of an update
To optimize the performance of materialized view refreshes, Databricks uses a cost model to select the technique used for the refresh. The following table describes these techniques:
Technique |
Incremental refresh? |
Description |
---|---|---|
|
No |
The materialized view was fully recomputed |
|
Not applicable |
The materialized view was not updated because no changes to the base table were detected. |
|
Yes |
The materialized view was incrementally refreshed using the specified technique. |
To determine the technique used, query the Delta Live Tables event log where the event_type
is planning_information
:
SELECT
timestamp,
message
FROM
event_log(TABLE(<fully-qualified-table-name>))
WHERE
event_type = 'planning_information'
ORDER BY
timestamp desc;
Replace <fully-qualified-table-name>
with the fully qualified name of the materialized view, including the catalog and schema.