Incremental refresh for materialized views

Preview

This feature is in Public Preview.

This article outlines the semantics and requirements for incremental refreshes on materialized views, and identifies the SQL operations, keywords, and clauses that support incremental refresh. It includes discussion of the differences between incremental and full refreshes, and includes recommendations for choosing between materialized views and streaming tables.

When running updates on materialized views using serverless pipelines, many queries can be incrementally refreshed. Incremental refreshes save compute costs by detecting changes in the data sources used to define the materialized view and incrementally computing the result.

Serverless pipelines are required for incremental refresh

Incremental refresh for materialized views requires serverless pipelines.

Refresh operations for materialized views defined in Databricks SQL always run using serverless pipelines.

For materialized views defined using Delta Live Tables pipelines, you must configure the pipeline to use serverless. See Configure a serverless Delta Live Tables pipeline.

What are the refresh semantics for materialized views?

Materialized views guarantee equivalent results to batch queries. For example, consider the following aggregate query:

SELECT account_id,
  COUNT(txn_id) txn_count,
  SUM(txn_amount) account_revenue
FROM transactions_table
GROUP BY account_id

When you run this query using any Databricks product, the result is computed using batch semantics to aggregate all records in the source transactions_table, meaning that all source data is scanned and aggregated in one operation.

Note

Some Databricks products cache results automatically within or across sessions if data sources have not changed after the last query has run. Automatic caching behaviors differ from materialized views.

The following example turns this batch query into a materialized view:

CREATE OR REPLACE MATERIALIZED VIEW transation_summary AS
SELECT account_id,
  COUNT(txn_id) txn_count,
  SUM(txn_amount) account_revenue
FROM transactions_table
GROUP BY account_id

When you refresh a materialized view, the computed result is identical to the batch query semantics. This query is an example of a materialized view that can be incrementally refreshed, meaning that the refresh operation makes a best-effort attempt to only process new or changed data in the source transactions_table to compute the results.

Data source considerations for materialized views

While you can define a materialized view against any data source, not all data sources are well-suited to materialized views. Consider the following caveats and recommendations:

Important

Materialized views make a best-effort attempt to incrementally refresh results for supported operations. Some changes in data sources require a full refresh.

All data sources for materialized views should be robust to full refresh semantics, even if the query that defines the materialized view supports incremental refresh.

  • For queries where a full refresh would be cost-prohibitive, use streaming tables to guarantee exactly-once processing. Examples include very large tables.

  • Do not define a materialized view against a data source if records should only be processed once. Instead, use streaming tables. Examples include the following:

    • Data sources that don’t retain data history, such as Kafka.

    • Ingest operations, such as queries that use Auto Loader to ingest data from cloud object storage.

    • Any data source where you plan to delete or archive data after processing but need to retain information in downstream tables. For example, a date-partitioned table where you plan to delete records older than a certain threshold.

  • Not all data sources support incremental refreshes. The following data sources support incremental refresh:

    • Delta tables, including Unity Catalog managed tables and external tables backed by Delta Lake.

    • Materialized views.

    • Streaming tables, including the targets of APPLY CHANGES INTO operations.

  • Some incremental refresh operations require row-tracking to be enabled on the queried data sources. Row-tracking is a Delta Lake feature only supported by Delta tables, which include materialized views, streaming tables, and Unity Catalog managed tables. See Use row tracking for Delta tables.

Optimize materialized views

To get the best performance, Databricks recommends enabling the following features on all materialized view source tables:

Refresh types for materialized views

Refreshes to materialized views are either full or incremental. For all operations, the results of an incremental refresh and full refresh are the same. Databricks runs a cost analysis to identify if changes to data sources require a full refresh.

To determine which refresh type an update used, see Determine the refresh type of an update.

Full refresh

A full refresh overwrites the results in the materialized view by reprocessing all data available in the source. All materialized views might be fully refreshed on any given update, depending on how the data sources have changed.

You can optionally force a full refresh. For materialized views defined using Databricks SQL, use the following syntax:

REFRESH MATERIALIZED VIEW mv_name FULL

For materialized views defined in a Delta Live Tables pipeline, you can choose to run a full refresh on selected datasets or on all datasets in a pipeline. See Pipeline refresh semantics.

Important

When a full refresh runs against a data source where records have been removed due to data retention threshold or manual deletion, removed records are not reflected in computed results. You may be unable to recover old data if the data is no longer available in the source.

Note

You can optionally disable full refreshes on a table by setting the table property pipelines.reset.allowed to false.

Incremental refresh

An incremental refresh processes changes in the underlying data after the last refresh and then appends that data to the table. Depending on the base tables and included operations, only certain types of materialized views can be incrementally refreshed.

Only materialized views updated using serverless pipelines can use incremental refresh. Materialized views that do not use serverless pipelines are always fully refreshed.

When materialized views are created using a SQL warehouse or serverless Delta Live Tables pipeline, they are automatically incrementally refreshed if their queries are supported. If a query includes unsupported expressions for an incremental refresh, a full refresh is performed, potentially resulting in additional costs.

Support for materialized view incremental refresh

The following table lists support for incremental refresh by SQL keyword or clause.

Important

Some keywords and clauses require row-tracking to be enabled on the queried data sources. See Use row tracking for Delta tables.

These keywords and clauses are marked with a star (*) in the following table.

SQL keyword or clause

Support for incremental refresh

SELECT expressions*

Yes, expressions including deterministic built-in functions and immutable user-defined functions (UDFs) are supported.

GROUP BY

Yes

WITH

Yes, common table expressions are supported.

UNION ALL*

Yes

FROM

Supported base tables include Delta tables, materialized views, and streaming tables.

WHERE, HAVING*

Filter clauses such as WHERE and HAVING are supported.

INNER JOIN*

Yes

LEFT OUTER JOIN*

Yes

FULL OUTER JOIN*

Yes

RIGHT OUTER JOIN*

Yes

OVER

Yes. PARTITION_BY columns must be specified for incrementalization on window functions.

QUALIFY

Yes

EXPECTATIONS

No. Materialized views that use expectations are always fully refreshed.

Note

Non-deterministic functions, for example, CURRENT_TIMESTAMP, are not supported.

Determine the refresh type of an update

To optimize the performance of materialized view refreshes, Databricks uses a cost model to select the technique used for the refresh. The following table describes these techniques:

Technique

Incremental refresh?

Description

FULL_RECOMPUTE

No

The materialized view was fully recomputed

NO_OP

Not applicable

The materialized view was not updated because no changes to the base table were detected.

ROW_BASED or PARTITION_OVERWRITE

Yes

The materialized view was incrementally refreshed using the specified technique.

To determine the technique used, query the Delta Live Tables event log where the event_type is planning_information:

SELECT
  timestamp,
  message
FROM
  event_log(TABLE(<fully-qualified-table-name>))
WHERE
  event_type = 'planning_information'
ORDER BY
  timestamp desc;

Replace <fully-qualified-table-name> with the fully qualified name of the materialized view, including the catalog and schema.

See What is the Delta Live Tables event log?.