Run an update on a Delta Live Tables pipeline

This article explains what a Delta Live Tables pipeline update is and how to run one.

After you create a pipeline and are ready to run it, you start an update. A pipeline update does the following:

  • Starts a cluster with the correct configuration.

  • Discovers all the tables and views defined in the pipeline, and checks for analysis errors such as invalid column names, missing dependencies, and syntax errors.

  • Creates or updates tables and views with the most recent data available.

You can use a Validate update to check for problems in a pipeline’s source code without waiting for tables to be created or updated. Validate is useful when developing or testing pipelines because it lets you quickly find and fix errors, such as incorrect table or column names.

Start a pipeline update

Databricks provides several options to start pipeline updates, including the following:

  • In the Delta Live Tables UI, you have the following options:

    • Click the Start button on the pipeline details page.

    • From the pipelines list, click the right arrow icon in the Actions column.

  • To start an update in a notebook, click Delta Live Tables > Start in the notebook toolbar. See Open or run a Delta Live Tables pipeline from a notebook.

  • You can trigger pipelines programmatically using the API or CLI, as shown in the sketch after this list. See Delta Live Tables API guide.

  • You can schedule the pipeline as a job using the Delta Live Tables UI or the jobs UI. See Schedule a pipeline.
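
The following is a minimal sketch of starting an update with a call to the Pipelines REST API. It assumes a workspace URL in the DATABRICKS_HOST environment variable, a personal access token in DATABRICKS_TOKEN, and a placeholder pipeline ID.

```python
# Minimal sketch: start a pipeline update through the Databricks REST API.
# Assumes DATABRICKS_HOST (for example, https://<workspace>.cloud.databricks.com)
# and DATABRICKS_TOKEN are set; replace the pipeline ID with your own.
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
pipeline_id = "<your-pipeline-id>"

resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={"full_refresh": False},  # True requests a full refresh instead
)
resp.raise_for_status()
print("Started update:", resp.json()["update_id"])
```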

How Delta Live Tables updates tables and views

Which tables and views are updated, and how they are updated, depends on the update type:

  • Refresh all: All live tables are updated to reflect the current state of their input data sources. For all streaming tables, new rows are appended to the table.

  • Full refresh all: All live tables are updated to reflect the current state of their input data sources. For all streaming tables, Delta Live Tables attempts to clear all data from each table and then load all data from the streaming source.

  • Refresh selection: The behavior of refresh selection is identical to refresh all, but allows you to refresh only selected tables. Selected live tables are updated to reflect the current state of their input data sources. For selected streaming tables, new rows are appended to the table.

  • Full refresh selection: The behavior of full refresh selection is identical to full refresh all, but allows you to perform a full refresh of only selected tables. Selected live tables are updated to reflect the current state of their input data sources. For selected streaming tables, Delta Live Tables attempts to clear all data from each table and then load all data from the streaming source.

For existing live tables, an update has the same behavior as a SQL REFRESH on a materialized view. For new live tables, the behavior is the same as a SQL CREATE operation.

Start a pipeline update for selected tables

You might want to reprocess data for only selected tables in your pipeline. For example, during development you change only a single table and want to reduce testing time, or a pipeline update fails and you want to refresh only the failed tables.

Note

You can use selective refresh only with triggered pipelines.

To start an update that refreshes selected tables only, on the Pipeline details page:

  1. Click Select tables for refresh. The Select tables for refresh dialog appears.

    If you do not see the Select tables for refresh button, make sure the Pipeline details page displays the latest update and that the update is complete. If a DAG is not displayed for the latest update, for example, because the update failed, the Select tables for refresh button is not displayed.

  2. To select the tables to refresh, click on each table. The selected tables are highlighted and labeled. To remove a table from the update, click on the table again.

  3. Click Refresh selection.

    Note

    The Refresh selection button displays the number of selected tables in parentheses.

To reprocess data that has already been ingested for the selected tables, click the down arrow next to the Refresh selection button and click Full refresh selection.
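
You can also start a selective refresh programmatically. The following sketch uses the refresh_selection field of the Pipelines REST API request that starts an update; the table names and pipeline ID are hypothetical.

```python
# Sketch: refresh only selected tables through the Pipelines REST API.
# Table names below are hypothetical; use the names defined in your pipeline.
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
pipeline_id = "<your-pipeline-id>"

resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    # Use "full_refresh_selection" instead to clear and reload the tables.
    json={"refresh_selection": ["cleaned_orders", "daily_totals"]},
)
resp.raise_for_status()
```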

Start a pipeline update for failed tables

If a pipeline update fails because of errors in one or more tables in the pipeline graph, you can start an update of only failed tables and any downstream dependencies.

Note

Excluded tables are not refreshed, even if they depend on a failed table.

To update failed tables, on the Pipeline details page, click Refresh failed tables.

To update only selected failed tables:

  1. Click the down arrow next to the Refresh failed tables button and click Select tables for refresh. The Select tables for refresh dialog appears.

  2. To select the tables to refresh, click on each table. The selected tables are highlighted and labeled. To remove a table from the update, click on the table again.

  3. Click Refresh selection.

    Note

    The Refresh selection button displays the number of selected tables in parentheses.

To reprocess data that has already been ingested for the selected tables, click the down arrow next to the Refresh selection button and click Full refresh selection.
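
The Pipelines REST API has no single field for refreshing failed tables, but you can approximate the UI behavior by scanning recent pipeline events for failed flows and starting a selective refresh of those tables. This is a sketch: the event field names (event_type, details.flow_progress.status, origin.flow_name) follow the published event log schema, but verify them against the events in your workspace.

```python
# Sketch: find flows that failed in recent events, then refresh just those.
import json
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
pipeline_id = "<your-pipeline-id>"
headers = {"Authorization": f"Bearer {token}"}

events = requests.get(
    f"{host}/api/2.0/pipelines/{pipeline_id}/events",
    headers=headers,
    params={"max_results": 100},
).json().get("events", [])

failed = set()
for ev in events:
    if ev.get("event_type") != "flow_progress":
        continue
    details = ev.get("details") or {}
    if isinstance(details, str):  # details may be serialized as a JSON string
        details = json.loads(details)
    if details.get("flow_progress", {}).get("status") == "FAILED":
        failed.add(ev.get("origin", {}).get("flow_name"))

if failed:
    requests.post(
        f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
        headers=headers,
        json={"refresh_selection": sorted(name for name in failed if name)},
    ).raise_for_status()
```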

Check a pipeline for errors without waiting for tables to update

Preview

The Delta Live Tables Validate update feature is in Public Preview.

To check whether a pipeline’s source code is valid without running a full update, use Validate. A Validate update resolves the definitions of datasets and flows defined in the pipeline but does not materialize or publish any datasets. Errors found during validation, such as incorrect table or column names, are reported in the UI.

To run a Validate update, on the pipeline details page, click the down arrow next to Start and click Validate.
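
You can also start a Validate update programmatically by setting the validate_only flag on the request that starts an update. A minimal sketch, assuming the same environment variables as the earlier sketches:

```python
# Sketch: start a Validate update through the Pipelines REST API.
# validate_only resolves dataset and flow definitions without materializing
# or publishing any datasets.
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
pipeline_id = "<your-pipeline-id>"

resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={"validate_only": True},
)
resp.raise_for_status()
```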

After the Validate update completes, the event log shows events related only to the Validate update, and no metrics are displayed in the DAG. If errors are found, details are available in the event log.

You can see results only for the most recent Validate update. If the Validate update was the most recently run update, you can see its results by selecting it in the update history. If another update runs after the Validate update, the results are no longer available in the UI.

Continuous vs. triggered pipeline execution

In triggered execution mode, the system stops processing after it successfully refreshes all tables or selected tables in the pipeline once. Each table in the update is refreshed based on the data available when the update started.

In continuous execution mode, Delta Live Tables processes new data as it arrives in data sources, keeping tables throughout the pipeline fresh.

The execution mode is independent of the type of table being computed. Both materialized views and streaming tables can be updated in either execution mode. To avoid unnecessary processing in continuous execution mode, pipelines automatically monitor dependent Delta tables and perform an update only when the contents of those dependent tables have changed.

Note

The Delta Live Tables runtime is not able to detect changes in non-Delta data sources. The table is still updated regularly, but with a higher default trigger interval to prevent excessive recomputation from slowing down any incremental processing happening on the cluster.
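
If the default interval does not fit your latency needs, you can set the trigger interval explicitly. The following sketch sets pipelines.trigger.interval for a single streaming table in a Python pipeline; the table name and source path are hypothetical.

```python
import dlt

# Sketch: override the trigger interval for one table in a continuous
# pipeline. The table name and source path below are hypothetical;
# `spark` is provided by the pipeline runtime.
@dlt.table(
    spark_conf={"pipelines.trigger.interval": "10 minutes"}
)
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/default/landing/raw_events")
    )
```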

Table comparing data pipeline execution modes

The following table highlights differences between these execution modes:

| | Triggered | Continuous |
| --- | --- | --- |
| When does the update stop? | Automatically once complete. | Runs continuously until manually stopped. |
| What data is processed? | Data available when the update is started. | All data as it arrives at configured sources. |
| What data freshness requirements is this best for? | Data updates run every 10 minutes, hourly, or daily. | Data updates desired between every 10 seconds and a few minutes. |

Triggered pipelines can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline. However, new data won’t be processed until the pipeline is triggered. Continuous pipelines require an always-running cluster, which is more expensive but reduces processing latency.

You can configure execution mode with the Pipeline mode option in the settings.
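
In the pipeline settings JSON, the Pipeline mode option corresponds to the top-level continuous flag. The following sketch toggles it through the Pipelines REST API using a read-modify-write pattern, assuming the GET response exposes the settings under a spec key.

```python
# Sketch: switch a pipeline between triggered and continuous execution by
# editing its settings through the Pipelines REST API.
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
pipeline_id = "<your-pipeline-id>"
headers = {"Authorization": f"Bearer {token}"}

# Read the current settings, flip the flag, and write the settings back.
spec = requests.get(
    f"{host}/api/2.0/pipelines/{pipeline_id}", headers=headers
).json()["spec"]
spec["continuous"] = True  # False selects triggered execution
requests.put(
    f"{host}/api/2.0/pipelines/{pipeline_id}", headers=headers, json=spec
).raise_for_status()
```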

How to choose pipeline boundaries

A Delta Live Tables pipeline can process updates to a single table, many tables with dependent relationships, many tables without relationships, or multiple independent flows of tables with dependent relationships. This section contains considerations to help you determine how to break up your pipelines.

Larger Delta Live Tables pipelines have a number of benefits. These include the following:

  • Use cluster resources more efficiently.

  • Reduce the number of pipelines in your workspace.

  • Reduce the complexity of workflow orchestration.

Some common recommendations on how processing pipelines should be split include the following:

  • Split functionality at team boundaries. For example, your data team may maintain pipelines to transform data while your data analysts maintain pipelines that analyze the transformed data.

  • Split functionality at application-specific boundaries to reduce coupling and facilitate the re-use of common functionality.

Development and production modes

You can optimize pipeline execution by switching between development and production modes. Use the environment toggle buttons in the Pipelines UI to switch between these two modes. By default, pipelines run in development mode.

When you run your pipeline in development mode, the Delta Live Tables system does the following:

  • Reuses a cluster to avoid the overhead of restarts. By default, clusters run for two hours when development mode is enabled. You can change this with the pipelines.clusterShutdown.delay setting described in Configure your compute settings; see also the sketch after this list.

  • Disables pipeline retries so you can immediately detect and fix errors.
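
The sketch below shows one way to change these settings, reusing the read-modify-write pattern from the execution mode example. The development flag and the pipelines.clusterShutdown.delay configuration key come from the behavior described above; the "120m" value is only an example.

```python
# Sketch: enable development mode and extend the cluster shutdown delay by
# editing pipeline settings through the Pipelines REST API. "120m" is an
# example value, not a recommendation.
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
pipeline_id = "<your-pipeline-id>"
headers = {"Authorization": f"Bearer {token}"}

spec = requests.get(
    f"{host}/api/2.0/pipelines/{pipeline_id}", headers=headers
).json()["spec"]
spec["development"] = True
spec.setdefault("configuration", {})["pipelines.clusterShutdown.delay"] = "120m"
requests.put(
    f"{host}/api/2.0/pipelines/{pipeline_id}", headers=headers, json=spec
).raise_for_status()
```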

In production mode, the Delta Live Tables system does the following:

  • Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.

  • Retries execution in the event of specific errors, for example, a failure to start a cluster.

Note

Switching between development and production modes only controls cluster and pipeline execution behavior. Storage locations and target schemas in the catalog for publishing tables must be configured as part of pipeline settings and are not affected when switching between modes.

Schedule a pipeline

You can start a triggered pipeline manually or run the pipeline on a schedule with a Databricks job. You can create and schedule a job with a single pipeline task directly in the Delta Live Tables UI or add a pipeline task to a multi-task workflow in the jobs UI.

To create a single-task job and a schedule for the job in the Delta Live Tables UI:

  1. Click Schedule > Add a schedule. The Schedule button is updated to show the number of existing schedules if the pipeline is included in one or more scheduled jobs, for example, Schedule (5).

  2. Enter a name for the job in the Job name field.

  3. Set the Schedule to Scheduled.

  4. Specify the period, starting time, and time zone.

  5. Configure one or more email addresses to receive alerts on pipeline start, success, or failure.

  6. Click Create.
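
You can also create the scheduled job programmatically. The following sketch uses the Jobs REST API (2.1) to create a job with a single pipeline task, a daily cron schedule, and a failure notification; the job name, cron expression, time zone, and email address are examples.

```python
# Sketch: create a scheduled job with a single pipeline task through the
# Jobs REST API 2.1. Names, schedule, and email address are examples.
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "nightly-dlt-refresh",
        "tasks": [
            {
                "task_key": "run_pipeline",
                "pipeline_task": {"pipeline_id": "<your-pipeline-id>"},
            }
        ],
        "schedule": {
            # Quartz cron: run at 02:00 every day.
            "quartz_cron_expression": "0 0 2 * * ?",
            "timezone_id": "UTC",
        },
        "email_notifications": {"on_failure": ["you@example.com"]},
    },
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```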