Configure pipeline settings for Delta Live Tables

This article provides details on configuring pipeline settings for Delta Live Tables. Delta Live Tables provides a user interface for configuring and editing pipeline settings. The UI also provides an option to display and edit settings in JSON.

Note

You can configure most settings with either the UI or a JSON specification. Some advanced options are only available using the JSON configuration.

Databricks recommends familiarizing yourself with Delta Live Tables settings using the UI. If necessary, you can directly edit the JSON configuration in the workspace. JSON configuration files are also useful when deploying pipelines to new environments or when using the CLI or REST API.

For a full reference to the Delta Live Tables JSON configuration settings, see Delta Live Tables pipeline configurations.

Choose a product edition

Select the Delta Live Tables product edition with the features best suited for your pipeline requirements. The following product editions are available:

  • Core to run streaming ingest workloads. Select the Core edition if your pipeline doesn’t require advanced features such as change data capture (CDC) or Delta Live Tables expectations.

  • Pro to run streaming ingest and CDC workloads. The Pro product edition supports all of the Core features, plus support for workloads that require updating tables based on changes in source data.

  • Advanced to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The Advanced product edition supports the features of the Core and Pro editions, and also supports enforcement of data quality constraints with Delta Live Tables expectations.

You can select the product edition when you create or edit a pipeline. You can select a different edition for each pipeline. See the Delta Live Tables product page.

Note

If your pipeline includes features not supported by the selected product edition, for example, expectations, you will receive an error message with the reason for the error. You can then edit the pipeline to select the appropriate edition.

Choose a pipeline mode

You can update your pipeline continuously or with manual triggers based on the pipeline mode. See Continuous vs. triggered pipeline execution.

Select a cluster policy

Users must have permissions to deploy compute to configure and update Delta Live Tables pipelines. Workspace admins can configure cluster policies to provide users with access to compute resources for Delta Live Tables. See Define limits on Delta Live Tables pipeline compute.

Note

  • Cluster policies are optional. Check with your workspace administrator if you lack compute privileges required for Delta Live Tables.

  • To ensure that cluster policy default values are correctly applied, set the apply_policy_default_values value to true in the cluster configurations in your pipeline configuration:

    {
      "clusters": [
        {
          "apply_policy_default_values": true
        }
      ]
    }
    

Configure source code libraries

You can use the file selector in the Delta Live Tables UI to configure the source code defining your pipeline. Pipeline source code is defined in Databricks notebooks or in SQL or Python scripts stored in workspace files. When you create or edit your pipeline, you can add one or more notebooks or workspace files or a combination of notebooks and workspace files.

Because Delta Live Tables automatically analyzes dataset dependencies to construct the processing graph for your pipeline, you can add source code libraries in any order.

You can also modify the JSON file to include Delta Live Tables source code defined in SQL and Python scripts stored in workspace files. The following example includes notebooks and workspace files from Databricks Repos:

{
  "name": "Example pipeline 3",
  "storage": "dbfs:/pipeline-examples/storage-location/example3",
  "libraries": [
    { "notebook": { "path": "/example-notebook_1" } },
    { "notebook": { "path": "/example-notebook_2" } },
    { "file": { "path": "/Repos/<user-name>@databricks.com/Apply_Changes_Into/apply_changes_into.sql" } },
    { "file": { "path": "/Repos/<user-name>@databricks.com/Apply_Changes_Into/apply_changes_into.py" } }
  ]
}

Specify a storage location

You can specify a storage location for a pipeline that publishes to the Hive metastore. The primary motivation for specifying a location is to control the object storage location for data written by your pipeline.

Because all tables, data, checkpoints, and metadata for Delta Live Tables pipelines are fully managed by Delta Live Tables, most interaction with Delta Live Tables datasets happens through tables registered to the Hive metastore or Unity Catalog.

Specify a target schema for pipeline output tables

While optional, you should specify a target to publish tables created by your pipeline anytime you move beyond development and testing for a new pipeline. Publishing a pipeline to a target makes datasets available for querying elsewhere in your Databricks environment. See Publish data from Delta Live Tables pipelines to the Hive metastore or Use Unity Catalog with your Delta Live Tables pipelines.

Configure your compute settings

Each Delta Live Tables pipeline has two associated clusters:

  • The updates cluster processes pipeline updates.

  • The maintenance cluster runs daily maintenance tasks.

The configuration used by these clusters is determined by the clusters attribute specified in your pipeline settings.

You can add compute settings that apply to only a specific type of cluster by using cluster labels. There are three labels you can use when configuring pipeline clusters:

Note

The cluster label setting can be omitted if you are defining only one cluster configuration. The default label is applied to cluster configurations if no setting for the label is provided. The cluster label setting is required only if you need to customize settings for different cluster types.

  • The default label defines compute settings to apply to both the updates and maintenance clusters. Applying the same settings to both clusters improves the reliability of maintenance runs by ensuring that required configurations, for example, data access credentials for a storage location, are applied to the maintenance cluster.

  • The maintenance label defines compute settings to apply to only the maintenance cluster. You can also use the maintenance label to override settings configured by the default label.

  • The updates label defines settings to apply to only the updates cluster. Use the updates label to configure settings that should not be applied to the maintenance cluster.

Settings defined using the default and updates labels are merged to create the final configuration for the updates cluster. If the same setting is defined using both default and updates labels, the setting defined with the updates label overrides the setting defined with the default label.

The following example defines a Spark configuration parameter that is added only to the configuration for the updates cluster:

{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    },
    {
      "label": "updates",
      "spark_conf": {
         "key": "value"
      }
    }
  ]
}

Delta Live Tables provides similar options for cluster settings as other compute on Databricks. Like other pipeline settings, you can modify the JSON configuration for clusters to specify options not present in the UI. See Compute.

Note

  • Because the Delta Live Tables runtime manages the lifecycle of pipeline clusters and runs a custom version of Databricks Runtime, you cannot manually set some cluster settings in a pipeline configuration, such as the Spark version or cluster names. See Cluster attributes that are not user settable.

  • You can configure Delta Live Tables pipelines to leverage Photon. See What is Photon?.

Select instance types to run a pipeline

By default, Delta Live Tables selects the instance types for the driver and worker nodes that run your pipeline, but you can also manually configure the instance types. For example, you might want to select instance types to improve pipeline performance or address memory issues when running your pipeline. You can configure instance types when you create or edit a pipeline with the REST API, or in the Delta Live Tables UI.

To configure instance types when you create or edit a pipeline in the Delta Live Tables UI:

  1. Click the Settings button.

  2. On the Pipeline settings page, click the JSON button.

  3. Enter the instance type configurations in the cluster configuration:

Note

To avoid assigning unnecessary resources to the maintenance cluster, this example uses the updates label to set the instance types for only the updates cluster. To assign the instance types to both updates and maintenance clusters, use the default label or omit the setting for the label. The default label is applied to pipeline cluster configurations if no setting for the label is provided. See Configure your compute settings.

{
  "clusters": [
    {
      "label": "updates",
      "node_type_id": "n1-highmem-16",
      "driver_node_type_id": "n1-standard-4",
      "..." : "..."
    }
  ]
}

Use autoscaling to increase efficiency and reduce resource usage

Use Enhanced Autoscaling to optimize the cluster utilization of your pipelines. Enhanced Autoscaling adds additional resources only if the system determines those resources will increase pipeline processing speed. Resources are freed when no longer needed, and clusters are shut down as soon as all pipeline updates are complete.

Use the following guidelines when configuring Enhanced Autoscaling for production pipelines:

  • Leave the Min workers setting at the default.

  • Set the Max workers setting to a value based on budget and pipeline priority.

Delay compute shutdown

Because a Delta Live Tables cluster automatically shuts down when not in use, referencing a cluster policy that sets autotermination_minutes in your cluster configuration results in an error. To control cluster shutdown behavior, you can use development or production mode or use the pipelines.clusterShutdown.delay setting in the pipeline configuration. The following example sets the pipelines.clusterShutdown.delay value to 60 seconds:

{
    "configuration": {
      "pipelines.clusterShutdown.delay": "60s"
    }
}

When production mode is enabled, the default value for pipelines.clusterShutdown.delay is 0 seconds. When development mode is enabled, the default value is 2 hours.

Create a single node cluster

If you set num_workers to 0 in cluster settings, the cluster is created as a Single Node cluster. Configuring an autoscaling cluster and setting min_workers to 0 and max_workers to 0 also creates a Single Node cluster.

If you configure an autoscaling cluster and set only min_workers to 0, then the cluster is not created as a Single Node cluster. The cluster has at least one active worker at all times until terminated.

An example cluster configuration to create a Single Node cluster in Delta Live Tables:

{
    "clusters": [
      {
        "num_workers": 0
      }
    ]
}

Configure cluster tags

You can use cluster tags to monitor usage for your pipeline clusters. Add cluster tags in the Delta Live Tables UI when you create or edit a pipeline, or by editing the JSON settings for your pipeline clusters.

Cloud storage configuration

To access a bucket in Google Cloud Storage (GCS), you must create a service account with access to that GCS bucket and add that service account to the cluster configurations. For more information about creating a Google Cloud service account, see Connect to Google Cloud Storage. You can add the service account configuration when you create or edit a pipeline with the Delta Live Tables API or in the Delta Live Tables UI:

  1. On the Pipeline details page for your pipeline, click the Settings button. The Pipeline settings page appears.

  2. Click the JSON button.

  3. Enter the service account configuration in the gcp_attributes.google_service_account field in the cluster configuration:

{
  "clusters": [
    {
      "gcp_attributes": {
        "google_service_account": "test-gcs-doc@databricks-dev.iam.gserviceaccount.com"
      }
    }
  ]
}

Parameterize pipelines

The Python and SQL code that defines your datasets can be parameterized by the pipeline’s settings. Parameterization enables the following use cases:

  • Separating long paths and other variables from your code.

  • Reducing the amount of data processed in development or staging environments to speed up testing.

  • Reusing the same transformation logic to process from multiple data sources.

The following example uses the startDate configuration value to limit the development pipeline to a subset of the input data:

CREATE OR REFRESH LIVE TABLE customer_events
AS SELECT * FROM sourceTable WHERE date > '${mypipeline.startDate}';
@dlt.table
def customer_events():
  start_date = spark.conf.get("mypipeline.startDate")
  return read("sourceTable").where(col("date") > start_date)
{
  "name": "Data Ingest - DEV",
  "configuration": {
    "mypipeline.startDate": "2021-01-02"
  }
}
{
  "name": "Data Ingest - PROD",
  "configuration": {
    "mypipeline.startDate": "2010-01-02"
  }
}

Pipelines trigger interval

You can use pipelines.trigger.interval to control the trigger interval for a flow updating a table or an entire pipeline. Because a triggered pipeline processes each table only once, the pipelines.trigger.interval is used only with continuous pipelines.

Databricks recommends setting pipelines.trigger.interval on individual tables because of different defaults for streaming versus batch queries. Set the value on a pipeline only when your processing requires controlling updates for the entire pipeline graph.

You set pipelines.trigger.interval on a table using spark_conf in Python, or SET in SQL:

@dlt.table(
  spark_conf={"pipelines.trigger.interval" : "10 seconds"}
)
def <function-name>():
    return (<query>)
SET pipelines.trigger.interval=10 seconds;

CREATE OR REFRESH LIVE TABLE TABLE_NAME
AS SELECT ...

To set pipelines.trigger.interval on a pipeline, add it to the configuration object in the pipeline settings:

{
  "configuration": {
    "pipelines.trigger.interval": "10 seconds"
  }
}

Allow non-admin users to view the driver logs from a Unity Catalog-enabled pipeline

By default, only the pipeline owner and workspace admins have permission to view the driver logs from the cluster that runs a Unity Catalog-enabled pipeline. You can enable access to the driver logs for any user with CAN MANAGE, CAN VIEW, or CAN RUN permissions by adding the following Spark configuration parameter to the configuration object in the pipeline settings:

{
  "configuration": {
    "spark.databricks.acl.needAdminPermissionToViewLogs": "false"
  }
}

Add email notifications for pipeline events

You can configure one or more email addresses to receive notifications when the following occurs:

  • A pipeline update completes successfully.

  • A pipeline update fails, either with a retryable or a non-retryable error. Select this option to receive a notification for all pipeline failures.

  • A pipeline update fails with a non-retryable (fatal) error. Select this option to receive a notification only when a non-retryable error occurs.

  • A single data flow fails.

To configure email notifications when you create or edit a pipeline:

  1. Click Add notification.

  2. Enter one or more email addresses to receive notifications.

  3. Click the check box for each notification type to send to the configured email addresses.

  4. Click Add notification.

Control tombstone management for SCD type 1 queries

The following settings can be used to control the behavior of tombstone management for DELETE events during SCD type 1 processing:

  • pipelines.applyChanges.tombstoneGCThresholdInSeconds: Set this value to match the highest expected interval, in seconds, between out-of-order data. The default is 172800 seconds (2 days).

  • pipelines.applyChanges.tombstoneGCFrequencyInSeconds: This setting controls how frequently, in seconds, tombstones are checked for cleanup. The default is 1800 seconds (30 minutes).

See Simplified change data capture with the APPLY CHANGES API in Delta Live Tables.