Tutorial: Declare a data pipeline with SQL in Delta Live Tables

This tutorial shows you how to use SQL syntax to declare a data pipeline with Delta Live Tables. Databricks recommends Delta Live Tables with SQL as the preferred way for SQL users to build new ETL, ingestion, and transformation pipelines on Databricks. SQL syntax for Delta Live Tables extends standard Spark SQL with many new keywords, constructs, and table-valued functions. These additions to standard SQL allow users to declare dependencies between datasets and deploy production-grade infrastructure without needing to learn any new tooling or additional concepts.

For users familiar with Spark DataFrames who want extensive testing and support for metaprogramming operations, Databricks recommends using Python for Delta Live Tables. See Tutorial: Declare a data pipeline with Python in Delta Live Tables.

Note

  • You cannot mix languages within a Delta Live Tables source file. You can use multiple notebooks or files with different languages in a pipeline.

  • To use the code in this example, select Hive metastore as the storage option when you create the pipeline. Because this example reads data from DBFS, you cannot run this example with a pipeline configured to use Unity Catalog as the storage option.

Where do you run Delta Live Tables SQL queries?

You must add your SQL files to a pipeline configuration to process query logic. To learn about executing logic defined in Delta Live Tables, see Tutorial: Run your first Delta Live Tables pipeline.

While you can use notebooks or SQL files to write Delta Live Tables SQL queries, Delta Live Tables is not designed to run interactively in notebook cells. Executing a cell that contains Delta Live Tables syntax in a Databricks notebook returns a message about whether the query is syntactically valid, but does not run query logic.

Declare a Delta Live Tables pipeline with SQL

This tutorial uses SQL syntax to declare a Delta Live Tables pipeline on a dataset containing Wikipedia clickstream data to:

  • Read the raw JSON clickstream data into a table.

  • Read the records from the raw data table and use Delta Live Tables expectations to create a new table that contains cleansed data.

  • Use the records from the cleansed data table to create Delta Live Tables queries that build derived datasets.

This code demonstrates a simplified example of the medallion architecture. See What is the medallion lakehouse architecture?.

Copy the SQL code and paste it into a new notebook. You can add the example code to a single cell of the notebook or multiple cells. To review options for creating notebooks, see Create a notebook.

Create a table from files in object storage

Delta Live Tables supports loading data from all formats supported by Databricks. See Data format options.

All Delta Live Tables SQL statements use CREATE OR REFRESH syntax and semantics. When you update a pipeline, Delta Live Tables determines whether the logically correct result for the table can be accomplished through incremental processing or if full recomputation is required.

The following example creates a table by loading data from JSON files stored in object storage:

CREATE OR REFRESH LIVE TABLE clickstream_raw
COMMENT "The raw Wikipedia clickstream dataset, ingested from /databricks-datasets."
AS SELECT * FROM json.`/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json`;
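The same file-path query pattern applies to other supported formats. As an illustrative sketch only (the Parquet path and table name below are hypothetical, not part of the sample dataset used in this tutorial):

```sql
-- Hypothetical example: this Parquet path is an assumption for illustration,
-- not part of the /databricks-datasets sample used in this tutorial.
CREATE OR REFRESH LIVE TABLE sales_raw
COMMENT "Raw sales records ingested from Parquet files in object storage."
AS SELECT * FROM parquet.`/path/to/sales-data/`;
```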

Add a table from an upstream dataset to the pipeline

You can use the live virtual schema to query data from other datasets declared in your current Delta Live Tables pipeline. Declaring new tables in this way creates a dependency that Delta Live Tables automatically resolves before executing updates. The live schema is a custom keyword implemented in Delta Live Tables that can be substituted for a target schema if you wish to publish your datasets. See Publish data from Delta Live Tables pipelines to the Hive metastore.

The following code also includes examples of monitoring and enforcing data quality with expectations. See Manage data quality with Delta Live Tables.

CREATE OR REFRESH LIVE TABLE clickstream_prepared(
  CONSTRAINT valid_current_page EXPECT (current_page_title IS NOT NULL),
  CONSTRAINT valid_count EXPECT (click_count > 0) ON VIOLATION FAIL UPDATE
)
COMMENT "Wikipedia clickstream data cleaned and prepared for analysis."
AS SELECT
  curr_title AS current_page_title,
  CAST(n AS INT) AS click_count,
  prev_title AS previous_page_title
FROM live.clickstream_raw;
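Expectations can also drop records that fail validation instead of retaining them (the default) or failing the update. A minimal sketch, where the table name and constraint are illustrative additions rather than part of this tutorial's pipeline:

```sql
-- Illustrative only: ON VIOLATION DROP ROW removes rows with a NULL
-- previous page instead of retaining them or failing the update.
CREATE OR REFRESH LIVE TABLE clickstream_with_referrer(
  CONSTRAINT valid_referrer EXPECT (previous_page_title IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "Clickstream records that have a known referring page."
AS SELECT * FROM live.clickstream_prepared;
```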

Create an enriched data view

Because Delta Live Tables processes updates to pipelines as a series of dependency graphs, you can declare highly enriched views that power dashboards, BI, and analytics by declaring tables with specific business logic.

Live tables are equivalent conceptually to materialized views. Whereas traditional views on Spark execute logic each time the view is queried, live tables store the most recent version of query results in data files. Because Delta Live Tables manages updates for all datasets in a pipeline, you can schedule pipeline updates to match latency requirements for materialized views and know that queries against these tables contain the most recent version of data available.

The following code creates an enriched materialized view of upstream data:

CREATE OR REFRESH LIVE TABLE top_spark_referers
COMMENT "A table containing the top pages linking to the Apache Spark page."
AS SELECT
  previous_page_title AS referrer,
  click_count
FROM live.clickstream_prepared
WHERE current_page_title = 'Apache_Spark'
ORDER BY click_count DESC
LIMIT 10;
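For intermediate results that should not be persisted as tables or published outside the pipeline, Delta Live Tables SQL also supports temporary views. A hedged sketch (the view name and query are illustrative, not part of this tutorial's pipeline):

```sql
-- Illustrative only: a temporary live view computes results for other
-- datasets in the pipeline without publishing a table.
CREATE TEMPORARY LIVE VIEW spark_referrers
COMMENT "Intermediate view of all pages linking to the Apache Spark page."
AS SELECT
  previous_page_title AS referrer,
  click_count
FROM live.clickstream_prepared
WHERE current_page_title = 'Apache_Spark';
```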

Next steps

To learn more, see Delta Live Tables SQL language reference.