Query optimization using primary key constraints

Primary key constraints, which capture relationships between fields in tables, can help users and tools understand relationships in your data. This article contains examples that show how you can use primary keys with the RELY option to optimize some common types of queries.

Add primary key constraints

You can add a primary key constraint in your table creation statement, as in the following example, or add a constraint to a table using the ADD CONSTRAINT clause.

CREATE TABLE customer (
  c_customer_sk int,
  PRIMARY KEY (c_customer_sk)
  ...
  )

In this example, c_customer_sk is the customer ID key. The primary key constraint specifies that each customer ID value should be unique in the table. Databricks does not enforce key constraints. They can be validated through your existing data pipeline or ETL. See Manage data quality with pipeline expectations to learn about working expectations on streaming tables and materialized views. See Constraints on Databricks to learn about working with constraints on Delta tables.

Note

It is the user’s responsibility to check whether a constraint is satisfied. Relying on a constraint that is not satisfied can lead to incorrect query results.

Use RELY to enable optimizations

When you know that a primary key constraint is valid, you can enable optimizations based on the constraint by specifying it with the RELY option. See ADD CONSTRAINT clause for the complete syntax.

The RELY option allows Databricks to exploit the constraint to rewrite queries. The following optimizations can only be performed if the RELY option is specified in an ADD CONSTRAINT clause or ALTER TABLE statement.

Using ALTER TABLE, you can modify a table’s primary key to include the RELY option, as shown in the following example.

ALTER TABLE
  customer DROP PRIMARY KEY;
ALTER TABLE
  customer
ADD
  PRIMARY KEY (c_customer_sk) RELY;

Optimization examples

The following examples extend the previous example that creates a customer table where c_customer_sk is a verified unique identifier named as a PRIMARY KEY with the RELY option specified.

Example 1: Eliminate unnecessary aggregations

The following shows a query that applies a DISTINCT operation to a primary key.

SELECT
  DISTINCT c_customer_sk
FROM
  customer;

Because the c_customer_sk column is a verified PRIMARY KEY constraint, all values in the column are unique. With the RELY option is specified, Databricks can optimize the query by not performing the DISTINCT operation.

Example 2: Eliminate unnecessary joins

The following example shows a query where Databricks can eliminate an unnecessary join.

The query joins a fact table, store_sales with a dimension table, customer. It performs a left outer join, so the query result includes all records from the store_sales table and matched records from the customer table. If there is no matching record in the customer table, the query result shows a NULL value for the c_customer_sk column.

SELECT
  SUM(ss_quantity)
FROM
  store_sales ss
  LEFT JOIN customer c ON ss.customer_sk = c.c_customer_sk;

To understand why this join is unnecessary, consider the query statement. It requires only the ss_quantity column from the store_sales table. The customer table is joined on its primary key, so each row of store_sales matches at most one row in customer. Because the operation is an outer join, all records from the store_sales table are preserved, so the join does not change any data from that table. The SUM aggregation is the same whether or not these tables are joined.

Using the primary key constraint with RELY gives the query optimizer the information it needs to eliminate the join. The optimized query looks more like this:

SELECT
  SUM(ss_quantity)
FROM
  store_sales ss

Next steps

See View the Entity Relationship Diagram to learn how to explore primary key and foreign key relationships in the Catalog Explorer UI.