Query optimization using primary key constraints
Primary key constraints, which capture relationships between fields in tables, can help users and tools understand relationships in your data. This article contains examples that show how you can use primary keys with the RELY
option to optimize some common types of queries.
Add primary key constraints
You can add a primary key constraint in your table creation statement, as in the following example, or add a constraint to a table using the ADD CONSTRAINT
clause.
CREATE TABLE customer (
c_customer_sk int,
PRIMARY KEY (c_customer_sk)
...
)
In this example, c_customer_sk
is the customer ID key. The primary key constraint specifies that each customer ID value should be unique in the table. Databricks does not enforce key constraints. They can be validated through your existing data pipeline or ETL. See Manage data quality with pipeline expectations to learn about working expectations on streaming tables and materialized views. See Constraints on Databricks to learn about working with constraints on Delta tables.
Note
It is the user’s responsibility to check whether a constraint is satisfied. Relying on a constraint that is not satisfied can lead to incorrect query results.
Use RELY
to enable optimizations
When you know that a primary key constraint is valid, you can enable optimizations based on the constraint by specifying it with the RELY
option. See ADD CONSTRAINT clause for the complete syntax.
The RELY
option allows Databricks to exploit the constraint to rewrite queries. The following optimizations can only be performed if the RELY
option is specified in an ADD CONSTRAINT
clause or ALTER TABLE
statement.
Using ALTER TABLE
, you can modify a table’s primary key to include the RELY
option, as shown in the following example.
ALTER TABLE
customer DROP PRIMARY KEY;
ALTER TABLE
customer
ADD
PRIMARY KEY (c_customer_sk) RELY;
Optimization examples
The following examples extend the previous example that creates a customer
table where c_customer_sk
is a verified unique identifier named as a PRIMARY KEY
with the RELY
option specified.
Example 1: Eliminate unnecessary aggregations
The following shows a query that applies a DISTINCT
operation to a primary key.
SELECT
DISTINCT c_customer_sk
FROM
customer;
Because the c_customer_sk
column is a verified PRIMARY KEY
constraint, all values in the column are unique. With the RELY
option is specified, Databricks can optimize the query by not performing the DISTINCT
operation.
Example 2: Eliminate unnecessary joins
The following example shows a query where Databricks can eliminate an unnecessary join.
The query joins a fact table, store_sales
with a dimension table, customer
. It performs a left outer join, so the query result includes all records from the store_sales
table and matched records from the customer
table. If there is no matching record in the customer
table, the query result shows a NULL
value for the c_customer_sk
column.
SELECT
SUM(ss_quantity)
FROM
store_sales ss
LEFT JOIN customer c ON ss.customer_sk = c.c_customer_sk;
To understand why this join is unnecessary, consider the query statement. It requires only the ss_quantity
column from the store_sales
table. The customer
table is joined on its primary key, so each row of store_sales
matches at most one row in customer
. Because the operation is an outer join, all records from the store_sales
table are preserved, so the join does not change any data from that table. The SUM
aggregation is the same whether or not these tables are joined.
Using the primary key constraint with RELY
gives the query optimizer the information it needs to eliminate the join. The optimized query looks more like this:
SELECT
SUM(ss_quantity)
FROM
store_sales ss
Next steps
See View the Entity Relationship Diagram to learn how to explore primary key and foreign key relationships in the Catalog Explorer UI.