Delta Lake generated columns

Preview

Delta Lake supports generated columns which are a special type of column whose values are automatically generated based on a user-specified function over other columns in the Delta table. When you write to a table with generated columns and you do not explicitly provide values for them, Delta Lake automatically computes the values. For example, you can automatically generate a date column (for partitioning the table by date) from the timestamp column; any writes into the table need only specify the data for the timestamp column. However, if you explicitly provide values for them, the values must satisfy the constraint (<value> <=> <generation expression>) IS TRUE or the write will fail with an error.

Important

Tables created with generated columns have a higher table writer protocol version than the default. See How does Databricks manage Delta Lake feature compatibility? to understand table protocol versioning and what it means to have a higher version of a table protocol version.

Create a table with generated columns

The following example shows how to create a table with generated columns:

CREATE TABLE default.people10m (
  id INT,
  firstName STRING,
  middleName STRING,
  lastName STRING,
  gender STRING,
  birthDate TIMESTAMP,
  dateOfBirth DATE GENERATED ALWAYS AS (CAST(birthDate AS DATE)),
  ssn STRING,
  salary INT
)

DeltaTable.create(spark) \
  .tableName("default.people10m") \
  .addColumn("id", "INT") \
  .addColumn("firstName", "STRING") \
  .addColumn("middleName", "STRING") \
  .addColumn("lastName", "STRING", comment = "surname") \
  .addColumn("gender", "STRING") \
  .addColumn("birthDate", "TIMESTAMP") \
  .addColumn("dateOfBirth", DateType(), generatedAlwaysAs="CAST(birthDate AS DATE)") \
  .addColumn("ssn", "STRING") \
  .addColumn("salary", "INT") \
  .execute()

DeltaTable.create(spark)
  .tableName("default.people10m")
  .addColumn("id", "INT")
  .addColumn("firstName", "STRING")
  .addColumn("middleName", "STRING")
  .addColumn(
    DeltaTable.columnBuilder("lastName")
      .dataType("STRING")
      .comment("surname")
      .build())
  .addColumn("lastName", "STRING", comment = "surname")
  .addColumn("gender", "STRING")
  .addColumn("birthDate", "TIMESTAMP")
  .addColumn(
    DeltaTable.columnBuilder("dateOfBirth")
     .dataType(DateType)
     .generatedAlwaysAs("CAST(dateOfBirth AS DATE)")
     .build())
  .addColumn("ssn", "STRING")
  .addColumn("salary", "INT")
  .execute()

Generated columns are stored as if they were normal columns. That is, they occupy storage.

The following restrictions apply to generated columns:

A generation expression can use any SQL functions in Spark that always return the same result when given the same argument values, except the following types of functions:
- User-defined functions.
- Aggregate functions.
- Window functions.
- Functions returning multiple rows.

Delta Lake can generate partition filters for a query whenever a partition column is defined by one of the following expressions:

Note

Photon is required in Databricks Runtime 10.4 LTS and below. Photon is not required in Databricks Runtime 11.3 LTS and above.

CAST(col AS DATE) and the type of col is TIMESTAMP.
YEAR(col) and the type of col is TIMESTAMP.
Two partition columns defined by YEAR(col), MONTH(col) and the type of col is TIMESTAMP.
Three partition columns defined by YEAR(col), MONTH(col), DAY(col) and the type of col is TIMESTAMP.
Four partition columns defined by YEAR(col), MONTH(col), DAY(col), HOUR(col) and the type of col is TIMESTAMP.
SUBSTRING(col, pos, len) and the type of col is STRING
DATE_FORMAT(col, format) and the type of col is TIMESTAMP.
- You can only use date formats with the following patterns: yyyy-MM and yyyy-MM-dd-HH.
- In Databricks Runtime 10.4 LTS and above, you can also use the following pattern: yyyy-MM-dd.

If a partition column is defined by one of the preceding expressions, and a query filters data using the underlying base column of a generation expression, Delta Lake looks at the relationship between the base column and the generated column, and populates partition filters based on the generated partition column if possible. For example, given the following table:

CREATE TABLE events(
eventId BIGINT,
data STRING,
eventType STRING,
eventTime TIMESTAMP,
eventDate date GENERATED ALWAYS AS (CAST(eventTime AS DATE))
)
PARTITIONED BY (eventType, eventDate)

If you then run the following query:

SELECT * FROM events
WHERE eventTime >= "2020-10-01 00:00:00" <= "2020-10-01 12:00:00"

Delta Lake automatically generates a partition filter so that the preceding query only reads the data in partition date=2020-10-01 even if a partition filter is not specified.

As another example, given the following table:

CREATE TABLE events(
eventId BIGINT,
data STRING,
eventType STRING,
eventTime TIMESTAMP,
year INT GENERATED ALWAYS AS (YEAR(eventTime)),
month INT GENERATED ALWAYS AS (MONTH(eventTime)),
day INT GENERATED ALWAYS AS (DAY(eventTime))
)
PARTITIONED BY (eventType, year, month, day)

If you then run the following query:

SELECT * FROM events
WHERE eventTime >= "2020-10-01 00:00:00" <= "2020-10-01 12:00:00"

Delta Lake automatically generates a partition filter so that the preceding query only reads the data in partition year=2020/month=10/day=01 even if a partition filter is not specified.

You can use an EXPLAIN clause and check the provided plan to see whether Delta Lake automatically generates any partition filters.

Use identity columns in Delta Lake

Important

Declaring an identity column on a Delta table disables concurrent transactions. Only use identity columns in use cases where concurrent writes to the target table are not required.

Delta Lake identity columns are a type of generated column that assigns unique values for each record inserted into a table. The following example shows the basic syntax for declaring an identity column during a create table statement:

CREATE TABLE table_name (
  id_col1 BIGINT GENERATED ALWAYS AS IDENTITY,
  id_col2 BIGINT GENERATED ALWAYS AS IDENTITY (START WITH -1 INCREMENT BY 1),
  id_col3 BIGINT GENERATED BY DEFAULT AS IDENTITY,
  id_col4 BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH -1 INCREMENT BY 1)
 )

from delta.tables import DeltaTable, IdentityGenerator
from pyspark.sql.types import LongType

DeltaTable.create()
  .tableName("table_name")
  .addColumn("id_col1", dataType=LongType(), generatedAlwaysAs=IdentityGenerator())
  .addColumn("id_col2", dataType=LongType(), generatedAlwaysAs=IdentityGenerator(start=-1, step=1))
  .addColumn("id_col3", dataType=LongType(), generatedByDefaultAs=IdentityGenerator())
  .addColumn("id_col4", dataType=LongType(), generatedByDefaultAs=IdentityGenerator(start=-1, step=1))
  .execute()

import io.delta.tables.DeltaTable
import org.apache.spark.sql.types.LongType

DeltaTable.create(spark)
  .tableName("table_name")
  .addColumn(
    DeltaTable.columnBuilder(spark, "id_col1")
      .dataType(LongType)
      .generatedAlwaysAsIdentity().build())
  .addColumn(
    DeltaTable.columnBuilder(spark, "id_col2")
      .dataType(LongType)
      .generatedAlwaysAsIdentity(start = -1L, step = 1L).build())
  .addColumn(
    DeltaTable.columnBuilder(spark, "id_col3")
      .dataType(LongType)
      .generatedByDefaultAsIdentity().build())
  .addColumn(
    DeltaTable.columnBuilder(spark, "id_col4")
      .dataType(LongType)
      .generatedByDefaultAsIdentity(start = -1L, step = 1L).build())
  .execute()

Note

Scala and Python APIs for identity columns are available in Databricks Runtime 16.0 and above.

To see all SQL syntax options for creating tables with identity columns, see CREATE TABLE [USING].

You can optionally specify the following:

A starting value.
A step size, which can be positive or negative.

Both the starting value and step size default to 1. You cannot specify a step size of 0.

Values assigned by identity columns are unique and increment in the direction of the specified step, and in multiples of the specified step size, but are not guaranteed to be contiguous. For example, with a starting value of 0 and a step size of 2, all values are positive even numbers but some even numbers might be skipped.

When using the clause GENERATED BY DEFAULT AS IDENTITY, insert operations can specify values for the identity column. Modify the clause to be GENERATED ALWAYS AS IDENTITY to override the ability to manually set values.

Identity columns only support the BIGINT type, and operations fail if the assigned value exceeds the range supported by BIGINT.

To learn about syncing identity column values with data, see ALTER TABLE … COLUMN clause.

CTAS and identity columns

You cannot define schema, identity column constraints, or any other table specifications when using a CREATE TABLE table_name AS SELECT (CTAS) statement.

To create a new table with an identity column and populate it with existing data, do the following:

Create a table with the correct schema, including the identity column definition and other table properties.
Run an INSERT operation.

The following example uses the DEFAULT keyword to define the identity column. If data inserted into the table includes valid values for the identity column, these values are used.

CREATE OR REPLACE TABLE new_table (
  id BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 5),
  event_date DATE,
  some_value BIGINT
);

-- Inserts records including existing IDs
INSERT INTO new_table
SELECT id, event_date, some_value FROM old_table;

-- Insert records and generate new IDs
INSERT INTO new_table
SELECT event_date, some_value FROM new_records;

Identity column limitations

The following limitations exist when working with identity columns:

Concurrent transactions are not supported on tables with identity columns enabled.
You cannot partition a table by an identity column.
You cannot use ALTER TABLE to ADD, REPLACE, or CHANGE an identity column.
You cannot update the value of an identity column for an existing record.

Note

To change the IDENTITY value for an existing record, you must delete the record and INSERT it as a new record.