Use Delta Lake generated columns

Preview

This feature is in Public Preview.

This feature is available on Databricks Runtime 8.3 and above.

Delta Lake supports generated columns which are a special type of column whose values are automatically generated based on a user-specified function over other columns in the Delta table. When you write to a table with generated columns and you do not explicitly provide values for them, Delta Lake automatically computes the values. For example, you can automatically generate a date column (for partitioning the table by date) from the timestamp column; any writes into the table need only specify the data for the timestamp column. However, if you explicitly provide values for them, the values must satisfy the constraint (<value> <=> <generation expression>) IS TRUE or the write will fail with an error.

Important

Tables created with generated columns have a higher table writer protocol version than the default. See Table protocol versioning to understand table protocol versioning and what it means to have a higher version of a table protocol version.

The following example shows how to create a table with generated columns:

CREATE TABLE default.people10m (
  id INT,
  firstName STRING,
  middleName STRING,
  lastName STRING,
  gender STRING,
  birthDate TIMESTAMP,
  dateOfBirth DATE GENERATED ALWAYS AS (CAST(birthDate AS DATE)),
  ssn STRING,
  salary INT
)
DeltaTable.create(spark) \
  .tableName("default.people10m") \
  .addColumn("id", "INT") \
  .addColumn("firstName", "STRING") \
  .addColumn("middleName", "STRING") \
  .addColumn("lastName", "STRING", comment = "surname") \
  .addColumn("gender", "STRING") \
  .addColumn("birthDate", "TIMESTAMP") \
  .addColumn("dateOfBirth", DateType(), generatedAlwaysAs="CAST(birthDate AS DATE)") \
  .addColumn("ssn", "STRING") \
  .addColumn("salary", "INT") \
  .execute()
DeltaTable.create(spark)
  .tableName("default.people10m")
  .addColumn("id", "INT")
  .addColumn("firstName", "STRING")
  .addColumn("middleName", "STRING")
  .addColumn(
    DeltaTable.columnBuilder("lastName")
      .dataType("STRING")
      .comment("surname")
      .build())
  .addColumn("lastName", "STRING", comment = "surname")
  .addColumn("gender", "STRING")
  .addColumn("birthDate", "TIMESTAMP")
  .addColumn(
    DeltaTable.columnBuilder("dateOfBirth")
     .dataType(DateType)
     .generatedAlwaysAs("CAST(dateOfBirth AS DATE)")
     .build())
  .addColumn("ssn", "STRING")
  .addColumn("salary", "INT")
  .execute()

Generated columns are stored as if they were normal columns. That is, they occupy storage.

The following restrictions apply to generated columns:

  • A generation expression can use any SQL functions in Spark that always return the same result when given the same argument values, except the following types of functions:

    • User-defined functions.

    • Aggregate functions.

    • Window functions.

    • Functions returning multiple rows.

  • For Databricks Runtime 9.1 and above, MERGE operations support generated columns when you set spark.databricks.delta.schema.autoMerge.enabled to true.

In Databricks Runtime 8.4 and above with Photon support, Delta Lake can generate partition filters for a query whenever a partition column is defined by one of the following expressions:

  • CAST(col AS DATE) and the type of col is TIMESTAMP.

  • YEAR(col) and the type of col is TIMESTAMP.

  • Two partition columns defined by YEAR(col), MONTH(col) and the type of col is TIMESTAMP.

  • Three partition columns defined by YEAR(col), MONTH(col), DAY(col) and the type of col is TIMESTAMP.

  • Four partition columns defined by YEAR(col), MONTH(col), DAY(col), HOUR(col) and the type of col is TIMESTAMP.

  • SUBSTRING(col, pos, len) and the type of col is STRING

  • DATE_FORMAT(col, format) and the type of col is TIMESTAMP.

If a partition column is defined by one of the preceding expressions, and a query filters data using the underlying base column of a generation expression, Delta Lake looks at the relationship between the base column and the generated column, and populates partition filters based on the generated partition column if possible. For example, given the following table:

CREATE TABLE events(
eventId BIGINT,
data STRING,
eventType STRING,
eventTime TIMESTAMP,
eventDate date GENERATED ALWAYS AS (CAST(eventTime AS DATE))
)
PARTITIONED BY (eventType, eventDate)

If you then run the following query:

SELECT * FROM events
WHERE eventTime >= "2020-10-01 00:00:00" <= "2020-10-01 12:00:00"

Delta Lake automatically generates a partition filter so that the preceding query only reads the data in partition date=2020-10-01 even if a partition filter is not specified.

As another example, given the following table:

CREATE TABLE events(
eventId BIGINT,
data STRING,
eventType STRING,
eventTime TIMESTAMP,
year INT GENERATED ALWAYS AS (YEAR(eventTime)),
month INT GENERATED ALWAYS AS (MONTH(eventTime)),
day INT GENERATED ALWAYS AS (DAY(eventTime))
)
PARTITIONED BY (eventType, year, month, day)

If you then run the following query:

SELECT * FROM events
WHERE eventTime >= "2020-10-01 00:00:00" <= "2020-10-01 12:00:00"

Delta Lake automatically generates a partition filter so that the preceding query only reads the data in partition year=2020/month=10/day=01 even if a partition filter is not specified.

You can use an EXPLAIN clause and check the provided plan to see whether Delta Lake automatically generates any partition filters.