Update Delta Lake table schema

Delta Lake lets you update the schema of a table. The following types of changes are supported:

  • Adding new columns (at arbitrary positions)

  • Reordering existing columns

  • Renaming existing columns

You can make these changes explicitly using DDL or implicitly using DML.

Important

When you update a Delta table schema, streams that read from that table terminate. If you want the stream to continue, you must restart it.

For recommended methods, see Production considerations for Structured Streaming.

Explicitly update schema to add columns

ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)

By default, added columns are nullable.
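
For example, a minimal sketch (run from Python with spark.sql, using the boxes table from the examples below; colC is an arbitrary new column) that adds a nullable top-level column directly after colA:

# Add a new nullable top-level column colC, positioned directly after colA
spark.sql("ALTER TABLE boxes ADD COLUMNS (colC STRING AFTER colA)")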

To add a column to a nested field, use:

ALTER TABLE table_name ADD COLUMNS (col_name.nested_col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)

For example, if the schema before running ALTER TABLE boxes ADD COLUMNS (colB.nested STRING AFTER field1) is:

- root
| - colA
| - colB
| +-field1
| +-field2

the schema after is:

- root
| - colA
| - colB
| +-field1
| +-nested
| +-field2

Note

Adding nested columns is supported only for structs. Arrays and maps are not supported.

Explicitly update schema to change column comment or ordering

ALTER TABLE table_name CHANGE [COLUMN] col_name col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name]

To change a column in a nested field, use:

ALTER TABLE table_name CHANGE [COLUMN] col_name.nested_col_name nested_col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name]

For example, if the schema before running ALTER TABLE boxes CHANGE COLUMN colB.field2 field2 STRING FIRST is:

- root
| - colA
| - colB
| +-field1
| +-field2

the schema after is:

- root
| - colA
| - colB
| +-field2
| +-field1

Explicitly update schema to replace columns

ALTER TABLE table_name REPLACE COLUMNS (col_name1 col_type1 [COMMENT col_comment1], ...)

For example, when running the following DDL:

ALTER TABLE boxes REPLACE COLUMNS (colC STRING, colB STRUCT<field2:STRING, nested:STRING, field1:STRING>, colA STRING)

if the schema before is:

- root
| - colA
| - colB
| +-field1
| +-field2

the schema after is:

- root
| - colC
| - colB
| +-field2
| +-nested
| +-field1
| - colA

Explicitly update schema to rename columns

Preview

This feature is in Public Preview.

Note

This feature is available in Databricks Runtime 10.2 and above.

To rename columns without rewriting any of the columns’ existing data, you must enable column mapping for the table. See Column mapping on Databricks.
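
Column mapping is enabled through Delta table properties. A minimal sketch in Python (assuming the table can be upgraded to the required reader and writer protocol versions; table_name is a placeholder):

# Enable column mapping by name so columns can be renamed (or dropped)
# without rewriting the underlying data files
spark.sql("""
  ALTER TABLE table_name SET TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
  )
""")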

To rename a column:

ALTER TABLE table_name RENAME COLUMN old_col_name TO new_col_name

To rename a nested field:

ALTER TABLE table_name RENAME COLUMN col_name.old_nested_field TO new_nested_field

For example, when you run the following command:

ALTER TABLE boxes RENAME COLUMN colB.field1 TO field001

If the schema before is:

- root
| - colA
| - colB
| +-field1
| +-field2

Then the schema after is:

- root
| - colA
| - colB
| +-field001
| +-field2

Explicitly update schema to drop columns

Preview

This feature is in Public Preview.

Note

This feature is available in Databricks Runtime 11.0 and above.

To drop columns as a metadata-only operation without rewriting any data files, you must enable column mapping for the table. See Column mapping on Databricks.

Important

Dropping a column from metadata does not delete the underlying data for the column in files. To purge the dropped column data, you can use REORG TABLE to rewrite files. You can then use VACUUM to physically delete the files that contain the dropped column data.
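
A minimal sketch of purging dropped column data in Python (table_name is a placeholder; VACUUM removes the old files only after they age past the retention period):

# Rewrite data files so they no longer contain the dropped column's data
spark.sql("REORG TABLE table_name APPLY (PURGE)")

# Physically delete the old files once they fall outside the retention window
spark.sql("VACUUM table_name")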

To drop a column:

ALTER TABLE table_name DROP COLUMN col_name

To drop multiple columns:

ALTER TABLE table_name DROP COLUMNS (col_name_1, col_name_2)

Explicitly update schema to change column type or name

You can change a column’s type or name or drop a column by rewriting the table. To do this, use the overwriteSchema option.

The following example shows changing a column type:

from pyspark.sql.functions import col

(spark.read.table(...)
  .withColumn("birthDate", col("birthDate").cast("date"))
  .write
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .saveAsTable(...)
)

The following example shows changing a column name:

(spark.read.table(...)
  .withColumnRenamed("dateOfBirth", "birthDate")
  .write
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .saveAsTable(...)
)

Add columns with automatic schema update

Columns that are present in the DataFrame but missing from the table are automatically added as part of a write transaction when:

  • write or writeStream have .option("mergeSchema", "true")

  • spark.databricks.delta.schema.autoMerge.enabled is true

When both options are specified, the option from the DataFrameWriter takes precedence. The added columns are appended to the end of the struct they are present in. Case is preserved when appending a new column.

Note

  • mergeSchema cannot be used with INSERT INTO or .write.insertInto().
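
The following is a minimal sketch of a schema-merging append (new_data and table_name are hypothetical names); any columns in new_data that are missing from the table are appended to the end of the table schema:

# Append a DataFrame whose schema contains columns missing from the table;
# mergeSchema adds those columns to the table schema during the write
(new_data.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .saveAsTable("table_name")
)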

Automatic schema evolution for Delta Lake merge

By default, updateAll and insertAll assign all columns in the target Delta table values from columns of the same name in the source dataset. Any columns in the source dataset that don't match columns in the target table are ignored. However, in some use cases, it is desirable to automatically add source columns to the target Delta table.

To automatically update the table schema during a merge operation that uses updateAll or insertAll (or both), set the Spark session configuration spark.databricks.delta.schema.autoMerge.enabled to true before running the merge operation.
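
A minimal sketch in Python (target_table and source_df are hypothetical names) that enables automatic schema evolution and then runs a merge with updateAll and insertAll:

from delta.tables import DeltaTable

# Enable automatic schema evolution for merges in this Spark session
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Merge: source columns missing from the target are added to the target schema
(DeltaTable.forName(spark, "target_table").alias("t")
  .merge(source_df.alias("s"), "t.key = s.key")
  .whenMatchedUpdateAll()
  .whenNotMatchedInsertAll()
  .execute()
)

The examples below use the equivalent Scala API.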

Note

  • Schema evolution occurs only when there is either an updateAll (UPDATE SET *) or an insertAll (INSERT *) action, or both.

  • update and insert actions cannot explicitly refer to target columns that do not already exist in the target table (even if updateAll or insertAll is one of the clauses). See the examples below.

Note

In Databricks Runtime 7.4 and below, merge supports schema evolution of only top-level columns, and not of nested columns.

Here are a few examples of the effects of merge operations with and without schema evolution.

Target columns: key, value
Source columns: key, value, newValue

Query (in Scala):

targetDeltaTable.alias("t")
  .merge(
    sourceDataFrame.alias("s"),
    "t.key = s.key")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()

Behavior without schema evolution (default): The table schema remains unchanged; only columns key, value are updated/inserted.

Behavior with schema evolution: The table schema is changed to (key, value, newValue). updateAll updates columns value and newValue, and insertAll inserts rows (key, value, newValue).

Target columns: key, oldValue
Source columns: key, newValue

Query (in Scala):

targetDeltaTable.alias("t")
  .merge(
    sourceDataFrame.alias("s"),
    "t.key = s.key")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()

Behavior without schema evolution (default): The updateAll and insertAll actions throw an error because the target column oldValue is not in the source.

Behavior with schema evolution: The table schema is changed to (key, oldValue, newValue). updateAll updates columns key and newValue, leaving oldValue unchanged, and insertAll inserts rows (key, NULL, newValue) (that is, oldValue is inserted as NULL).

Target columns: key, oldValue
Source columns: key, newValue

Query (in Scala):

targetDeltaTable.alias("t")
  .merge(
    sourceDataFrame.alias("s"),
    "t.key = s.key")
  .whenMatched().update(Map(
    "newValue" -> col("s.newValue")))
  .whenNotMatched().insertAll()
  .execute()

Behavior without schema evolution (default): update throws an error because column newValue does not exist in the target table.

Behavior with schema evolution: update still throws an error because column newValue does not exist in the target table.

Target columns: key, oldValue
Source columns: key, newValue

Query (in Scala):

targetDeltaTable.alias("t")
  .merge(
    sourceDataFrame.alias("s"),
    "t.key = s.key")
  .whenMatched().updateAll()
  .whenNotMatched().insert(Map(
    "key" -> col("s.key"),
    "newValue" -> col("s.newValue")))
  .execute()

Behavior without schema evolution (default): insert throws an error because column newValue does not exist in the target table.

Behavior with schema evolution: insert still throws an error because column newValue does not exist in the target table.

Automatic schema evolution for arrays of structs

Delta MERGE INTO supports resolving struct fields by name and evolving schemas for arrays of structs. With schema evolution enabled, target table schemas will evolve for arrays of structs, which also works with any nested structs inside of arrays.

Note

This feature is available in Databricks Runtime 9.1 and above. For Databricks Runtime 9.0 and below, implicit Spark casting is used for arrays of structs to resolve struct fields by position, and the effects of merge operations with and without schema evolution of structs in arrays are inconsistent with the behaviors of structs outside of arrays.

Here are a few examples of the effects of merge operations with and without schema evolution for arrays of structs.

Source schema: array<struct<b: string, a: string>>
Target schema: array<struct<a: int, b: int>>

Behavior without schema evolution (default): The table schema remains unchanged. Columns will be resolved by name and updated or inserted.

Behavior with schema evolution: The table schema remains unchanged. Columns will be resolved by name and updated or inserted.

Source schema: array<struct<a: int, c: string, d: string>>
Target schema: array<struct<a: string, b: string>>

Behavior without schema evolution (default): update and insert throw errors because c and d do not exist in the target table.

Behavior with schema evolution: The table schema is changed to array<struct<a: string, b: string, c: string, d: string>>. c and d are inserted as NULL for existing entries in the target table. update and insert fill entries in the source table with a casted to string and b as NULL.

Source schema: array<struct<a: string, b: struct<c: string, d: string>>>
Target schema: array<struct<a: string, b: struct<c: string>>>

Behavior without schema evolution (default): update and insert throw errors because d does not exist in the target table.

Behavior with schema evolution: The target table schema is changed to array<struct<a: string, b: struct<c: string, d: string>>>. d is inserted as NULL for existing entries in the target table.

Dealing with NullType columns in schema updates

Because Parquet doesn’t support NullType, NullType columns are dropped from the DataFrame when writing into Delta tables, but are still stored in the schema. When a different data type is received for that column, Delta Lake merges the schema to the new data type. If Delta Lake receives a NullType for an existing column, the old schema is retained and the new column is dropped during the write.

NullType is not supported in streaming. Because you must set schemas when using streaming, this should be very rare. NullType is also not accepted for complex types such as ArrayType and MapType.
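
A minimal sketch of how a NullType column can appear (hypothetical names); per the behavior described above, the extra column is dropped from the written data but kept in the table schema until a concrete type arrives:

from pyspark.sql.functions import lit

# lit(None) without a cast produces a NullType column
df = spark.range(3).withColumn("extra", lit(None))
df.write.format("delta").mode("append").saveAsTable("table_name")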

Replace table schema

By default, overwriting the data in a table does not overwrite the schema. When overwriting a table using mode("overwrite") without replaceWhere, you may still want to overwrite the schema of the data being written. You replace the schema and partitioning of the table by setting the overwriteSchema option to true:

(df.write
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .saveAsTable(...)
)