Partition discovery for external tables

This article describes the default partition discovery strategy for Unity Catalog external tables and an optional setting to enable a partition metadata log that makes partition discovery consistent with Hive metastore.

Databricks recommends enabling partition metadata logging for improved read speeds and query performance for Unity Catalog external tables with partitions.

What is the default partition discovery strategy for Unity Catalog?

By default, Unity Catalog recursively lists all directories in the table location to automatically discover partitions. For large tables with many partition directories, this can increase latency for many table operations.

Use partition metadata logging

Preview

This feature is in Public Preview.

In Databricks Runtime 13.3 LTS and above, you can optionally enable partition metadata logging, which is a partition discovery strategy for external tables registered to Unity Catalog. This behavior is consistent with the partition discovery strategy used in Hive metastore. This behavior only impacts Unity Catalog external tables that have partitions and use Parquet, ORC, CSV, or JSON. Databricks recommends enabling the new behavior for improved read speeds and query performance for these tables.

Important

Tables with partition metadata logging enabled demonstrate a behavioral change for partition discovery. Instead of automatically scanning the table location for partitions, Unity Catalog only respects partitions registered in the partition metadata. See Manually add, drop, or repair partition metadata.

This behavior will become the default in a future Databricks Runtime version. Tables with this feature enabled can only be read or written using Databricks Runtime 13.3 LTS and above.

Note

You must try to query a table with Databricks Runtime 12.2 LTS or below to confirm it does not use the new partition log behavior.

Enable partition metadata logging

To enable partition metadata logging on a table, you must enable a Spark conf for your current SparkSession and then create an external table. This setting is only required in the SparkSession that creates the table. Once you create a table with partition metadata logging enabled, it persists this setting as part of the table metadata and uses the feature in all subsequent workloads.

The following syntax demonstrates using SQL to set a Spark conf in a notebook. You can also set Spark configurations when configuring compute.

SET spark.databricks.nonDelta.partitionLog.enabled = true;

Important

You can only read and write tables with partition metadata logging enabled in Databricks Runtime 13.3 LTS and above. If you have workloads that run on Databricks Runtime 12.2 LTS or below that must interact with tables, do not use this setting.

External tables do not delete underlying data files when you drop them. Databricks recommends using CREATE OR REPLACE syntax to upgrade tables to use partition metadata logging, as in the following example:

CREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>
USING <format>
PARTITIONED BY (<partition-column-list>)
LOCATION 'gs://<bucket-path>/<table-directory>';

Unity Catalog enforces rules on path overlaps for tables and volumes. You cannot register a new Unity Catalog table on a collection of data files if a table already exists in that location.

Work with tables with partition metadata

Databricks recommends using table names in all reads and writes against all tables registered to Unity Catalog. For tables with partition metadata, this guarantees that new partitions added to a table register to Unity Catalog and that queries against the table read all registered partitions.

Using path-based patterns for reads or writes can result in partitions being ignored or not registered to the Unity Catalog metastore. See Limitations.

List partitions

Use the following command to show all partitions registered to Unity Catalog as partition metadata:

SHOW PARTITIONS <table-name>

To check if a single partition is registered to Unity Catalog, use the following command:

SHOW PARTITIONS <table-name>
PARTITION (<partition-column-name> = <partition-column-value>)

Manually add, drop, or repair partition metadata

Unity Catalog requires that all partitions for external tables are contained within the directory registered using the LOCATION clause during table registration.

With partition metadata enabled, automatic discovery of partitions in the table location is disabled. If external systems write data to the table location or you use path-based writes to add or overwrite records in your table, you must manually repair the partition metadata.

Databricks uses Hive-style partitioning for storing tables backed by Parquet, ORC, CSV, and JSON. Hive-style partitions contain key-value pairs connected by an equal-sign in the partition directory, for example year=2021/month=01/.

If your table uses Hive-style partitioning, you can use MSCK REPAIR to sync partition metadata in Unity Catalog with partitions that exist in the table location. The following syntax examples demonstrate common operations:

-- Add and remove parition metadata to match directories in table location
MSCK REPAIR TABLE <table_name>  SYNC PARTITIONS;

-- Add partitions in the table location that are not registered as partition metadata
MSCK REPAIR TABLE <table_name> ADD PARTITIONS;

-- Drop partitions registered as partition metadata that are not in the table location
MSCK REPAIR TABLE <table_name> DROP PARTITIONS;

See REPAIR TABLE.

Manually specify paths for other partition types

If your table does not use Hive-style partitioning, you must manually specify partition locations when adding partitions. Manually specifying partitions can also reduce latency compared to MSCK REPAIR syntax, especially for tables with a large number of partitions. The following syntax example shows adding a partition:

ALTER TABLE <table-name>
ADD PARTITION (<partition-column-name> = <partition-column-value>)
LOCATION 'gs://<bucket-path>/<table-directory>/<partition-directory>';

You can also use ALTER TABLE syntax to drop, rename, recover, and set locations for partitions. See ALTER TABLE … PARTITION.

Disable new partition metadata

The Spark conf that controls whether new tables use partition metadata is disabled by default. You can also explicitly disable this behavior. The following syntax uses SQL to disable the Spark conf:

SET spark.databricks.nonDelta.partitionLog.enabled = false;

This only controls whether or not tables created in the SparkSession use partition metadata. To disable partition metadata on a table that uses the behavior, you must drop and recreate the table in a SparkSession that does not have the Spark conf enabled.

Note

While you cannot read or write to tables with partition metadata enabled in Databricks Runtime 12.2 LTS or below, you can run DROP or CREATE OR REPLACE TABLE statements against these tables if you have sufficient privileges in Unity Catalog.

Limitations

The following limitations exist:

  • You cannot read or write to tables with partition metadata enabled using Databricks Runtime 12.2 LTS or below.

  • Reading a table using the directory path returns all partitions including any that have been manually added or dropped.

  • If you insert or overwrite records in a table using a path instead of a table name, the partition metadata is not recorded.

  • Avro file format is not supported.