Data skipping information is collected automatically when you write data into a Delta table. Delta Lake on Databricks takes advantage of this information (minimum and maximum values) at query time to provide faster queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data. For best results, apply Z-Ordering.
In Databricks Runtime 13.3 and above, Databricks recommends using clustering for Delta table layout. Clustering is not compatible with Z-ordering. See Use liquid clustering for Delta tables.
By default, Delta Lake on Databricks collects statistics on the first 32 columns defined in your table schema. You can change this value using the table property delta.dataSkippingNumIndexedCols. Collecting statistics on more columns adds overhead as you write files.
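As a minimal sketch, the property can be set with an ALTER TABLE statement; the table name events and the value 5 are assumptions for illustration:

```sql
-- Hypothetical table name; collect statistics on the first 5 columns only
ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5');
```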
Collecting statistics on long strings is an expensive operation. To avoid collecting statistics on long strings, you can either configure the table property delta.dataSkippingNumIndexedCols to avoid columns containing long strings, or move columns containing long strings to a position greater than delta.dataSkippingNumIndexedCols using ALTER TABLE ALTER COLUMN. See ALTER TABLE.
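For example, a column reorder might look like the following sketch; the table and column names are assumptions, and note that reordering Delta table columns requires column mapping to be enabled:

```sql
-- Hypothetical schema: move a long-string column beyond the indexed range
-- (here assuming delta.dataSkippingNumIndexedCols = 5) so no statistics
-- are collected on it
ALTER TABLE events ALTER COLUMN long_description AFTER col5;
```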
For the purposes of collecting statistics, each field within a nested column is considered as an individual column.
Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the
ZORDER BY clause:
OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use ZORDER BY.
You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional column. Z-ordering on columns that do not have statistics collected on them would be ineffective and a waste of resources, because data skipping requires column-local statistics such as min, max, and count. You can configure statistics collection on certain columns by reordering columns in the schema, or you can increase the number of columns to collect statistics on.
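A multi-column sketch, listing the highest-cardinality predicate columns first; the table and column names are assumptions:

```sql
-- Hypothetical columns: both appear frequently in query predicates and
-- have statistics collected; effectiveness drops with each extra column
OPTIMIZE events ZORDER BY (eventType, userId)
```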
Z-ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-ordering is not guaranteed to reduce over multiple runs. However, if no new data was added to a partition that was just Z-ordered, another Z-ordering of that partition will not have any effect.
Z-ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there can be situations when that is not the case, leading to skew in optimize task times.
For example, if you ZORDER BY date and your most recent records are all much wider (for example, longer arrays or string values) than those in the past, it is expected that the OPTIMIZE job's task durations will be skewed, as will the resulting file sizes. This is, however, only a problem for the OPTIMIZE command itself; it should not have any negative impact on subsequent queries.