Compact data files with optimize on Delta Lake

See OPTIMIZE.

Delta Lake on Databricks can improve the speed of read queries from a table. One way to improve this speed is to coalesce small files into larger ones.

Syntax examples

You trigger compaction by running the OPTIMIZE command:

OPTIMIZE delta.`/data/events`
from delta.tables import *
deltaTable = DeltaTable.forPath(spark, "/data/events")
deltaTable.optimize().executeCompaction()
import io.delta.tables._
val deltaTable = DeltaTable.forPath(spark, "/data/events")
deltaTable.optimize().executeCompaction()

or, alternately:

OPTIMIZE events
from delta.tables import *
deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().executeCompaction()
import io.delta.tables._
val deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().executeCompaction()

If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition predicate using WHERE:

OPTIMIZE events WHERE date >= '2022-11-18'
from delta.tables import *
deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().where("date='2021-11-18'").executeCompaction()
import io.delta.tables._
val deltaTable = DeltaTable.forName(spark, "events")
deltaTable.optimize().where("date='2021-11-18'").executeCompaction()

Note

  • Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect.

  • Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of tuples per file. However, the two measures are most often correlated.

  • Python and Scala APIs for executing OPTIMIZE operation are available from Databricks Runtime 11.0 and above.

Readers of Delta tables use snapshot isolation, which means that they are not interrupted when OPTIMIZE removes unnecessary files from the transaction log. OPTIMIZE makes no data related changes to the table, so a read before and after an OPTIMIZE has the same results. Performing OPTIMIZE on a table that is a streaming source does not affect any current or future streams that treat this table as a source. OPTIMIZE returns the file statistics (min, max, total, and so on) for the files removed and the files added by the operation. Optimize stats also contains the Z-Ordering statistics, the number of batches, and partitions optimized.

You can also compact small files automatically using Auto optimize on Databricks.