Remove unused data files with vacuum

You can remove files that are no longer referenced by a Delta table and are older than the retention threshold by running the vacuum command on the table. vacuum is not triggered automatically. The default retention threshold for the files is 7 days. To change this behavior, see Configure data retention for time travel.
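As a minimal sketch of changing the threshold for a single table, the statement below sets the delta.deletedFileRetentionDuration table property. It assumes this property is the one that governs the data file retention threshold for the table, and the 14-day value is purely illustrative. eventsTable is the table name used in the vacuum examples on this page.

```sql
-- Assumption: delta.deletedFileRetentionDuration controls how long
-- unreferenced data files are kept before vacuum is allowed to delete them.
-- The 14-day interval is an illustrative value, not a recommendation.
ALTER TABLE eventsTable
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 14 days');
```

A longer interval keeps more history available for time travel at the cost of more stored data files.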

Important

  • vacuum removes all files from directories not managed by Delta Lake, ignoring directories beginning with _. If you are storing additional metadata like Structured Streaming checkpoints within a Delta table directory, use a directory name such as _checkpoints.

  • vacuum deletes only data files, not log files. Log files are deleted automatically and asynchronously after checkpoint operations. The default retention period of log files is 30 days. You can change it through the delta.logRetentionDuration table property, which you set with the ALTER TABLE SET TBLPROPERTIES SQL statement. See Delta table properties reference.

  • The ability to time travel back to a version older than the retention period is lost after running vacuum.
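The log retention setting mentioned in the list above could be applied roughly as follows. This is a sketch only: the 60-day interval is an assumed example value, and eventsTable is the table name used in the vacuum examples on this page.

```sql
-- Keep transaction log files for 60 days instead of the default 30.
-- The 60-day interval is an illustrative assumption.
ALTER TABLE eventsTable
SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 60 days');

-- Check the value currently set on the table.
SHOW TBLPROPERTIES eventsTable ('delta.logRetentionDuration');
```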

Note

When disk caching is enabled, a cluster might contain data from Parquet files that have been deleted with vacuum. Therefore, it may be possible to query the data of previous table versions whose files have been deleted. Restarting the cluster will remove the cached data. See Configure the disk cache.

Example syntax for vacuum

VACUUM eventsTable   -- vacuum files that are no longer referenced and are older than the default retention period

VACUUM '/data/events' -- vacuum files in path-based table

VACUUM delta.`/data/events/`

VACUUM delta.`/data/events/` RETAIN 100 HOURS  -- vacuum files that are no longer referenced and are more than 100 hours old

VACUUM eventsTable DRY RUN    -- do dry run to get the list of files to be deleted

For Spark SQL syntax details, see VACUUM.

See the Delta Lake API documentation for Scala, Java, and Python syntax details.

How frequently should you run vacuum?

Databricks recommends regularly running VACUUM on all tables to reduce excess cloud data storage costs. The default retention threshold for vacuum is 7 days. Setting a higher threshold gives you access to a greater history for your table, but increases the number of data files stored and, as a result, incurs greater storage costs from your cloud provider.

Why can’t you vacuum a Delta table with a low retention threshold?

Warning

Databricks recommends a retention interval of at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If VACUUM cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when VACUUM deletes files that have not yet been committed. You must choose an interval that is longer than both the longest-running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.

Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
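If you decide to turn the check off, a cautious sequence might look like the sketch below. It assumes the property is set at the session level with SET, and the 24-hour interval is only an example of a threshold below the default. The dry run lets you inspect the affected files before anything is deleted.

```sql
-- Turn off the retention safety check for the current session.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- List the files that would be deleted, without deleting anything.
VACUUM eventsTable RETAIN 24 HOURS DRY RUN;

-- Delete the files once you are sure no reader, writer, or stream
-- still depends on versions older than 24 hours.
VACUUM eventsTable RETAIN 24 HOURS;

-- Turn the safety check back on.
SET spark.databricks.delta.retentionDurationCheck.enabled = true;
```

Re-enabling the check afterwards limits the chance that a later VACUUM in the same session runs with an unsafe interval by accident.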

Audit information

VACUUM commits to the Delta transaction log contain audit information. You can query the audit events using DESCRIBE HISTORY.
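For example, the following sketch lists the table history, where vacuum commits appear alongside other operations. The LIMIT value is arbitrary. The exact operation names recorded for vacuum (for instance, separate start and end entries) can vary by runtime version, so treat any specific name as an assumption to verify against your own output.

```sql
-- Show the full commit history, including vacuum commits.
DESCRIBE HISTORY eventsTable;

-- Show only the five most recent commits.
DESCRIBE HISTORY eventsTable LIMIT 5;
```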