Delta Lake limitations on S3

This article details some of the limitations you might encounter while working with data stored in S3 with Delta Lake on Databricks. The eventually consistent model used in Amazon S3 can lead to problems when multiple systems or clusters modify data in the same table simultaneously.

Databricks and Delta Lake support multi-cluster writes by default, meaning that queries writing to a table from multiple clusters at the same time won’t corrupt the table. For Delta tables stored on S3, this guarantee is limited to a single Databricks workspace.

Warning

To avoid potential data corruption and data loss issues, Databricks recommends you do not modify the same Delta table stored in S3 from different workspaces.

Bucket versioning and Delta Lake

You can use S3 bucket versioning to provide additional redundancy for data stored with Delta Lake. For all S3 buckets with versioning enabled, Databricks recommends retaining three versions and implementing a lifecycle management policy that retains versions for 7 days or less.
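
As a rough illustration, the following sketch expresses such a policy with boto3 (the bucket name and rule ID are hypothetical; adjust the retention values to your own requirements):

    import boto3

    s3 = boto3.client("s3")

    # Expire noncurrent object versions after 7 days, while keeping the
    # 3 most recent noncurrent versions, for every object in the bucket.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-delta-bucket",  # hypothetical bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-old-delta-versions",
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},  # apply to the whole bucket
                    "NoncurrentVersionExpiration": {
                        "NoncurrentDays": 7,
                        "NewerNoncurrentVersions": 3,
                    },
                }
            ]
        },
    )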

Important

If you encounter a performance slowdown on tables stored in buckets with versioning enabled, indicate that bucket versioning is enabled when contacting Databricks support.

What are the limitations of multi-cluster writes on S3?

The following features are not supported when multi-cluster writes are enabled:

You can disable multi-cluster writes by setting spark.databricks.delta.multiClusterWrites.enabled to false. When multi-cluster writes are disabled, writes to a single table must originate from a single cluster.
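
For example, a minimal sketch of setting this option from a notebook, assuming the usual PySpark entry point named spark (depending on your runtime, the setting may instead need to be applied as a cluster-level Spark configuration):

    # Disable Delta Lake multi-cluster writes; after this, all writes to a
    # given Delta table must originate from this single cluster.
    spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")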

Warning

Disabling spark.databricks.delta.multiClusterWrites.enabled and modifying the same Delta table from multiple clusters concurrently can lead to data loss or data corruption.

Why is Delta Lake data I deleted still stored in S3?

If you are using Delta Lake and you have enabled bucket versioning on the S3 bucket, you have two entities managing table files: Delta Lake and S3 bucket versioning. Databricks recommends disabling bucket versioning so that the VACUUM command can effectively remove unused data files.
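
For example, a sketch of reclaiming storage once versioning is disabled (the table name is hypothetical):

    # Remove data files that are no longer referenced by the Delta table's
    # transaction log and are older than the retention threshold (7 days by default).
    spark.sql("VACUUM my_schema.my_delta_table")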

Why does a table show old data after I delete Delta Lake files with rm -rf and create a new table in the same location?

Deletes on S3 are only eventually consistent, so old versions of the transaction log may still be visible for a while after you delete a table. To avoid this, do not reuse a table path after deleting it. Instead, Databricks recommends that you use transactional mechanisms such as DELETE FROM, overwrite, and overwriteSchema to delete and update tables. See Best practice to replace a table.
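
For example, instead of removing files with rm -rf and recreating the table at the same path, a sketch of transactionally replacing a table (the DataFrame and table names are hypothetical):

    # Replace the table's data and schema in a single transaction, keeping the
    # transaction log consistent instead of deleting files out from under it.
    (new_df.write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable("my_delta_table"))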