Managed vs. external volumes
This article discusses the differences between managed volumes and external volumes and the reasons why you might choose to use external volumes. Databricks recommends managed volumes as the simplest solution for storing and managing access to non-tabular data.
For more guidance on using Unity Catalog to configure access to cloud object storage, see Connect to cloud object storage using Unity Catalog.
Behavior differences between managed and external volumes
Managed and external volumes provide nearly identical experiences when using Databricks tools, UIs, and APIs. The following are the differences between these volume types.
Managed volumes provide a fully managed storage experience. This means the following:
All interactions with files in managed volumes must go through Unity Catalog.
Directory naming and data layout are managed by Unity Catalog. Directory names include hashes to avoid conflicts in underlying cloud object storage accounts.
When you drop a managed volume, Databricks deletes the underlying data within 30 days.
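Because files in a managed volume are reached through Unity Catalog rather than through the hashed storage directories, they are addressed by a volume path of the form /Volumes/<catalog>/<schema>/<volume>/<path>. The following is a minimal sketch of building such a path. The helper volume_path and the catalog, schema, and volume names are hypothetical and used only for illustration.

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a Unity Catalog volume path: /Volumes/<catalog>/<schema>/<volume>/<path>."""
    # The three name components identify the volume and must not contain slashes.
    for name in (catalog, schema, volume):
        if not name or "/" in name:
            raise ValueError(f"invalid name component: {name!r}")
    # Drop empty segments and stray slashes from the path inside the volume.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(["/Volumes", catalog, schema, volume, *cleaned])


# Hypothetical names, for illustration only.
print(volume_path("main", "default", "raw_files", "2024", "report.pdf"))
```

The same path form also works for external volumes, which is one reason the two volume types behave almost identically in Databricks tools.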
External volumes bring data governance to cloud object storage. This means the following:
You can use cloud URIs in Databricks or external systems to interact with files in external volumes.
All directories created within an external volume or files uploaded are relative to the LOCATION specified at creation.
When you drop an external volume, you remove the volume from Unity Catalog but the underlying data remains unchanged in the external location.
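The difference in how the two volume types are declared, and how a file inside an external volume relates to its LOCATION, can be sketched as follows. This is an illustrative sketch, not a reference implementation: the helpers create_volume_sql and cloud_uri, the bucket s3://example-bucket/landing, and all catalog, schema, and volume names are hypothetical. The sketch also assumes the names are plain identifiers that need no quoting or escaping.

```python
from typing import Optional


def create_volume_sql(catalog: str, schema: str, volume: str,
                      location: Optional[str] = None) -> str:
    """Return the SQL statement that creates a managed or an external volume."""
    name = f"{catalog}.{schema}.{volume}"
    if location is None:
        # Managed volume: Unity Catalog chooses the storage directory.
        return f"CREATE VOLUME {name}"
    # External volume: you point the volume at an existing cloud directory.
    return f"CREATE EXTERNAL VOLUME {name} LOCATION '{location}'"


def cloud_uri(location: str, relative_path: str) -> str:
    """Resolve a path inside an external volume against the volume's LOCATION."""
    return location.rstrip("/") + "/" + relative_path.lstrip("/")


# Hypothetical names and bucket, for illustration only.
location = "s3://example-bucket/landing"
print(create_volume_sql("main", "default", "raw_files"))
print(create_volume_sql("main", "default", "landing", location))
print(cloud_uri(location, "2024/report.pdf"))
```

Because the relative path is preserved under the LOCATION, an external system can reach the same file with the cloud URI, while a managed volume offers no stable URI to rely on.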
Why use external volumes?
External volumes allow you to add Unity Catalog data governance to existing cloud object storage directories. Some use cases for external volumes include the following:
Adding governance to data files without migration.
Governing files produced by other systems that must be ingested or accessed by Databricks.
Governing data produced by Databricks that must be accessed directly from cloud object storage by other systems.
Databricks recommends using external volumes to store non-tabular data files that are read or written by external systems in addition to Databricks. Unity Catalog does not govern reads and writes performed directly against cloud object storage from external systems, so you must configure additional policies and credentials in your cloud account to ensure that data governance policies are respected outside Databricks.