Recommendations for files in volumes and workspace files

When you upload or save data or files to Databricks, you can choose to store these files using Unity Catalog volumes or workspace files. This article contains recommendations and requirements for using these locations. For more details on volumes and workspace files, see Create and work with volumes and What are workspace files?.

Databricks recommends using Unity Catalog volumes to store data, libraries, and build artifacts. Store notebooks, SQL queries, and code files as workspace files. You can configure workspace file directories as Git folders to sync with remote Git repositories. See Git integration with Databricks Git folders. Small data files used for test scenarios can also be stored as workspace files.
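As a rough illustration of the two locations, the following sketch builds file paths using the `/Volumes/<catalog>/<schema>/<volume>/...` and `/Workspace/Users/<user>/...` path conventions. The helper functions and the catalog, schema, volume, and user names are hypothetical placeholders, not names defined in this article.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a path to a file in a Unity Catalog volume."""
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


def workspace_path(user: str, *parts: str) -> str:
    """Build a path to a workspace file in a user's home directory."""
    return str(PurePosixPath("/Workspace/Users", user, *parts))


# Hypothetical names, for illustration only.
data_file = volume_path("main", "sales", "raw_files", "2024", "orders.parquet")
code_file = workspace_path("someone@example.com", "project", "etl.py")

print(data_file)
print(code_file)
```

Keeping data paths under `/Volumes` and code paths under `/Workspace` makes it easy to tell at a glance which storage location a file lives in.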

The tables below provide specific recommendations based on your file type and feature needs.

Important

The Databricks File System (DBFS) is also available for file storage, but Databricks does not recommend it because all workspace users have access to files in DBFS. See DBFS.

File types

The following table provides storage recommendations by file type. The file types listed are examples; Databricks supports many other file formats.

File type

Recommendation

Databricks objects, such as notebooks and queries

Store as workspace files

Structured data files, such as Parquet files and ORC files

Store in Unity Catalog volumes

Semi-structured data files, such as text files (.csv, .txt) and JSON files (.json)

Store in Unity Catalog volumes

Unstructured data files, such as image files (.png, .svg), audio files (.mp3), and document files (.pdf, .docx)

Store in Unity Catalog volumes

Raw data files used for ad hoc or early data exploration

Store in Unity Catalog volumes

Operational data, such as log files

Store in Unity Catalog volumes

Large archive files, such as ZIP files (.zip)

Store in Unity Catalog volumes

Source code files, such as Python files (.py), Java files (.java), and Scala files (.scala)

Store as workspace files, if applicable, with other related objects, such as notebooks and queries.

Databricks recommends managing these files in a Git folder for version control and change tracking.

Build artifacts and libraries, such as Python wheels (.whl) and JAR files (.jar)

Store in Unity Catalog volumes

Configuration files

Store configuration files needed across workspaces in Unity Catalog volumes, but store them as workspace files if they are project files in a Git folder.
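The recommendations in the table above can be sketched as a simple lookup by file extension. This is an illustrative sketch only: the function name, the constants, and the `.parquet` and `.orc` extensions are assumptions added for the example, and the mapping is not exhaustive. Configuration files are omitted because their recommended location depends on how they are used.

```python
from pathlib import PurePosixPath
from typing import Optional

VOLUME = "Unity Catalog volume"
WORKSPACE = "workspace file"

# Extensions drawn from the examples in the table above.
# ".parquet" and ".orc" are assumed extensions for Parquet and ORC files.
RECOMMENDATION_BY_EXTENSION = {
    ".parquet": VOLUME, ".orc": VOLUME,                # structured data
    ".csv": VOLUME, ".txt": VOLUME, ".json": VOLUME,   # semi-structured data
    ".png": VOLUME, ".svg": VOLUME, ".mp3": VOLUME,    # unstructured data
    ".pdf": VOLUME, ".docx": VOLUME,
    ".zip": VOLUME,                                    # large archives
    ".whl": VOLUME, ".jar": VOLUME,                    # build artifacts and libraries
    ".py": WORKSPACE, ".java": WORKSPACE, ".scala": WORKSPACE,  # source code
}


def recommend_location(filename: str) -> Optional[str]:
    """Return the recommended storage location, or None if the extension is not listed."""
    suffix = PurePosixPath(filename).suffix.lower()
    return RECOMMENDATION_BY_EXTENSION.get(suffix)


for name in ["orders.parquet", "etl.py", "model-1.0-py3-none-any.whl", "notes.xyz"]:
    print(name, "->", recommend_location(name))
```

Returning `None` for unlisted extensions keeps the helper from guessing where the table gives no guidance.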

Feature comparison

The following table compares the feature offerings of workspace files and Unity Catalog volumes.

Feature

Workspace files

Unity Catalog volumes

File access

Workspace files are accessible only within the same workspace.

Files are globally accessible across workspaces.

Programmatic access

Files can be accessed using:

Files can be accessed using:

Databricks Asset Bundles

By default, all files in a bundle, which includes libraries and Databricks objects such as notebooks and queries, are deployed securely as workspace files. Permissions are defined in the bundle configuration.

Bundles can be customized to include libraries already in volumes when the libraries exceed the size limit of workspace files. See Databricks Asset Bundles library dependencies.

File permission level

Permissions are at the Git-folder level if the file is in a Git folder, otherwise permissions are set at the file level.

Permissions are at the volume level.

Permissions management

Permissions are managed by workspace ACLs and are limited to the containing workspace.

Metadata and permissions are managed by Unity Catalog. These permissions are applicable across all workspaces that have access to the catalog.

External storage mount

Does not support mounting external storage

Provides the option to point to pre-existing datasets on external storage by creating an external volume. See Create an external volume.

UDF support

Not supported

Writing from UDFs is supported using Volumes FUSE

File size

Store smaller files (less than 500 MB), such as source code files (.py, .md, .yml) needed alongside notebooks.

Store very large data files, up to the limits set by your cloud service provider.

Upload & download

Supports upload and download of files up to 10 MB.

Supports upload and download of files up to 5 GB.

Table creation support

Tables cannot be created with workspace files as the location.

Tables can be created from files in a volume by running COPY INTO, Auto Loader, or other options described in Ingest data into a Databricks lakehouse.

Directory structure & file paths

Files are organized in nested directories, each with its own permission model:

  • User home directories, one for each user and service principal in the workspace

  • Git folders

  • Shared

Files are organized in nested directories inside a volume

See How can you access data in Unity Catalog?.

File history

Use Git folders within workspaces to track file changes.

Audit logs are available.
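To illustrate the table creation row above, the following sketch builds a COPY INTO statement that loads files from a volume directory into a table. The function name, table name, and volume path are hypothetical, and the sketch assumes the target table already exists. Identifiers and paths are not escaped, so treat this as an outline rather than production code.

```python
def build_copy_into(table: str, volume_dir: str, file_format: str = "PARQUET") -> str:
    """Return a COPY INTO statement that loads files from a volume directory."""
    # Only paths inside Unity Catalog volumes are accepted in this sketch.
    if not volume_dir.startswith("/Volumes/"):
        raise ValueError("expected a path under /Volumes/")
    return (
        f"COPY INTO {table}\n"
        f"FROM '{volume_dir}'\n"
        f"FILEFORMAT = {file_format}"
    )


# Hypothetical table and volume names, for illustration only.
statement = build_copy_into(
    "main.sales.orders",
    "/Volumes/main/sales/raw_files/orders/",
)
print(statement)

# In a Databricks notebook, where `spark` is predefined, the statement
# could then be run with:
# spark.sql(statement)
```

The path check reflects the comparison above: tables cannot be created with workspace files as the location, so the helper rejects anything outside `/Volumes/`.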