Databricks File System (DBFS)

Note

The CLI feature is unavailable on Databricks on Google Cloud as of this release.

Note

DBFS access to the local file system (FUSE mount) is not available in this release. For DBFS access, the Databricks dbutils commands, Hadoop Filesystem APIs such as the %fs command, and Spark read and write APIs are available. Contact your Databricks representative for any questions.

Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:

  • Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
  • Allows you to interact with object storage using directory and file semantics instead of storage URLs.
  • Persists files to object storage, so you won’t lose data after you terminate a cluster.
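
For example, once a bucket has been mounted (see Mount object storage to DBFS below), you can work with it using directory and file semantics instead of storage URLs. The following is a minimal sketch; the mount point name is a placeholder.

# The mount point below is a placeholder and assumes a bucket has already been
# mounted as described in "Mount object storage to DBFS".
display(dbutils.fs.ls("/mnt/my-example-bucket"))            # list with directory semantics
df = spark.read.text("dbfs:/mnt/my-example-bucket/data")    # read without a storage URL or credentials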

Important information about DBFS permissions

All users have read and write access to the objects in object storage mounted to DBFS, with the exception of the DBFS root.

DBFS root

The default storage location in DBFS is known as the DBFS root. Several types of data are stored in dedicated locations under the DBFS root. In a new workspace, the DBFS root has the following default folders:

DBFS root default folders

The DBFS root also contains data that is not visible and cannot be directly accessed, including mount point metadata, credentials, and certain types of logs.

Configuration and usage recommendations

The DBFS root is created during workspace creation. Databricks creates two buckets, one for the DBFS root and one for workspace system data.

Data written to mount point paths (/mnt) is stored outside of the DBFS root, for example in GCS buckets that are mounted as DBFS paths.

Even though the DBFS root is writeable, Databricks recommends that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data.
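
For example, a production write would typically target a mounted bucket rather than a path under the DBFS root. This is a minimal sketch; the DataFrame df and the mount point name are placeholders.

# Recommended: write to a mounted bucket (placeholder mount point)
df.write.parquet("/mnt/my-example-bucket/events")

# Not recommended for production data: writing into the DBFS root, for example
# df.write.parquet("/tmp/events")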

Special DBFS root locations

Separate articles provide more detail on special DBFS root locations.

Browse DBFS using the UI

You can browse and search for DBFS objects using the DBFS file browser.

Note

An admin user must enable the DBFS browser interface before you can use it. See Manage the DBFS file browser.

  1. Click Data Icon Data in the sidebar.
  2. Click the DBFS button at the top of the page.

The browser displays DBFS objects in a hierarchy of vertical swimlanes. Select an object to expand the hierarchy. Use Prefix search in any swimlane to find a DBFS object.

Browse DBFS

You can also list DBFS objects using the Databricks file system utility (dbutils.fs) and Spark APIs. See Access DBFS.

Mount object storage to DBFS

Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system.

For information on how to mount Google Cloud Storage (GCS) buckets, see Google Cloud Storage.

Important

Nested mounts are not supported. For example, the following structure is not supported:

  • storage1 mounted as /mnt/storage1
  • storage2 mounted as /mnt/storage1/storage2

Databricks recommends creating separate mount entries for each storage object:

  • storage1 mounted as /mnt/storage1
  • storage2 mounted as /mnt/storage2
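
As a minimal sketch, a GCS bucket can be mounted and unmounted with dbutils.fs. The bucket and mount names below are placeholders, and the sketch assumes the cluster is configured so that it can access the bucket; see Google Cloud Storage for the supported setup.

# Placeholder bucket and mount names; assumes the cluster can access the bucket.
dbutils.fs.mount("gs://my-example-bucket", "/mnt/my-example-bucket")

# Verify the mount, then unmount it when it is no longer needed.
display(dbutils.fs.ls("/mnt/my-example-bucket"))
dbutils.fs.unmount("/mnt/my-example-bucket")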

Access DBFS

Important

All users have read and write access to the objects in object storage mounted to DBFS, with the exception of the DBFS root. For more information, see Important information about DBFS permissions.

You can upload data to DBFS using the file upload interface, Databricks file system utility (dbutils.fs), and Spark APIs.

In a Databricks cluster you access DBFS objects using the Databricks file system utility or Spark APIs.

DBFS and local driver node paths

You can work with files on DBFS or on the local driver node of the cluster. You can access the file system using magic commands such as %fs or %sh. You can also use the Databricks file system utility (dbutils.fs).

Note

DBFS access to the local file system (FUSE mount) is not available in this release. For DBFS access, the Databricks dbutils commands, Hadoop Filesystem APIs such as the %fs command, and Spark read and write APIs are available. Contact your Databricks representative for any questions.

Access files on DBFS

The path to the default storage location (the DBFS root) is dbfs:/.

The default location for %fs and dbutils.fs is the DBFS root. Thus, to read from or write to the root or an external bucket:

%fs <command> /<path>
dbutils.fs.<command>("/<path>/")
Examples
# Default location for %fs is root
%fs ls /tmp/
%fs mkdirs /tmp/my_cloud_dir
%fs cp /tmp/test_dbfs.txt /tmp/file_b.txt
# Default location for dbutils.fs is root
dbutils.fs.ls("/tmp/")
dbutils.fs.put("/tmp/my_new_file", "This is a file in cloud storage.")

Access files on the local filesystem

Note

DBFS access to the local file system (FUSE mount) is not available in this release. For DBFS access, the Databricks dbutils commands, Hadoop Filesystem APIs such as the %fs command, and Spark read and write APIs are available. Contact your Databricks representative for any questions.

File upload interface

If you have small data files on your local machine that you want to analyze with Databricks, you can easily import them to Databricks File System (DBFS) using one of the two file upload interfaces: from the DBFS file browser or from a notebook.

Files are uploaded to the FileStore directory.
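
After an upload completes, you can confirm that the files landed in FileStore from a notebook. A minimal sketch:

# List files uploaded through the file upload interface
display(dbutils.fs.ls("dbfs:/FileStore/"))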

Upload data to DBFS from the file browser

Note

This feature is disabled by default. An administrator must enable the DBFS browser interface before you can use it. See Manage the DBFS file browser.

  1. Click Data Icon Data in the sidebar.

  2. Click the DBFS button at the top of the page.

  3. Click the Upload button at the top of the page.

  4. On the Upload Data to DBFS dialog, optionally select a target directory or enter a new one.

  5. In the Files box, drag and drop or use the file browser to select the local file to upload.

    Upload to DBFS from the browser

Uploaded files are accessible by everyone who has access to the workspace.

Upload data to DBFS from a notebook

Note

This feature is enabled by default. If an administrator has disabled this feature, you will not have the option to upload files.

To create a table using the UI, see Create a table using the UI.

To upload data for use in a notebook, follow these steps.

  1. Create a new notebook or open an existing one, then click File > Upload Data.

    Upload data
  2. Select a target directory in DBFS to store the uploaded file. The target directory defaults to /shared_uploads/<your-email-address>/.

    Uploaded files are accessible by everyone who has access to the workspace.

  3. Either drag files onto the drop target or click Browse to locate files in your local filesystem.

    Select Files and Destination
  4. When you have finished uploading the files, click Next.

    If you’ve uploaded CSV, TSV, or JSON files, Databricks generates code showing how to load the data into a DataFrame (a representative sketch follows these steps).

    View Files and Sample Code

    To save the text to your clipboard, click Copy.

  5. Click Done to return to the notebook.
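
For an uploaded CSV file, the generated code is typically similar to the following sketch; the target path, file name, and read options shown here are illustrative placeholders rather than the exact output.

# Illustrative placeholders for the upload path and file name
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("dbfs:/shared_uploads/<your-email-address>/example.csv"))
display(df)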

dbutils

dbutils.fs provides file-system-like commands to access files in DBFS. This section has several examples of how to write files to and read files from DBFS using dbutils.fs commands.

Tip

To access the help menu for DBFS, use the dbutils.fs.help() command.
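
You can display the full help menu, or pass a command name to show help for just that command:

# Show the full dbutils.fs help menu
dbutils.fs.help()

# Show help for a single command, for example cp
dbutils.fs.help("cp")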

Write files to and read files from the DBFS root as if it were a local filesystem

dbutils.fs.mkdirs("/foobar/")
dbutils.fs.put("/foobar/baz.txt", "Hello, World!")
dbutils.fs.head("/foobar/baz.txt")
dbutils.fs.rm("/foobar/baz.txt")

Use dbfs:/ to access a DBFS path

display(dbutils.fs.ls("dbfs:/foobar"))

Use %fs magic commands

Notebooks support a shorthand—%fs magic commands—for accessing the dbutils filesystem module. Most dbutils.fs commands are available using %fs magic commands.

# List the DBFS root
%fs ls

# Recursively remove the files under foobar
%fs rm -r foobar

# Overwrite the file "/mnt/my-file" with the string "Hello world!"
%fs put -f "/mnt/my-file" "Hello world!"

Spark APIs

When you’re using Spark APIs, you reference files with "/mnt/training/file.csv" or "dbfs:/mnt/training/file.csv". The following example writes the file foo.txt to the DBFS /tmp directory.

df.write.text("/tmp/foo.txt")
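
For example, reading the same file with either path form produces the same DataFrame. A minimal sketch; the path and the header option are illustrative:

# Implicit dbfs:/ scheme
df1 = spark.read.csv("/mnt/training/file.csv", header=True)

# Explicit dbfs:/ scheme
df2 = spark.read.csv("dbfs:/mnt/training/file.csv", header=True)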

Local file APIs

Note

DBFS access to the local file system (FUSE mount) is not available in this release. For DBFS access, the Databricks dbutils commands, Hadoop Filesystem APIs such as the %fs command, and Spark read and write APIs are available. Contact your Databricks representative for any questions.