How to work with files on Databricks
You can work with files on DBFS, on the local driver node of the cluster, in cloud object storage, in external locations, and in Databricks Repos. You can integrate other systems, but many of these do not provide direct file access to Databricks.
This article focuses on understanding the differences between interacting with files stored in the ephemeral volume storage attached to a running cluster and files stored in the DBFS root. You can directly apply the concepts shown for the DBFS root to mounted cloud object storage, because the /mnt directory is under the DBFS root. Most examples can also be applied to direct interactions with cloud object storage and external locations if you have the required privileges.
What is the root path for Databricks?
The root path on Databricks depends on the code executed.
The DBFS root is the root path for Spark and DBFS commands. These include:
Spark SQL
DataFrames
dbutils.fs
%fs
The block storage volume attached to the driver is the root path for code executed locally. This includes:
%sh
Most Python code (not PySpark)
Most Scala code (not Spark)
Note
If you are working in Databricks Repos, the root path for %sh is your current repo directory. For more details, see Programmatically interact with workspace files.
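To see the difference, you can list both roots from the same notebook. This is a minimal sketch; the exact directory contents depend on your workspace and cluster.
# dbutils.fs resolves "/" against the DBFS root
display(dbutils.fs.ls("/"))
# Python's os module resolves "/" against the driver's local filesystem
import os
print(os.listdir("/"))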
Access files on the DBFS root
When using commands that default to the DBFS root, you can use the relative path or include dbfs:/.
SELECT * FROM parquet.`<path>`;
SELECT * FROM parquet.`dbfs:/<path>`
df = spark.read.load("<path>")
df.write.save("<path>")
dbutils.fs.<command> ("<path>")
%fs <command> /<path>
When using commands that default to the driver volume, you must use /dbfs before the path.
%sh <command> /dbfs/<path>/
import os
os.<command>('/dbfs/<path>')
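For example, the following sketch writes a small file to the DBFS root with dbutils.fs and then reads it back from local Python code by adding /dbfs before the path. The path /tmp/demo.txt is only an illustration.
# Write to the DBFS root (the default path resolution for dbutils.fs)
dbutils.fs.put("/tmp/demo.txt", "Hello from the DBFS root", overwrite=True)
# Read the same file with local file APIs through the /dbfs prefix
with open("/dbfs/tmp/demo.txt") as f:
    print(f.read())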
Access files on the driver filesystem
When using commands that default to the driver storage, you can provide a relative or absolute path.
%sh <command> /<path>
import os
os.<command>('/<path>')
When using commands that default to the DBFS root, you must use file:/.
dbutils.fs.<command> ("file:/<path>")
%fs <command> file:/<path>
Because these files live on the attached driver volumes and Spark is a distributed processing engine, not all operations can directly access data here. If you need to move data from the driver filesystem to DBFS, you can copy files using magic commands or the Databricks utilities.
dbutils.fs.cp ("file:/<path>", "dbfs:/<path>")
%sh cp /<path> /dbfs/<path>
%fs cp file:/<path> /<path>
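For example, the sketch below (the file names are placeholders) creates a file on the driver, copies it into DBFS with the Databricks utilities, and then reads it with Spark so the data is visible to all executors.
# Create a file on the driver's local filesystem
with open("/tmp/report.csv", "w") as f:
    f.write("id,value\n1,foo\n2,bar\n")
# Copy it to DBFS so that Spark can read it
dbutils.fs.cp("file:/tmp/report.csv", "dbfs:/tmp/report.csv")
# Read it with Spark from the DBFS root
df = spark.read.option("header", True).csv("dbfs:/tmp/report.csv")
df.show()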
Understand default locations with examples
The following table summarizes the commands described in this section and when to use each syntax.
Note
Commands leveraging open source or driver-only execution use FUSE to access data in cloud object storage. Adding /dbfs to the file path automatically uses the DBFS implementation of FUSE.
Command | Default location | To read from DBFS root | To read from local filesystem
---|---|---|---
%fs | DBFS root | Use a relative path or add dbfs:/ | Add file:/ to the path
%sh | Local driver node | Add /dbfs before the path | Use a relative or absolute path
dbutils.fs | DBFS root | Use a relative path or add dbfs:/ | Add file:/ to the path
os.<command> | Local driver node | Add /dbfs before the path | Use a relative or absolute path
Spark APIs | DBFS root | Use a relative path or add dbfs:/ | Not supported

# Default location for %fs is root
%fs ls /tmp/
%fs mkdirs /tmp/my_cloud_dir
%fs cp /tmp/test_dbfs.txt /tmp/file_b.txt
# Default location for dbutils.fs is root
dbutils.fs.ls ("/tmp/")
dbutils.fs.put("/tmp/my_new_file", "This is a file in cloud storage.")
# Default location for %sh is the local filesystem
%sh ls /dbfs/tmp/
# Default location for os commands is the local filesystem
import os
os.listdir('/dbfs/tmp')
# With %fs and dbutils.fs, you must use file:/ to read from local filesystem
%fs ls file:/tmp
%fs mkdirs file:/tmp/my_local_dir
dbutils.fs.ls ("file:/tmp/")
dbutils.fs.put("file:/tmp/my_new_file", "This is a file on the local driver node.")
# %sh reads from the local filesystem by default
%sh ls /tmp
Access files on mounted object storage
Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system.
dbutils.fs.ls("/mnt/mymount")
df = spark.read.format("text").load("dbfs:/mnt/mymount/my_file.txt")
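Because the /mnt directory is under the DBFS root, you can also read a mounted object with local file APIs by adding /dbfs before the path. The mount point and file name below match the illustrative example above.
# Read a mounted object through the /dbfs FUSE path with plain Python
with open("/dbfs/mnt/mymount/my_file.txt") as f:
    print(f.read())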
Local file API limitations
The following list describes the limitations of local file API usage with the DBFS root and mounts in Databricks Runtime.
No credential passthrough.
No random writes. For workloads that require random writes, perform the operations on local disk first and then copy the result to /dbfs. For example:
# Write the workbook to the driver's local disk first, where random writes are supported,
# then copy the finished file to DBFS.
import xlsxwriter
from shutil import copyfile

workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "Key")
worksheet.write(0, 1, "Value")
workbook.close()

copyfile('/local_disk0/tmp/excel.xlsx', '/dbfs/tmp/excel.xlsx')
No sparse files. To copy sparse files, use cp --sparse=never:
$ cp sparse.file /dbfs/sparse.file
error writing '/dbfs/sparse.file': Operation not supported
$ cp --sparse=never sparse.file /dbfs/sparse.file