What is Auto Loader directory listing mode?
Auto Loader uses directory listing mode by default. In directory listing mode, Auto Loader identifies new files by listing the input directory. Directory listing mode allows you to quickly start Auto Loader streams without any permission configurations other than access to your data on cloud storage.
For best performance with directory listing mode, use Databricks Runtime 9.1 or above. This article describes the default functionality of directory listing mode as well as optimizations based on lexical ordering of files.
How does directory listing mode work?
Databricks has optimized directory listing mode for Auto Loader to discover files in cloud storage more efficiently than other Apache Spark options.
For example, if you have files being uploaded every 5 minutes as /some/path/YYYY/MM/DD/HH/fileName, the Apache Spark file source must list all subdirectories in parallel to find all the files. The following calculation estimates the total number of LIST API calls to object storage for one year of data:
1 (base directory) + 365 (days) * 24 (hours per day) = 8761 calls
By receiving a flattened response from storage, Auto Loader reduces the number of API calls to the number of files in storage divided by the number of results returned by each API call, greatly reducing your cloud costs. The following table shows the number of results returned by each API call for common object storage:

Object storage: Results returned per call
AWS S3: 1000
Azure Data Lake Storage Gen2: 5000
Azure Blob Storage: 5000
GCS: 1000
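To make the cost difference concrete, the arithmetic above can be sketched in a few lines of Python (not part of Auto Loader). The page size of 1000 results per call is an illustrative assumption:

```python
import math

# Spark file source: roughly one LIST call per hourly directory in
# /some/path/YYYY/MM/DD/HH/ over a year, plus the base directory.
spark_style_calls = 1 + 365 * 24  # = 8761

# Auto Loader directory listing mode: a flattened listing pages through
# every file, so the call count depends only on file count and page size.
def flattened_list_calls(num_files: int, results_per_call: int = 1000) -> int:
    return math.ceil(num_files / results_per_call)

# A year of files arriving every 5 minutes: 12 files per hour.
num_files = 365 * 24 * 12  # 105120 files
print(spark_style_calls, flattened_list_calls(num_files))
```

For this workload, the flattened listing needs roughly two orders of magnitude fewer calls than per-directory listing.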
Incremental listing
Available in Databricks Runtime 9.1 LTS and above.
Incremental listing is available for Azure Data Lake Storage Gen2 (abfss://), S3 (s3://), and GCS (gs://).
For lexicographically generated files, Auto Loader leverages the lexical file ordering and optimized listing APIs to improve the efficiency of directory listing by listing from recently ingested files rather than listing the contents of the entire directory.
By default, Auto Loader automatically detects whether a given directory is applicable for incremental listing by checking and comparing file paths of previously completed directory listings. To ensure eventual completeness of data in auto mode, Auto Loader automatically triggers a full directory list after completing 7 consecutive incremental lists. You can control the frequency of full directory lists by setting cloudFiles.backfillInterval to trigger asynchronous backfills at a given interval.
You can explicitly enable or disable incremental listing by setting cloudFiles.useIncrementalListing (default "auto"). When explicitly enabled, Auto Loader does not trigger full directory lists unless a backfill interval is set. Services like AWS Kinesis Data Firehose, AWS DMS, and Azure Data Factory can be configured to upload files to a storage system in lexical order.
Lexical ordering of files
For files to be lexically ordered, new files that are uploaded need to have a prefix that is lexicographically greater than existing files. Some examples of lexically ordered directories are shown below.
Delta Lake makes commits to table transaction logs in a lexical order.
<path_to_table>/_delta_log/00000000000000000000.json
<path_to_table>/_delta_log/00000000000000000001.json <- guaranteed to be written after version 0
<path_to_table>/_delta_log/00000000000000000002.json <- guaranteed to be written after version 1
...
AWS DMS uploads CDC files to AWS S3 in a versioned manner.
database_schema_name/table_name/LOAD00000001.csv
database_schema_name/table_name/LOAD00000002.csv
...
Date partitioned files
Files can be uploaded in a date partitioned format and leverage incremental listing. Some examples of this are:
// <base_path>/yyyy/MM/dd/HH:mm:ss-randomString
<base_path>/2021/12/01/10:11:23-b1662ecd-e05e-4bb7-a125-ad81f6e859b4.json
<base_path>/2021/12/01/10:11:23-b9794cf3-3f60-4b8d-ae11-8ea320fad9d1.json
...
// <base_path>/year=yyyy/month=MM/day=dd/hour=HH/minute=mm/randomString
<base_path>/year=2021/month=12/day=04/hour=08/minute=22/442463e5-f6fe-458a-8f69-a06aa970fc69.csv
<base_path>/year=2021/month=12/day=04/hour=08/minute=22/8f00988b-46be-4112-808d-6a35aead0d44.csv <- this may be uploaded before the file above as long as processing happens less frequently than a minute
When files are uploaded with date partitioning, some things to keep in mind are:
Months, days, hours, and minutes need to be left-padded with zeros to ensure lexical ordering (for example, files should be uploaded as hour=03 instead of hour=3).
Files don’t necessarily have to be uploaded in lexical order in the deepest directory as long as processing happens less frequently than the parent directory’s time granularity.
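The zero-padding requirement above can be verified with plain string sorting in Python; the directory names below are illustrative:

```python
# Without zero-padding, string comparison breaks time ordering:
# "hour=10" sorts before "hour=9" because "1" < "9".
unpadded = ["hour=9", "hour=10", "hour=11"]
padded = ["hour=09", "hour=10", "hour=11"]

assert sorted(unpadded) != unpadded  # lexical order broken
assert sorted(padded) == padded      # lexical order matches time order
```

The same reasoning is why Delta Lake pads transaction log versions and AWS DMS pads LOAD file numbers with leading zeros.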
Some services that can upload files in a date-partitioned lexical ordering are:
Azure Data Factory, which can be configured to upload files in a lexical order.
Change source path for Auto Loader
In Databricks Runtime 11.3 LTS and above, you can change the directory input path for Auto Loader configured with directory listing mode without having to choose a new checkpoint directory.
This functionality is not supported for file notification mode.
For example, if you wish to run a daily ingestion job that loads all data from a directory structure organized by day, such as /YYYYMMDD/, you can use the same checkpoint to track ingestion state across a different source directory each day, while retaining state information for files ingested from all previously used source directories.
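A hedged sketch of such a daily job follows. The table name, base path, and checkpoint location are placeholders, and the snippet assumes Databricks Runtime 11.3 LTS or above with a `spark` session defined:

```python
from datetime import date

# Hypothetical daily job: the source path changes each run,
# while the checkpoint location stays fixed so Auto Loader
# carries ingestion state across all previously used paths.
source_path = f"/ingest/{date.today():%Y%m%d}/"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/checkpoints/daily_ingest/schema")
    .load(source_path)
)

(df.writeStream
   .option("checkpointLocation", "/checkpoints/daily_ingest")  # unchanged across days
   .trigger(availableNow=True)  # process available files, then stop
   .toTable("daily_ingest"))
```

Because the checkpoint is reused, files ingested from earlier days' directories are not reprocessed even though the input path changes.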