Using Auto Loader with Unity Catalog

Auto Loader can securely ingest data from external locations configured with Unity Catalog. To learn more about securely connecting storage with Unity Catalog, see Connect to cloud object storage using Unity Catalog. Auto Loader relies on Structured Streaming for incremental processing; for recommendations and limitations see Using Unity Catalog with Structured Streaming.

Note

In Databricks Runtime 11.3 LTS and above, you can use Auto Loader with either shared or single user access modes.

Directory listing mode is supported by default. File notification mode is only supported on single user clusters.
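Directory listing mode requires no extra configuration. As a minimal sketch, file notification mode can be enabled on a single user cluster by setting the cloudFiles.useNotifications option; the bucket paths below are placeholders, not part of the example later in this article:

df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  # Opt in to file notification mode (single user clusters only);
  # directory listing mode is the default.
  .option("cloudFiles.useNotifications", "true")
  .option("cloudFiles.schemaLocation", "gs://example-bucket/_schema/events")
  .load("gs://example-source/events"))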

Ingesting data from external locations managed by Unity Catalog with Auto Loader

You can use Auto Loader to ingest data from any external location managed by Unity Catalog. You must have READ FILES permissions on the external location.
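For example, the owner of the external location (or a metastore admin) can grant this privilege with SQL; the location and group names below are hypothetical:

# Hypothetical names: grant READ FILES on a Unity Catalog external location.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION `autoloader_source` TO `data_engineers`")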

Specifying locations for Auto Loader resources for Unity Catalog

The Unity Catalog security model assumes that all storage locations referenced in a workload will be managed by Unity Catalog. Databricks recommends always storing checkpoint and schema evolution information in storage locations managed by Unity Catalog. Unity Catalog does not allow you to nest checkpoint or schema inference and evolution files under the table directory.
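For example, a stream writing to a table should keep its checkpoint (which also holds the schema inference and evolution files) in a separate path governed by Unity Catalog; a minimal sketch with placeholder paths:

# Allowed: a path in a Unity Catalog external location, outside the table directory.
checkpoint_path = "gs://example-bucket/_checkpoint/my_table"

# Not allowed: a path nested under the table's own storage directory, such as
# "gs://example-bucket/tables/my_table/_checkpoint".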

Examples

The following examples assume the executing user has owner privileges on the target tables and the following grants on these storage locations:

| Storage location | Grant |
| --- | --- |
| gs://autoloader-source/json-data | READ FILES |
| gs://dev-bucket | READ FILES, WRITE FILES, CREATE TABLE |

Using Auto Loader to load to a Unity Catalog managed table

# Store the stream checkpoint and inferred schema in a Unity Catalog
# external location, outside the table directory.
checkpoint_path = "gs://dev-bucket/_checkpoint/dev_table"

(spark.readStream
  .format("cloudFiles")                                  # Auto Loader source
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", checkpoint_path)  # schema inference and evolution files
  .load("gs://autoloader-source/json-data")              # external location with READ FILES
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)                            # process available files, then stop
  .toTable("dev_catalog.dev_database.dev_table"))        # Unity Catalog managed table
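
In this example, the same path serves as both cloudFiles.schemaLocation and checkpointLocation, keeping all stream state under a single Unity Catalog-governed prefix on gs://dev-bucket, separate from the managed table's storage. The availableNow trigger processes all files that are available when the query starts and then stops, so the stream can be run as a scheduled incremental batch job.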