Using Auto Loader with Unity Catalog
Auto Loader can securely ingest data from external locations configured with Unity Catalog. To learn more about securely connecting storage with Unity Catalog, see Connect to cloud object storage using Unity Catalog. Auto Loader relies on Structured Streaming for incremental processing; for recommendations and limitations see Using Unity Catalog with Structured Streaming.
Note
In Databricks Runtime 11.3 LTS and above, you can use Auto Loader with either shared or single user access modes.
Directory listing mode is supported by default. File notification mode is only supported on single user compute.
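File notification mode is opted into with the cloudFiles.useNotifications option. A minimal sketch, assuming single user compute; the source path and schema location are illustrative:

```python
# Illustrative sketch: file notification mode requires single user compute.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  # Opt out of the default directory listing mode.
  .option("cloudFiles.useNotifications", "true")
  .option("cloudFiles.schemaLocation", "gs://dev-bucket/_schemas/notification_example")
  .load("gs://autoloader-source/json-data"))
```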
Ingesting data from external locations managed by Unity Catalog with Auto Loader
You can use Auto Loader to ingest data from any external location managed by Unity Catalog. You must have READ FILES permissions on the external location.
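For reference, the grant can be issued with SQL. A minimal sketch, where the external location name autoloader_source_location and the principal data_engineers are illustrative:

```python
# Illustrative names: substitute your own external location and principal.
spark.sql("""
  GRANT READ FILES
  ON EXTERNAL LOCATION `autoloader_source_location`
  TO `data_engineers`
""")
```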
Specifying locations for Auto Loader resources for Unity Catalog
The Unity Catalog security model assumes that all storage locations referenced in a workload will be managed by Unity Catalog. Databricks recommends always storing checkpoint and schema evolution information in storage locations managed by Unity Catalog. Unity Catalog does not allow you to nest checkpoint or schema inference and evolution files under the table directory.
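The rule is easiest to see with paths. A sketch with illustrative bucket names:

```python
# The target table's data lives under its own directory, for example:
#   gs://dev-bucket/dev_table
# Recommended: keep checkpoint and schema files in a separate
# Unity Catalog-governed path.
checkpoint_path = "gs://dev-bucket/_checkpoint/dev_table"

# Not allowed: nesting checkpoint or schema files under the table directory.
# checkpoint_path = "gs://dev-bucket/dev_table/_checkpoint"
```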
Examples
The following examples assume that the executing user has owner privileges on the target tables and the following configurations and grants:
| Storage location | Grant |
|---|---|
| gs://autoloader-source/json-data | READ FILES |
| gs://dev-bucket | READ FILES, WRITE FILES, CREATE TABLE |
Using Auto Loader to load to a Unity Catalog managed table
```python
# Keep checkpoint and schema information in a Unity Catalog-governed path,
# outside the target table's own directory.
checkpoint_path = "gs://dev-bucket/_checkpoint/dev_table"

(spark.readStream
  .format("cloudFiles")                                   # Auto Loader source
  .option("cloudFiles.format", "json")                    # source file format
  .option("cloudFiles.schemaLocation", checkpoint_path)   # schema inference and evolution files
  .load("gs://autoloader-source/json-data")
  .writeStream
  .option("checkpointLocation", checkpoint_path)
  .trigger(availableNow=True)                             # process available files, then stop
  .toTable("dev_catalog.dev_database.dev_table"))
```