Ingest image data with Auto Loader

Note

Available in Databricks Runtime 9.0 and above.

Ingesting image data into Delta Lake with Auto Loader takes only a few lines of code and gives you the following benefits:

  • Automatic discovery of new files to process: You don’t need special logic to handle late-arriving data or to keep track of which files have already been processed.
  • Scalable file discovery: Auto Loader can ingest billions of files.
  • Optimized storage: Auto Loader can provide Delta Lake with additional information about the data so that file storage can be optimized.

The following example uses Auto Loader to read image files as binary data and write them to a Delta table:

Python

spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "binaryFile") \
  .load("<path_to_source_data>") \
  .writeStream \
  .option("checkpointLocation", "<path_to_checkpoint>") \
  .start("<path_to_target>")

Scala

spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "binaryFile")
  .load("<path_to_source_data>")
  .writeStream
  .option("checkpointLocation", "<path_to_checkpoint>")
  .start("<path_to_target>")

The preceding code writes your image data to a Delta table in an optimized format.
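
As a minimal sketch of what that table contains (assuming the same placeholder paths as above), you can read it back in batch mode and inspect the columns produced by the binaryFile source:

Python

from pyspark.sql import SparkSession

# Assumes an active Spark session; "<path_to_target>" is the same placeholder
# target path used by the streaming write above.
spark = SparkSession.builder.getOrCreate()

images_df = spark.read.format("delta").load("<path_to_target>")

# The binaryFile source stores one row per file with these columns:
#   path             - source file path
#   modificationTime - last-modified timestamp of the file
#   length           - file size in bytes
#   content          - raw image bytes
images_df.printSchema()
images_df.select("path", "length").show(5, truncate=False)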

Use a Delta table for machine learning

Once the data is stored in Delta Lake, you can run distributed inference on it, as sketched below. See the reference article for more details.
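
The following is a minimal sketch of one way to run inference in parallel across a cluster with a pandas UDF. It assumes a hypothetical predict_batch helper that wraps your own model, plus the placeholder target path from above; it is not the only approach.

Python

import io

import pandas as pd
from PIL import Image
from pyspark.sql.functions import pandas_udf

# Hypothetical helper: replace with your model's batch prediction logic,
# for example preprocessing the PIL images and running a trained classifier.
def predict_batch(images):
    return ["<label>" for _ in images]

@pandas_udf("string")
def predict_udf(content: pd.Series) -> pd.Series:
    # Decode the raw bytes written by the binaryFile source into PIL images,
    # then score the whole batch at once.
    images = [Image.open(io.BytesIO(b)) for b in content]
    return pd.Series(predict_batch(images))

# Assumes an active SparkSession named spark, as in the previous example.
images_df = spark.read.format("delta").load("<path_to_target>")
predictions = images_df.withColumn("prediction", predict_udf("content"))
predictions.select("path", "prediction").show(5, truncate=False)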