Databricks recommends that you use the binary file data source to load image data into the Spark DataFrame as raw bytes. See Reference solution for image applications for the recommended workflow to handle image data.
The image data source abstracts away the details of image representations and provides a standard API to load image data. To read image files, specify the data source format as image:
df = spark.read.format("image").load("<path-to-image-data>")
Similar APIs exist for Scala, Java, and R.
You can import a nested directory structure (for example, use a path like
/path/to/dir/) and you can use partition discovery by specifying a path with a partition directory (that is, a path like
Image files are loaded as a DataFrame containing a single struct-type column called image with the following fields:
image: struct containing all the image data
  |-- origin: string representing the source URI
  |-- height: integer, image height in pixels
  |-- width: integer, image width in pixels
  |-- nChannels
  |-- mode
  |-- data
where the fields are:
nChannels: The number of color channels. Typical values are 1 for grayscale images, 3 for colored images (for example, RGB), and 4 for colored images with alpha channel.
mode: Integer flag that indicates how to interpret the data field. It specifies the data type and channel order the data is stored in. The value of the field is expected (but not enforced) to map to one of the OpenCV types displayed in the following table. OpenCV types are defined for 1, 2, 3, or 4 channels and several data types for the pixel values. Channel order specifies the order in which the colors are stored. For example, if you have a typical three channel image with red, blue, and green components, there are six possible orderings. Most libraries use either RGB or BGR. Three (four) channel OpenCV types are expected to be in BGR(A) order.
Map of Type to Numbers in OpenCV (data types x number of channels)
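The numeric values in this map follow a regular pattern, so they can be regenerated rather than memorized. The sketch below assumes the standard OpenCV convention (depth codes 0 through 6, with each extra channel adding 8 to the code); the helper names opencv_type_code and opencv_type_table are hypothetical and are not part of Spark or OpenCV.

```python
# Depth codes as defined by OpenCV (CV_8U = 0 ... CV_64F = 6).
OPENCV_DEPTHS = {
    "CV_8U": 0, "CV_8S": 1, "CV_16U": 2, "CV_16S": 3,
    "CV_32S": 4, "CV_32F": 5, "CV_64F": 6,
}


def opencv_type_code(depth_name, n_channels):
    """Return the integer type code for a depth name and a channel count (1-4)."""
    if not 1 <= n_channels <= 4:
        raise ValueError("n_channels must be between 1 and 4")
    # Each additional channel shifts the code by 8.
    return OPENCV_DEPTHS[depth_name] + 8 * (n_channels - 1)


def opencv_type_table():
    """Map names such as 'CV_8UC3' to their integer codes."""
    return {
        f"{depth}C{channels}": opencv_type_code(depth, channels)
        for depth in OPENCV_DEPTHS
        for channels in range(1, 5)
    }


table = opencv_type_table()
print(table["CV_8UC1"], table["CV_8UC3"], table["CV_8UC4"])  # 0 16 24
```

For typical 8-bit images, the mode field would therefore be 0 (grayscale), 16 (BGR), or 24 (BGRA).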
data: Image data stored in a binary format. Image data is represented as a 3-dimensional array with the dimension shape (height, width, nChannels) and array values of type t specified by the mode field. The array is stored in row-major order.
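To make the row-major layout concrete, here is a minimal sketch of how a single pixel can be located in the data buffer. It assumes an 8-bit mode (one unsigned byte per value, such as CV_8UC3); the helper names pixel_at and bgr_to_rgb are hypothetical.

```python
def pixel_at(data, width, n_channels, row, col):
    """Return the channel values of one pixel from a row-major byte buffer.

    Assumes one unsigned byte per value (an 8-bit mode such as CV_8UC3).
    """
    # Rows are stored one after another; channels of a pixel are adjacent.
    start = (row * width + col) * n_channels
    return tuple(data[start:start + n_channels])


def bgr_to_rgb(pixel):
    """Reverse BGR(A) channel order to RGB(A); other pixels pass through."""
    if len(pixel) < 3:
        return pixel
    # Reverse the first three channels and keep alpha (if any) in place.
    return pixel[2::-1] + pixel[3:]


# A made-up 2 x 2 image with 3 channels: 12 bytes with values 0..11.
data = bytes(range(12))
pixel = pixel_at(data, width=2, n_channels=3, row=1, col=0)
print(pixel)              # (6, 7, 8)
print(bgr_to_rgb(pixel))  # (8, 7, 6)
```

If NumPy is available, the same layout can be viewed as an array with np.frombuffer(data, dtype=np.uint8).reshape(height, width, nChannels), again assuming an 8-bit mode.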
The display function supports displaying image data. See Images.
The image data source decodes the image files when the Spark DataFrame is created, which increases the data size and introduces limitations in the following scenarios:
Persisting the DataFrame: If you want to persist the DataFrame into a Delta table for easier access, you should persist the raw bytes instead of the decoded data to save disk space.
Shuffling the partitions: Shuffling the decoded image data takes more disk space and network bandwidth, which results in slower shuffling. You should delay decoding the image as much as possible.
Choosing another decoding method: The image data source uses the javax ImageIO library to decode the image, which prevents you from choosing other image decoding libraries for better performance or implementing customized decoding logic.
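As a rough illustration of why decoded data is larger, the size of the decoded data field follows directly from its shape. The sketch below is a back-of-the-envelope estimate, not a measurement; the function names are hypothetical, and the 500,000-byte compressed file size is a made-up figure used only for the arithmetic.

```python
def decoded_size_bytes(height, width, n_channels, bytes_per_value=1):
    """Size of the decoded data field for one image, in bytes."""
    return height * width * n_channels * bytes_per_value


def expansion_ratio(decoded_bytes, compressed_bytes):
    """How many times larger the decoded image is than the encoded file."""
    return decoded_bytes / compressed_bytes


# A 1920 x 1080 image with 3 channels and 8-bit values.
decoded = decoded_size_bytes(height=1080, width=1920, n_channels=3)
print(decoded)  # 6220800

# Hypothetical compressed file size of 500,000 bytes (illustrative only).
print(round(expansion_ratio(decoded, 500_000), 1))
```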
You can avoid these limitations by using the binary file data source to load image data and decoding only as needed.
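Loading raw bytes with the binary file data source can be sketched as follows. This is a minimal example under stated assumptions: spark is an existing SparkSession (as in a Databricks notebook), the helper name load_raw_images is hypothetical, and the *.jpg filter is only an example value for the pathGlobFilter option.

```python
def load_raw_images(spark, path, glob_filter="*.jpg"):
    """Load image files as raw bytes with the binary file data source.

    `spark` is assumed to be an existing SparkSession. The resulting
    DataFrame has the columns path, modificationTime, length, and content,
    where content holds the undecoded file bytes.
    """
    return (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", glob_filter)  # only read matching files
        .load(path)
    )
```

In a notebook this would be called as raw_df = load_raw_images(spark, "<path-to-image-data>"). The content column can then be persisted or shuffled as is, and decoded later with the library of your choice only in the step that needs pixel values.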