Prepare data for distributed training
This article describes three methods for preparing data for distributed training: Mosaic Streaming, Petastorm, and TFRecords.
Mosaic Streaming (Recommended)
Mosaic Streaming is an open-source data loading library that enables efficient streaming of large datasets from cloud storage. The library excels at handling massive datasets that don't fit in memory because it is designed specifically for multi-node, distributed training of large models. Mosaic Streaming integrates seamlessly with PyTorch and the MosaicML ecosystem. The following article illustrates this use case:
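For orientation, independent of the article referenced above, the following is a minimal sketch of streaming a dataset into a standard PyTorch DataLoader. It assumes the dataset has already been written in Mosaic's MDS format (for example, with streaming.MDSWriter); the remote and local paths are placeholders for your own cloud storage location and local cache directory.

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Placeholder locations: point `remote` at the dataset in your cloud
# storage and `local` at a cache directory on local disk.
dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',
    local='/tmp/my-dataset-cache',
    shuffle=True,       # shuffling is handled by the dataset itself
    batch_size=32,
)

# StreamingDataset works with a standard PyTorch DataLoader.
dataloader = DataLoader(dataset, batch_size=32)
for batch in dataloader:
    ...  # training step goes here
```

Because StreamingDataset handles shuffling and partitioning across nodes and workers itself, you pass shuffle=True to the dataset rather than to the DataLoader.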
Petastorm
Note
The petastorm package is deprecated. Mosaic Streaming is the recommended replacement for loading large datasets from cloud storage.
Petastorm is an open-source data access library that enables direct loading of data stored in Apache Parquet format. This is convenient for Databricks and Apache Spark users because Parquet is the recommended data format. The following article illustrates this use case:
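If you maintain existing Petastorm code, a minimal sketch of its Spark converter workflow looks roughly like the following. It assumes spark is an existing SparkSession (as in a Databricks notebook); the cache URL and input path are placeholders.

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# `spark` is assumed to be an existing SparkSession.
# Placeholder cache location for the materialized dataset.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///dbfs/tmp/petastorm/cache')

# Placeholder input path; any Parquet-backed DataFrame works.
df = spark.read.parquet('/path/to/parquet')

# Materialize the DataFrame once, then read it as a PyTorch DataLoader.
converter = make_spark_converter(df)
with converter.make_torch_dataloader(batch_size=32) as dataloader:
    for batch in dataloader:
        ...  # training step goes here
```

The converter materializes the DataFrame as Parquet files under the cache directory once, so repeated epochs read from the cache rather than recomputing the DataFrame.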
TFRecord
You can also use TFRecord format as the data source for distributed deep learning. TFRecord format is a simple record-oriented binary format that many TensorFlow applications use for training data.
tf.data.TFRecordDataset is a TensorFlow dataset that consists of records from TFRecord files. For details about how to consume TFRecord data, see the TensorFlow guide Consuming TFRecord data.
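As a minimal sketch, consuming TFRecord files might look like the following. The file path is a placeholder, and the feature spec is an assumption that must match the schema the records were written with.

```python
import tensorflow as tf

# Placeholder file list: point it at your own TFRecord files.
filenames = ['/dbfs/tmp/data/part-00000.tfrecord']
raw_dataset = tf.data.TFRecordDataset(filenames)

# Assumed feature spec: it must match the schema used when the
# records were written.
feature_spec = {
    'features': tf.io.FixedLenFeature([8], tf.float32),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    # Decode one serialized tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(serialized, feature_spec)

dataset = raw_dataset.map(parse_example).batch(32)
```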
The following articles describe and illustrate the recommended ways to save your data to TFRecord files and load them back:
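Independent of those articles, the following is a minimal sketch of writing a single record with plain TensorFlow; the output path and feature names are illustrative placeholders.

```python
import tensorflow as tf

# Placeholder output path and feature names.
with tf.io.TFRecordWriter('/dbfs/tmp/data/example.tfrecord') as writer:
    # Each record is a serialized tf.train.Example protocol buffer.
    example = tf.train.Example(features=tf.train.Features(feature={
        'features': tf.train.Feature(
            float_list=tf.train.FloatList(value=[0.1] * 8)),
        'label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[1])),
    }))
    writer.write(example.SerializeToString())
```

Each record is an opaque serialized string, so readers must supply a matching feature spec, as in the parsing sketch above.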