Load data

Note

The managed MLflow integration with Databricks on Google Cloud requires Databricks Runtime for Machine Learning 8.1 or above.

This section covers how to load data specifically for machine learning (ML) and deep learning (DL) applications. For general information about loading data, see Data guide.

Store files for data loading and model checkpointing

Machine learning applications may need to use shared storage for data loading and model checkpointing. This is particularly important for distributed deep learning. Databricks provides Databricks File System (DBFS) for accessing data on a cluster using both Spark and local file APIs.
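As a hedged illustration of the two access paths, the sketch below reads a dataset through the Spark API (dbfs:/ URIs) and writes a model checkpoint through the local file API (the /dbfs/ mount). The paths and file names are hypothetical, and `spark` is assumed to be an active SparkSession on a Databricks cluster.

```python
import os

# Spark APIs address DBFS with dbfs:/ URIs (hypothetical dataset path).
df = spark.read.parquet("dbfs:/ml/datasets/training_data")

# Local file APIs see the same storage under the /dbfs mount,
# so single-node libraries can checkpoint to shared storage.
checkpoint_dir = "/dbfs/ml/checkpoints"  # hypothetical checkpoint location
os.makedirs(checkpoint_dir, exist_ok=True)

with open(os.path.join(checkpoint_dir, "epoch_001.txt"), "w") as f:
    f.write("checkpoint metadata goes here")
```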

Load tabular data

You can load tabular machine learning data from tables or files (for example, CSV files). You can convert Apache Spark DataFrames to pandas DataFrames using the PySpark method toPandas(), and then optionally convert them to NumPy format using the pandas method to_numpy().
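For example, here is a minimal sketch of that conversion chain, assuming an active SparkSession named `spark` and a hypothetical table `default.training_data`; loading from a CSV file works the same way via spark.read.csv.

```python
# Load tabular data as a Spark DataFrame (the table name is hypothetical).
spark_df = spark.read.table("default.training_data")
# Alternatively, from a CSV file on DBFS:
# spark_df = spark.read.csv("dbfs:/ml/datasets/training_data.csv", header=True, inferSchema=True)

# Convert to a pandas DataFrame. This collects the data to the driver,
# so it is only suitable for datasets that fit in driver memory.
pandas_df = spark_df.toPandas()

# Optionally convert to a NumPy array for libraries that expect NumPy input.
numpy_array = pandas_df.to_numpy()
```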

Prepare data for distributed training

This section covers two methods for preparing data for distributed training: Petastorm and TFRecords.
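As a brief, non-authoritative sketch of the Petastorm route, the example below uses Petastorm's Spark dataset converter to cache a Spark DataFrame and serve it to TensorFlow; the cache directory is hypothetical, and `spark` and `df` are assumed to already exist. The TFRecords method is not shown here.

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Petastorm materializes the DataFrame to a cache directory on DBFS
# (hypothetical path) before serving it to the DL framework.
spark.conf.set(
    SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
    "file:///dbfs/tmp/petastorm/cache",
)

# `df` is assumed to be a Spark DataFrame of training features and labels.
converter = make_spark_converter(df)

# Serve the cached data as a tf.data.Dataset for training.
with converter.make_tf_dataset(batch_size=32) as dataset:
    # model.fit(dataset, ...)  # hypothetical training call
    pass

# Remove the cached copy when it is no longer needed.
converter.delete()
```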