Prepare data for fine-tuning Hugging Face models

This article demonstrates how to prepare your data for fine-tuning open source large language models with Hugging Face Transformers and Hugging Face Datasets.

Requirements

Load data from Hugging Face

Hugging Face Datasets is a Hugging Face library for accessing and sharing datasets for audio, computer vision, and natural language processing (NLP) tasks. It can load data from several sources, including the Hugging Face Hub, where many datasets can be downloaded and read with the load_dataset function. Learn more about loading data with Hugging Face Datasets in the Hugging Face documentation.

from datasets import load_dataset
dataset = load_dataset("imdb")

Some datasets in the Hugging Face Hub report the size of the data that is downloaded and generated when load_dataset is called. You can use load_dataset_builder to check these sizes before downloading the dataset with load_dataset.

from datasets import load_dataset_builder
from psutil._common import bytes2human

def print_dataset_size_if_provided(*args, **kwargs):
  # Fetch only the dataset metadata; the data files are not downloaded here.
  dataset_builder = load_dataset_builder(*args, **kwargs)
  info = dataset_builder.info

  # Either size can be missing when the uploader did not provide it.
  if info.download_size and info.dataset_size:
    print(f'download_size={bytes2human(info.download_size)}, dataset_size={bytes2human(info.dataset_size)}')
  else:
    print('Dataset size is not provided by uploader')

print_dataset_size_if_provided("imdb")

See the Download datasets from Hugging Face best practices notebook for guidance on how to download and prepare datasets on Databricks for different sizes of data.

Format your training and evaluation data

To use your own data for model fine-tuning, you must first format your training and evaluation data into Spark DataFrames. Then, load the DataFrames using the Hugging Face datasets library.

Start by formatting your training data into a table that meets the trainer's expectations. For text classification, this is a table with two columns: a text column and a column of labels.

To perform fine-tuning, you need to provide a model. The Hugging Face Transformers AutoClasses make it easy to load models and configuration settings, including a wide range of Auto Models for natural language processing.

For example, Hugging Face Transformers provides AutoModelForSequenceClassification as a model loader for text classification, which expects integer IDs as the category labels. If your DataFrame has string labels, you must also specify the mappings between the integer IDs and the string labels when creating the model. You can collect this information as follows:

# Collect the distinct labels in sorted order so the ID assignment is the same on every run.
labels = df.select(df.label).distinct().orderBy("label").collect()
id2label = {index: row.label for (index, row) in enumerate(labels)}
label2id = {row.label: index for (index, row) in enumerate(labels)}

Then, create the integer IDs as a label column with a pandas UDF:

from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf('integer')
def replace_labels_with_ids(labels: pd.Series) -> pd.Series:
  # Map each string label to its integer ID using the label2id dictionary built earlier.
  return labels.apply(lambda x: label2id[x])

# Keep the text column and replace the string label column with integer IDs.
df_id_labels = df.select(replace_labels_with_ids(df.label).alias('label'), df.text)
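The label mappings collected earlier are the ones to specify when creating the model. The following is a minimal sketch of one way to do that, not a definitive implementation. The helper names sequence_classification_kwargs and load_classifier are hypothetical, the two-label mappings are made-up sample data, and the sketch assumes the num_labels, id2label, and label2id keyword arguments that from_pretrained forwards to the model configuration.

```python
def sequence_classification_kwargs(id2label, label2id):
    """Build the label-related keyword arguments for a classification model."""
    # The two dictionaries must be exact inverses of each other.
    if {label: index for index, label in id2label.items()} != label2id:
        raise ValueError("id2label and label2id must be inverse mappings")
    return {
        "num_labels": len(id2label),
        "id2label": id2label,
        "label2id": label2id,
    }


def load_classifier(base_model, id2label, label2id):
    """Load a sequence classification model that knows the label names."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        base_model, **sequence_classification_kwargs(id2label, label2id)
    )


# Made-up sample mappings, in the same shape as id2label and label2id above.
example_id2label = {0: "negative", 1: "positive"}
example_label2id = {"negative": 0, "positive": 1}
example_kwargs = sequence_classification_kwargs(example_id2label, example_label2id)
print(example_kwargs["num_labels"])
```

Calling load_classifier with a base model name of your choice and the real mappings downloads the model weights, so the sketch only builds the keyword arguments.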

Load a Hugging Face dataset from a Spark DataFrame

Hugging Face datasets supports loading from Spark DataFrames using datasets.Dataset.from_spark. See the Hugging Face documentation to learn more about the from_spark() method.

For example, if you have train_df and test_df DataFrames, you can create datasets for each with the following code:

import datasets
train_dataset = datasets.Dataset.from_spark(train_df, cache_dir="/dbfs/cache/train")
test_dataset = datasets.Dataset.from_spark(test_df, cache_dir="/dbfs/cache/test")
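The train_df and test_df DataFrames in this example are assumed to exist already. If you start from a single labeled DataFrame, one possible way to produce them is a random split, sketched here. The helper name split_train_test, the 80/20 default, and the seed are assumptions for illustration; the sketch relies only on the Spark DataFrame randomSplit method.

```python
def split_train_test(df, train_fraction=0.8, seed=42):
    """Randomly split a Spark DataFrame into (train_df, test_df)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # randomSplit returns one DataFrame per weight; a fixed seed makes the
    # split repeatable for the same input data and partitioning.
    train_df, test_df = df.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed
    )
    return train_df, test_df


# Usage on a cluster (not executed here):
# train_df, test_df = split_train_test(df_id_labels)
```

The split sizes are approximate, because randomSplit assigns each row independently according to the weights.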

Dataset.from_spark caches the dataset. This example describes model training on the driver, so data must be made available to it. Additionally, since cache materialization is parallelized using Spark, the provided cache_dir must be accessible to all workers. To satisfy these constraints, cache_dir should be a Databricks File System (DBFS) root volume or mount point.

The DBFS root volume is accessible to all users of the workspace and should only be used for data without access restrictions. If your data requires access controls, use a mount point instead of DBFS root.

If your dataset is large, writing it to DBFS can take a long time. To speed up the process, you can use the working_dir parameter to have Hugging Face datasets write the dataset to a temporary location on disk, then move it to DBFS. For example, to use the SSD as a temporary location:

import datasets
dataset = datasets.Dataset.from_spark(
  train_df,
  cache_dir="/dbfs/cache/train",
  working_dir="/local_disk0/tmp/train",
)

Caching for datasets

Caching is one of the ways datasets improves efficiency. It stores all downloaded and processed datasets, so when you need an intermediate dataset again, it is reloaded directly from the cache.

The default cache directory of datasets is ~/.cache/huggingface/datasets. When a cluster is terminated, the cache data is also lost. To persist the cache files on cluster termination, Databricks recommends changing the cache location to DBFS by setting the environment variable HF_DATASETS_CACHE:

import os
os.environ["HF_DATASETS_CACHE"] = "/dbfs/place/you/want/to/save"
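The datasets library reads HF_DATASETS_CACHE when it is first imported, so the variable should be set before the import for the new location to take effect. The following sketch wraps the assignment in a hypothetical helper, set_datasets_cache, that also reports whether datasets was already imported in the current Python process. The path is the same placeholder as above.

```python
import os
import sys


def set_datasets_cache(cache_dir):
    """Set HF_DATASETS_CACHE and report whether the change can still take effect.

    Returns True if datasets has not been imported yet in this process,
    False if it was already imported (the earlier cache location stays in use).
    """
    os.environ["HF_DATASETS_CACHE"] = cache_dir
    return "datasets" not in sys.modules


# Placeholder path; replace it with your own DBFS location.
applied = set_datasets_cache("/dbfs/place/you/want/to/save")
if not applied:
    print("datasets was already imported; restart Python and set the variable first")
```

If the helper returns False in a notebook, restarting the Python process and running the cell that sets the variable before any cell that imports datasets is one way to apply the new location.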

Fine-tune a model

When your data is ready, you can use it to fine-tune a Hugging Face model.

Notebook: Download datasets from Hugging Face

This example notebook provides recommended best practices for using the Hugging Face load_dataset function to download and prepare datasets on Databricks for different sizes of data.

Download datasets from Hugging Face best practices notebook
