This article provides an introduction to Hugging Face Transformers on Databricks. It includes guidance on why to use Hugging Face Transformers and how to install it on your cluster.
Hugging Face Transformers is an open-source framework for deep learning created by Hugging Face. It provides APIs and tools to download state-of-the-art pre-trained models and further tune them to maximize performance. These models support common tasks in different modalities, such as natural language processing, computer vision, audio, and multi-modal applications.
Databricks Runtime for Machine Learning includes Hugging Face transformers in Databricks Runtime 10.4 LTS ML and above, and includes Hugging Face datasets, accelerate, and evaluate in Databricks Runtime 13.0 ML and above.
To check which version of Hugging Face is included in your configured Databricks Runtime ML version, see the Python libraries section of the relevant release notes.
For many applications, such as sentiment analysis and text summarization, pre-trained models work well without any additional model training.
Hugging Face Transformers pipelines encode best practices and have default models selected for different tasks, making it easy to get started. Pipelines make it easy to use GPUs when available and allow batching of items sent to the GPU for better throughput.
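For example, a minimal sentiment-analysis pipeline looks like the following. The pipeline downloads a default model for the task; passing `device=0` would place it on the first GPU when one is available, and `batch_size` controls how many items are sent to the device at once.

```python
from transformers import pipeline

# Create a pipeline for sentiment analysis; a default model is
# selected and downloaded automatically for this task.
classifier = pipeline("sentiment-analysis")

# Batch multiple inputs through the model in one call.
results = classifier(
    ["I love this product.", "This is terrible."],
    batch_size=2,
)
for result in results:
    print(result["label"], round(result["score"], 3))
```

Each result is a dictionary with a `label` (for example `POSITIVE` or `NEGATIVE`) and a confidence `score`.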
Hugging Face provides:
A model hub containing many pre-trained models.
The 🤗 Transformers library that supports the download and use of these models for NLP applications and fine-tuning. It is common to need both a tokenizer and a model for natural language processing tasks.
🤗 Transformers pipelines that have a simple interface for most natural language processing tasks.
If the Databricks Runtime version on your cluster does not include Hugging Face transformers, you can install the latest Hugging Face transformers library as a Databricks PyPI library.
%pip install transformers
Different models may have different dependencies. Databricks recommends that you use %pip magic commands to install these dependencies as needed.
The following are common dependencies:
librosa: supports decoding audio files.
soundfile: required when generating some audio datasets.
bitsandbytes: required when using load_in_8bit.
SentencePiece: used as the tokenizer for NLP models.
timm: required by DetrForSegmentation.
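As a sketch, a notebook cell installing several of these dependencies at once might look like the following; the exact set depends on the models you plan to use.

```shell
%pip install librosa soundfile sentencepiece timm
```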
To test and migrate single-machine workflows, use a Single Node cluster.
The following articles include example notebooks and guidance for how to use Hugging Face transformers for large language model (LLM) fine-tuning and model inference on Databricks.