AI and Machine Learning on Databricks
This article describes the tools that Databricks provides to help you build and monitor AI and ML workflows. The diagram shows how these components work together to help you implement your model development and deployment process.
Why use Databricks for machine learning and deep learning?
With Databricks, a single platform serves every step of the model development and deployment process, from raw data to inference tables that save every request and response for a served model. Data scientists, data engineers, ML engineers and DevOps can do their jobs using the same set of tools and a single source of truth for the data.
With the data intelligence platform, the ML platform and the data stack are the same system. The ML platform is built on top of the data layer. All data assets and artifacts, such as models and functions, are discoverable and governed in a single catalog. Using a single platform for data and models makes it possible to track lineage from the raw data to the production model. Built-in data and model monitoring saves quality metrics to tables that are also stored in the platform, making it easier to identify the root cause of model performance problems. For more information about how Databricks supports the full ML lifecycle and MLOps, see MLOps workflows on Databricks and MLOps Stacks: model development process as code.
Some of the key components of the data intelligence platform are:
Tasks |
Component |
---|---|
Govern and manage data, features, models, and functions. Also discovery, versioning, and lineage. |
|
Feature development and management |
|
Train models |
|
Track model development |
|
Serve custom models |
|
Build automated workflows and production-ready ETL pipelines |
|
Git integration |
Deep learning on Databricks
Configuring infrastructure for deep learning applications can be difficult.
Databricks Runtime for Machine Learning takes care of that for you, with clusters that have built-in compatible versions of the most common deep learning libraries like TensorFlow, PyTorch, and Keras. Databricks Runtime ML clusters also include pre-configured GPU support with drivers and supporting libraries. It also supports libraries like Ray to parallelize compute processing for scaling ML workflows and AI applications.
For machine learning applications, Databricks recommends using a cluster running Databricks Runtime for Machine Learning. See Create a cluster using Databricks Runtime ML.
To get started with deep learning on Databricks, see:
Large language models (LLMs) and generative AI on Databricks
Databricks Runtime for Machine Learning includes libraries like Hugging Face Transformers and LangChain that allow you to integrate existing pre-trained models or other open-source libraries into your workflow. The Databricks MLflow integration makes it easy to use the MLflow tracking service with transformer pipelines, models, and processing components. In addition, you can integrate OpenAI models or solutions from partners like John Snow Labs in your Databricks workflows.
With Databricks, you can customize a LLM on your data for your specific task. With the support of open source tooling, such as Hugging Face and DeepSpeed, you can efficiently take a foundation LLM and train it with your own data to improve its accuracy for your specific domain and workload. You can then leverage the custom LLM in your generative AI applications.
Databricks Runtime for Machine Learning
Databricks Runtime for Machine Learning (Databricks Runtime ML) automates the creation of a cluster with pre-built machine learning and deep learning infrastructure including the most common ML and DL libraries. For the full list of libraries in each version of Databricks Runtime ML, see the release notes.
To access data in Unity Catalog for machine learning workflows, the access mode for the cluster must be single user (assigned). Shared clusters are not compatible with Databricks Runtime for Machine Learning. In addition, Databricks Runtime ML is not supported on TableACLs clusters or clusters with spark.databricks.pyspark.enableProcessIsolation config
set to true
.
Create a cluster using Databricks Runtime ML
When you create a cluster, select a Databricks Runtime ML version from the Databricks runtime version drop-down menu. Both CPU and GPU-enabled ML runtimes are available.
If you select a cluster from the drop-down menu in the notebook, the Databricks Runtime version appears at the right of the cluster name:
If you select a GPU-enabled ML runtime, you are prompted to select a compatible Driver type and Worker type. Incompatible instance types are grayed out in the drop-down menu. GPU-enabled instance types are listed under the GPU accelerated label.
Note
To access data in Unity Catalog for machine learning workflows, the access mode for the cluster must be single user (assigned). Shared clusters are not compatible with Databricks Runtime for Machine Learning. For details about how to create a cluster, see Compute configuration reference.
Photon and Databricks Runtime ML
When you create a CPU cluster running Databricks Runtime 15.2 ML or above, you can choose to enable Photon. Photon improves performance for applications using Spark SQL, Spark DataFrames, feature engineering, GraphFrames, and xgboost4j. It is not expected to improve performance on applications using Spark RDDs, Pandas UDFs, and non-JVM languages such as Python. Thus, Python packages such as XGBoost, PyTorch, and TensorFlow will not see an improvement with Photon.
Spark RDD APIs and Spark MLlib have limited compatibility with Photon. When processing large datasets using Spark RDD or Spark MLlib, you may experience Spark memory issues. See Spark memory issues.
Libraries included in Databricks Runtime ML
Databricks Runtime ML includes a variety of popular ML libraries. The libraries are updated with each release to include new features and fixes.
Databricks has designated a subset of the supported libraries as top-tier libraries. For these libraries, Databricks provides a faster update cadence, updating to the latest package releases with each runtime release (barring dependency conflicts). Databricks also provides advanced support, testing, and embedded optimizations for top-tier libraries.
For a full list of top-tier and other provided libraries, see the release notes for Databricks Runtime ML.