Deploy models for batch inference and prediction

This article describes how to deploy MLflow models for offline (batch and streaming) inference. Databricks recommends MLflow for deploying machine learning models for these workloads. For general information about working with MLflow models, see Log, load, and register MLflow models.

Use MLflow for model inference

MLflow helps you generate code for batch or streaming inference.

In the MLflow Model Registry, you can automatically generate a notebook for batch or streaming inference via Delta Live Tables.
In the MLflow Run page for your model, you can copy the generated code snippet for inference on pandas or Apache Spark DataFrames.

You can also customize the code generated by either of the above options. See the following notebooks for examples:

The model inference example uses a model trained with scikit-learn and previously logged to MLflow to show how to load a model and use it to make predictions on data in different formats. The notebook illustrates how to apply the model as a scikit-learn model to a pandas DataFrame, and how to apply the model as a PySpark UDF to a Spark DataFrame.
Manage model lifecycle in Unity Catalog shows how to build, manage, and deploy a model with Model Registry. On that page, you can search for .predict to identify examples of offline (batch) predictions. Also see the example notebook on that page.
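The two loading patterns described for the model inference example can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the model name `my_model`, the version `1`, and the `feature_cols` argument are hypothetical placeholders, and the MLflow and PySpark imports are placed inside the functions so that the definitions can be loaded without those packages installed.

```python
def registered_model_uri(name, version):
    """Build a `models:/<name>/<version>` URI for a registered model."""
    return f"models:/{name}/{version}"


def predict_with_sklearn(model_uri, pdf):
    """Load the model in its scikit-learn flavor and score a pandas DataFrame."""
    import mlflow.sklearn  # requires the mlflow and scikit-learn packages

    model = mlflow.sklearn.load_model(model_uri)
    return model.predict(pdf)


def predict_with_spark_udf(spark, model_uri, sdf, feature_cols):
    """Load the model as a Spark UDF and add a prediction column to a Spark DataFrame."""
    import mlflow.pyfunc
    from pyspark.sql.functions import struct

    predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
    # Pass the feature columns to the UDF as a single struct column.
    return sdf.withColumn("prediction", predict_udf(struct(*feature_cols)))


# Hypothetical model name and version, for illustration only.
MODEL_URI = registered_model_uri("my_model", 1)
```

The pandas path runs on a single node, while the UDF path distributes scoring across the cluster, so the choice mostly depends on how much data you need to score.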

Create a Databricks job

To run batch or streaming predictions as a job, create a notebook or JAR that includes the code used to perform the predictions. Then, execute the notebook or JAR as a Databricks job. Jobs can be run either immediately or on a schedule. See Overview of orchestration on Databricks.

Streaming inference

From the MLflow Model Registry, you can automatically generate a notebook that integrates the MLflow PySpark inference UDF with Delta Live Tables.

You can also modify the generated inference notebook to use the Apache Spark Structured Streaming API.
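One way such a Structured Streaming variant might look is sketched below. It assumes the input and output are tables readable with `spark.readStream.table` and writable with `writeStream.toTable` (Spark 3.1 or later). The function name and all of its parameters are hypothetical and do not come from the generated notebook.

```python
def stream_predictions(spark, model_uri, source_table, target_table,
                       feature_cols, checkpoint_path):
    """Start a streaming query that scores new rows from source_table."""
    import mlflow.pyfunc  # requires the mlflow package
    from pyspark.sql.functions import struct

    # Load the logged model as a Spark UDF.
    predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

    # Read the source table as a stream and append a prediction column.
    scored = (
        spark.readStream.table(source_table)
        .withColumn("prediction", predict_udf(struct(*feature_cols)))
    )

    # The checkpoint location lets the query resume where it left off.
    return (
        scored.writeStream
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)
    )
```

The function returns the `StreamingQuery` handle, so the caller can monitor or stop the stream.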

Inference with deep learning models

For information about and examples of deep learning model inference on Databricks, see the following articles:

Inference with MLlib and XGBoost4J models

For scalable model inference with MLlib and XGBoost4J models, use the native transform methods to perform inference directly on Spark DataFrames. The MLlib example notebooks include inference steps.
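For MLlib, the native transform approach might look like the sketch below. It assumes that the model was saved as a fitted `PipelineModel` and that the input DataFrame has the columns the pipeline expects. The function name and the `model_path` argument are hypothetical. XGBoost4J is not shown here.

```python
def score_with_mllib(model_path, sdf):
    """Load a fitted MLlib pipeline and score a Spark DataFrame with transform."""
    from pyspark.ml import PipelineModel  # requires the pyspark package

    model = PipelineModel.load(model_path)
    # transform returns a new DataFrame with the pipeline's output columns
    # (for example, a prediction column) appended.
    return model.transform(sdf)
```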

Customize and optimize model inference

When you use the MLflow APIs to run inference on Spark DataFrames, you can load the model as a Spark UDF and apply it at scale using distributed computing.

You can customize your model to add pre-processing or post-processing and to optimize computational performance for large models. A good option for customizing models is the MLflow pyfunc API, which allows you to wrap a model with custom logic.
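As a minimal sketch of wrapping a model with custom logic, the example below subclasses `mlflow.pyfunc.PythonModel` to add a pre-processing and a post-processing step. The imputation with `fillna(0)`, the `predict_proba` call, the threshold, and the names `apply_threshold` and `build_wrapper` are all assumptions chosen for illustration. The wrapper class is defined inside a function only so that the MLflow import can stay lazy.

```python
def apply_threshold(scores, threshold=0.5):
    """Post-processing: turn probability scores into 0/1 labels."""
    return [int(score >= threshold) for score in scores]


def build_wrapper(base_model, threshold=0.5):
    """Return a pyfunc model that wraps base_model with custom logic."""
    import mlflow.pyfunc  # requires the mlflow package

    class ThresholdModel(mlflow.pyfunc.PythonModel):
        def __init__(self, model, threshold):
            self.model = model
            self.threshold = threshold

        def predict(self, context, model_input, params=None):
            # Pre-processing (hypothetical): replace missing values with 0.
            features = model_input.fillna(0)
            # Assumes the wrapped model exposes predict_proba, as
            # scikit-learn classifiers do.
            scores = self.model.predict_proba(features)[:, 1]
            return apply_threshold(scores, self.threshold)

    return ThresholdModel(base_model, threshold)
```

The returned object can then be logged by passing it as the `python_model` argument of `mlflow.pyfunc.log_model`. After that, it loads like any other MLflow model, including as a Spark UDF.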

If you need to do further customization, you can manually wrap your machine learning model in a pandas UDF or a pandas Iterator UDF. See the deep learning examples.
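The main benefit of the iterator variant is that an expensive model load happens once per task rather than once per batch. The sketch below illustrates this under some assumptions: `load_model` is a hypothetical zero-argument callable that returns a model, and that model's `predict` method accepts a pandas Series of inputs. Neither name is part of a Databricks or MLflow API.

```python
from typing import Iterator


def score_batches(load_model, batches):
    """Load the model once, then yield predictions for each batch."""
    model = load_model()  # expensive step, done once for the whole iterator
    for batch in batches:
        yield model.predict(batch)


def make_iterator_udf(load_model, return_type="double"):
    """Wrap score_batches in a pandas Iterator UDF (Series to Series)."""
    import pandas as pd  # requires the pandas and pyspark packages
    from pyspark.sql.functions import pandas_udf

    @pandas_udf(return_type)
    def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for predictions in score_batches(load_model, batches):
            yield pd.Series(predictions)

    return predict_udf
```

The resulting UDF can be applied to a column like any other, for example `sdf.withColumn("prediction", predict_udf("input_col"))`, where `input_col` is again a placeholder.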

For smaller datasets, you can also use the native inference routines provided by the library that the model was trained with, such as the predict method of a scikit-learn model.