Deploy models for inference and prediction


The managed MLflow integration with Databricks on Google Cloud requires Databricks Runtime for Machine Learning 8.1 or above.

Databricks recommends that you use MLflow to deploy machine learning models. You can use MLflow to deploy models for batch or streaming inference or to set up a REST endpoint to serve the model.

This article describes how to deploy MLflow models for offline (batch and streaming) inference and online (real-time) serving. For general information about working with MLflow models, see Log, load, register, and deploy MLflow Models.


You can simplify model deployment by registering models to the MLflow Model Registry. After you have registered your model, you can automatically generate a notebook for batch inference or configure the model for online serving.

Offline (batch) predictions

This section includes instructions and examples for setting up batch predictions on Databricks.

Use MLflow for model inference

MLflow helps you generate code for batch or streaming inference.

You can also customize the generated code. See the following notebooks for examples:

  • The model inference example uses a model trained with scikit-learn and previously logged to MLflow to show how to load a model and use it to make predictions on data in different formats. The notebook illustrates how to apply the model as a scikit-learn model to a pandas DataFrame, and how to apply the model as a PySpark UDF to a Spark DataFrame.

  • The MLflow Model Registry example shows how to build, manage, and deploy a model with Model Registry. On that page, you can search for .predict to identify examples of offline (batch) predictions.
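The two patterns that the model inference example describes can be sketched as follows. This is a minimal sketch, not the notebook's code: the helper names (registry_uri, predict_pandas, predict_spark) and the "prediction" column name are assumptions made for illustration. The MLflow and PySpark imports are deferred until the functions are called.

```python
def registry_uri(model_name, stage_or_version):
    """Build a Model Registry URI of the form models:/<name>/<stage-or-version>."""
    return f"models:/{model_name}/{stage_or_version}"


def predict_pandas(model_uri, pandas_df):
    """Load the model in its native scikit-learn flavor and score a pandas DataFrame."""
    import mlflow.sklearn  # deferred import: only needed when scoring

    model = mlflow.sklearn.load_model(model_uri)
    return model.predict(pandas_df)


def predict_spark(spark, model_uri, spark_df, feature_cols):
    """Apply the model as a Spark UDF; returns spark_df plus a 'prediction' column."""
    import mlflow.pyfunc
    from pyspark.sql.functions import struct

    predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
    # Pass the feature columns as one struct so the UDF sees them by name.
    return spark_df.withColumn("prediction", predict_udf(struct(*feature_cols)))
```

For example, predict_spark(spark, registry_uri("my_model", "Production"), df, df.columns) would score every row of df, assuming a model with the hypothetical name my_model has a version in the Production stage.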

Create a Databricks job

To run batch or streaming predictions as a job, create a notebook or JAR that includes the code used to perform the predictions. Then run the notebook or JAR as a Databricks job, either immediately or on a schedule.

Streaming inference

For streaming applications, use the Apache Spark Structured Streaming API, which closely resembles the API for batch operations. You can use the automatically generated batch inference notebook described earlier in this article as a template and modify it to use streaming instead of batch. See the Apache Spark MLlib pipelines and Structured Streaming example.
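As a rough illustration of how little changes between batch and streaming, the sketch below applies an MLflow model as a Spark UDF to a streaming DataFrame. The Delta source and sink, the path parameters, the function name score_stream, and the "prediction" column are assumptions for illustration, not details taken from the generated notebook.

```python
def score_stream(spark, model_uri, source_path, sink_path, checkpoint_path, feature_cols):
    """Score a streaming Delta source and append the results to a Delta sink."""
    import mlflow.pyfunc  # deferred imports: only needed when the stream starts
    from pyspark.sql.functions import struct

    predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

    # readStream replaces read; the transformation itself is the same as in batch.
    stream_df = spark.readStream.format("delta").load(source_path)
    scored = stream_df.withColumn("prediction", predict_udf(struct(*feature_cols)))

    # writeStream replaces write; a checkpoint location lets the query recover.
    return (
        scored.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .start(sink_path)
    )
```

The function returns the StreamingQuery handle, so the caller can monitor or stop the stream.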

Inference with deep learning models

For information about and examples of deep learning model inference on Databricks, see the following articles:

Inference with MLlib and XGBoost4J models

For scalable model inference with MLlib and XGBoost4J models, use the native transform methods to perform inference directly on Spark DataFrames. The MLlib example notebooks include inference steps.
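For an MLlib pipeline, the native transform call can look like the sketch below. It assumes the fitted pipeline was previously saved with PipelineModel.save; the function name score_with_mllib and the path argument are hypothetical.

```python
def score_with_mllib(model_path, df):
    """Load a saved MLlib PipelineModel and score a Spark DataFrame with transform."""
    from pyspark.ml import PipelineModel  # deferred import: requires PySpark

    model = PipelineModel.load(model_path)
    # transform runs distributed and appends the pipeline's output columns to df.
    return model.transform(df)
```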

Customize and optimize model inference

When you use the MLflow APIs to run inference on Spark DataFrames, you can load the model as a Spark UDF and apply it at scale using distributed computing.

You can customize your model to add pre-processing or post-processing and to optimize computational performance for large models. A good option for customizing models is the MLflow pyfunc API, which allows you to wrap a model with custom logic.
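One way to add post-processing with the pyfunc API is sketched below: a wrapper that turns the scores of an inner model into 0/1 labels. The names to_labels, build_thresholded_model, and ThresholdedModel, and the 0.5 threshold, are assumptions for illustration. The inner model is assumed to expose a predict method that returns numeric scores.

```python
def to_labels(scores, threshold=0.5):
    """Post-processing step: map each numeric score to 1 if it meets the threshold, else 0."""
    return [int(score >= threshold) for score in scores]


def build_thresholded_model(inner_model, threshold=0.5):
    """Wrap inner_model in a custom pyfunc model that returns labels instead of scores."""
    import mlflow.pyfunc  # deferred import: only needed when building the wrapper

    class ThresholdedModel(mlflow.pyfunc.PythonModel):
        def __init__(self, model, cutoff):
            self.model = model
            self.cutoff = cutoff

        def predict(self, context, model_input):
            # Any pre-processing of model_input would go here, before the inner call.
            scores = self.model.predict(model_input)
            return to_labels(scores, self.cutoff)

    return ThresholdedModel(inner_model, threshold)
```

The object returned by build_thresholded_model can then be logged with mlflow.pyfunc.log_model by passing it as the python_model argument, after which it loads and runs like any other MLflow model.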

If you need further customization, you can manually wrap your machine learning model in a pandas UDF or a pandas Iterator UDF. See the deep learning examples.
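The main benefit of a pandas Iterator UDF is that an expensive model load happens once per iterator of batches rather than once per batch. The sketch below separates that logic into a plain generator so it can be read on its own. It assumes Spark 3.0 or above, a single numeric input column, and a hypothetical load_model callable that returns an object whose predict method accepts one batch; none of these names come from the Databricks examples.

```python
from typing import Iterator


def predict_in_batches(batches, load_model):
    """Load the model once, then yield predictions for each incoming batch."""
    model = load_model()  # the expensive load happens once, before the first batch
    for batch in batches:
        yield model.predict(batch)


def make_iterator_udf(load_model):
    """Build a pandas Iterator UDF (Iterator[Series] -> Iterator[Series]) around the generator."""
    import pandas as pd  # deferred imports: require pandas and PySpark
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for predictions in predict_in_batches(batches, load_model):
            yield pd.Series(predictions)

    return predict_udf
```

The resulting UDF is applied like any other column expression, for example df.withColumn("prediction", predict_udf("feature")), where "feature" is a hypothetical column name.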

For smaller datasets, you can also use the native inference routines of the machine learning library that you used to train the model.

MLflow Model Serving

For Python MLflow models, Databricks provides MLflow Model Serving, which hosts machine learning models from the Model Registry as REST endpoints. The endpoints are updated automatically based on the availability of model versions and their stages.
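A client request to such an endpoint can be sketched with the Python standard library alone. The sketch assumes the endpoint path has the form /model/<model-name>/<version-or-stage>/invocations and that authentication uses a personal access token as a bearer token; the function name build_scoring_request is hypothetical. The JSON payload shape that the endpoint accepts depends on the MLflow version, so the caller supplies it already shaped.

```python
import json
import urllib.request


def build_scoring_request(workspace_url, model_name, stage_or_version, token, payload):
    """Build (but do not send) a POST request to a served model's invocations URL."""
    # Assumed URL layout: <workspace>/model/<name>/<version-or-stage>/invocations
    url = f"{workspace_url.rstrip('/')}/model/{model_name}/{stage_or_version}/invocations"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

To send the request, pass the returned object to urllib.request.urlopen and decode the JSON response body.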

Third-party model serving

To deploy a model to third-party serving frameworks, use mlflow.<deploy-type>.deploy().