Serve multiple models to a model serving endpoint

Preview

Mosaic AI Model Serving is in Public Preview and is supported in us-east1 and us-central1.

This article describes how to programmatically configure a model serving endpoint to serve multiple models and the traffic split between them.

Serving multiple models from a single endpoint enables you to split traffic between different models to compare their performance and facilitate A/B testing. You can also serve different versions of a model at the same time, which makes experimenting with new versions easier, while keeping the current version in production.

You can serve multiple models of the same type, such as multiple custom models or multiple external models, on a Mosaic AI Model Serving endpoint. You cannot serve different model types on a single endpoint. For example, you cannot serve a custom model and an external model on the same endpoint.

Requirements

See the Requirements for model serving endpoint creation.

To understand access control options for model serving endpoints and best practice guidance for endpoint management, see Serving endpoint ACLs.

Create an endpoint and set the initial traffic split

When you create model serving endpoints using the Databricks Mosaic AI serving API or the Databricks Mosaic AI serving UI, you can also set the initial traffic split for the models you want to serve on that endpoint. The following sections provide examples of setting the traffic split for multiple custom models or generative AI models served on an endpoint.

Serve multiple custom models to an endpoint

The following REST API example creates a single endpoint with two custom models in Unity Catalog and sets the endpoint traffic split between those models. The served entity, current, hosts version 1 of model-A and gets 90% of the endpoint traffic, while the other served entity, challenger, hosts version 1 of model-B and gets 10% of the endpoint traffic.

POST /api/2.0/serving-endpoints

{
   "name":"multi-model"
   "config":
   {
      "served_entities":
      [
         {
            "name":"current",
            "entity_name":"catalog.schema.model-A",
            "entity_version":"1",
            "workload_size":"Small",
            "scale_to_zero_enabled":true
         },
         {
            "name":"challenger",
            "entity_name":"catalog.schema.model-B",
            "entity_version":"1",
            "workload_size":"Small",
            "scale_to_zero_enabled":true
         }
      ],
      "traffic_config":
      {
         "routes":
         [
            {
               "served_model_name":"current",
               "traffic_percentage":"90"
            },
            {
               "served_model_name":"challenger",
               "traffic_percentage":"10"
            }
         ]
      }
   }
}
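
If you prefer to create this endpoint from Python rather than calling the REST API directly, a minimal sketch using the MLflow Deployments client (the same client used for the external model example later in this article) looks like the following. The catalog, schema, and model names are the placeholders from the example above.

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

client.create_endpoint(
    name="multi-model",
    config={
        "served_entities": [
            {
                "name": "current",
                "entity_name": "catalog.schema.model-A",
                "entity_version": "1",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            },
            {
                "name": "challenger",
                "entity_name": "catalog.schema.model-B",
                "entity_version": "1",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            },
        ],
        "traffic_config": {
            "routes": [
                # 90% of endpoint traffic to "current", 10% to "challenger"
                {"served_model_name": "current", "traffic_percentage": 90},
                {"served_model_name": "challenger", "traffic_percentage": 10},
            ]
        },
    },
)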

Serve multiple external models to an endpoint

You can also configure multiple external models in a serving endpoint as long as they all have the same task type and each model has a unique name. You cannot have both external models and non-external models in the same serving endpoint.

The following example creates a serving endpoint that routes 50% of the traffic to gpt-4 provided by OpenAI and the remaining 50% to claude-3-opus-20240229 provided by Anthropic.

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

client.create_endpoint(
    name="mix-chat-endpoint",
    config={
        "served_entities": [
            {
                "name": "served_model_name_1",
                "external_model": {
                    "name": "gpt-4",
                    "provider": "openai",
                    "task": "llm/v1/chat",
                    "openai_config": {
                        "openai_api_key": "{{secrets/my_openai_secret_scope/openai_api_key}}"
                    }
                }
            },
            {
                "name": "served_model_name_2",
                "external_model": {
                    "name": "claude-3-opus-20240229",
                    "provider": "anthropic",
                    "task": "llm/v1/chat",
                    "anthropic_config": {
                        "anthropic_api_key": "{{secrets/my_anthropic_secret_scope/anthropic_api_key}}"
                    }
                }
            }
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": "served_model_name_1", "traffic_percentage": 50},
                {"served_model_name": "served_model_name_2", "traffic_percentage": 50}
            ]
        },
    }
)
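
After the endpoint is ready, you can query it like any other chat endpoint; the traffic configuration determines which external model serves each request. The following is a minimal sketch using the same MLflow Deployments client, where the prompt text and max_tokens value are illustrative.

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

# Send a chat request; the endpoint routes it to one of the external models
# according to the traffic split (50% gpt-4, 50% claude-3-opus-20240229).
response = client.predict(
    endpoint="mix-chat-endpoint",
    inputs={
        "messages": [{"role": "user", "content": "Summarize model serving in one sentence."}],
        "max_tokens": 128,
    },
)
print(response)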

Update the traffic split between served models

You can also update the traffic split between served models. The following REST API example sets the served model, current, to get 50% of the endpoint traffic and the other model, challenger, to get the remaining 50% of the traffic.

You can also make this update from the Serving tab in the Databricks Mosaic AI UI using the Edit configuration button.

PUT /api/2.0/serving-endpoints/{name}/config

{
   "served_entities":
   [
      {
         "name":"current",
         "entity_name":"catalog.schema.model-A",
         "entity_version":"1",
         "workload_size":"Small",
         "scale_to_zero_enabled":true
      },
      {
         "name":"challenger",
         "entity_name":"catalog.schema.model-B",
         "entity_version":"1",
         "workload_size":"Small",
         "scale_to_zero_enabled":true
      }
   ],
   "traffic_config":
   {
      "routes":
      [
         {
            "served_model_name":"current",
            "traffic_percentage":"50"
         },
         {
            "served_model_name":"challenger",
            "traffic_percentage":"50"
         }
      ]
   }
}
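
You can also issue the PUT call above from Python. The following sketch uses the requests library and assumes your workspace URL and a personal access token are available in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables; the endpoint and model names are the placeholders from the example above.

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # for example, https://<workspace-url>
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token

new_config = {
    "served_entities": [
        {
            "name": "current",
            "entity_name": "catalog.schema.model-A",
            "entity_version": "1",
            "workload_size": "Small",
            "scale_to_zero_enabled": True,
        },
        {
            "name": "challenger",
            "entity_name": "catalog.schema.model-B",
            "entity_version": "1",
            "workload_size": "Small",
            "scale_to_zero_enabled": True,
        },
    ],
    "traffic_config": {
        "routes": [
            # Split endpoint traffic evenly between the two served models.
            {"served_model_name": "current", "traffic_percentage": 50},
            {"served_model_name": "challenger", "traffic_percentage": 50},
        ]
    },
}

# PUT the new configuration to the endpoint named "multi-model".
response = requests.put(
    f"{host}/api/2.0/serving-endpoints/multi-model/config",
    headers={"Authorization": f"Bearer {token}"},
    json=new_config,
)
response.raise_for_status()
print(response.json())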

Query individual models behind an endpoint

In some scenarios, you might want to query individual models behind the endpoint.

You can do so by using:

POST /serving-endpoints/{endpoint-name}/served-models/{served-model-name}/invocations

This request queries the specified served model directly. The request format is the same as for querying the endpoint itself. When you query an individual served model, the traffic settings are ignored.

In the context of the multi-model endpoint example, if all requests are sent to /serving-endpoints/multi-model/served-models/challenger/invocations, then all requests are served by the challenger served model.
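
As a sketch, such a request could be sent from Python with the requests library. The DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are assumptions for illustration, and the input record below is a placeholder; match the payload to the input schema of your served model.

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # for example, https://<workspace-url>
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token

# Query only the "challenger" served model behind the "multi-model" endpoint.
# The traffic split is ignored for this request.
response = requests.post(
    f"{host}/serving-endpoints/multi-model/served-models/challenger/invocations",
    headers={"Authorization": f"Bearer {token}"},
    # Illustrative payload; replace the fields with your model's input columns.
    json={"dataframe_records": [{"feature_1": 1.0, "feature_2": 2.0}]},
)
response.raise_for_status()
print(response.json())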