Query serving endpoints for custom models

Preview

Mosaic AI Model Serving is in Public Preview and is supported in us-east1 and us-central1.

In this article, learn how to format scoring requests for your served model, and how to send those requests to the model serving endpoint. The guidance is relevant to serving custom models, which Databricks defines as traditional ML models or customized Python models packaged in the MLflow format. The models must be registered in Unity Catalog. Examples include scikit-learn, XGBoost, PyTorch, and Hugging Face transformer models. See Deploy models using Mosaic AI Model Serving for more information about this functionality and supported model categories.

For query requests for generative AI and LLM workloads, see Query foundation models.

Requirements

A model serving endpoint.
For the MLflow Deployment SDK, MLflow 2.9 or above is required.
Scoring request in an accepted format.
To send a scoring request through the REST API or MLflow Deployment SDK, you must have a Databricks API token.

Querying methods and examples

Mosaic AI Model Serving provides the following options for sending scoring requests to served models:

Method	Details
Serving UI	Select Query endpoint from the Serving endpoint page in your Databricks workspace. Insert JSON format model input data and click Send Request. If the model has an input example logged, use Show Example to load it.
REST API	Call and query the model using the REST API. See POST /serving-endpoints/{name}/invocations for details. For scoring requests to endpoints serving multiple models, see Query individual models behind an endpoint.
MLflow Deployments SDK	Use MLflow Deployments SDK’s predict() function to query the model.

Pandas DataFrame scoring example

The following example assumes a MODEL_VERSION_URI like https://<databricks-instance>/model/iris-classifier/Production/invocations, where <databricks-instance> is the name of your Databricks instance, and a Databricks REST API token called DATABRICKS_API_TOKEN.

See Supported scoring formats.

Score a model accepting dataframe split input format.

curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
  -H 'Content-Type: application/json' \
  -d '{"dataframe_split": [{
    "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
    "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
    }]
  }'

Score a model accepting tensor inputs. Tensor inputs should be formatted as described in TensorFlow Serving’s API documentation.

curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
  -H 'Content-Type: application/json' \
  -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'

Important

The following example uses the predict() API from the MLflow Deployments SDK.

import mlflow.deployments

export DATABRICKS_HOST="https://<workspace_host>.databricks.com"
export DATABRICKS_TOKEN="dapi-your-databricks-token"

client = mlflow.deployments.get_deploy_client("databricks")

response = client.predict(
            endpoint="test-model-endpoint",
            inputs={"dataframe_split": {
                    "index": [0, 1],
                    "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
                    "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
                    }
                }
          )

You can score a dataset in Power BI Desktop using the following steps:

Open dataset you want to score.
Go to Transform Data.
Right-click in the left panel and select Create New Query.
Go to View > Advanced Editor.

Replace the query body with the code snippet below, after filling in an appropriate DATABRICKS_API_TOKEN and MODEL_VERSION_URI.

(dataset as table ) as table =>
let
  call_predict = (dataset as table ) as list =>
  let
    apiToken = DATABRICKS_API_TOKEN,
    modelUri = MODEL_VERSION_URI,
    responseList = Json.Document(Web.Contents(modelUri,
      [
        Headers = [
          #"Content-Type" = "application/json",
          #"Authorization" = Text.Format("Bearer #{0}", {apiToken})
        ],
        Content = {"dataframe_records": Json.FromValue(dataset)}
      ]
    ))
  in
    responseList,
  predictionList = List.Combine(List.Transform(Table.Split(dataset, 256), (x) => call_predict(x))),
  predictionsTable = Table.FromList(predictionList, (x) => {x}, {"Prediction"}),
  datasetWithPrediction = Table.Join(
    Table.AddIndexColumn(predictionsTable, "index"), "index",
    Table.AddIndexColumn(dataset, "index"), "index")
in
  datasetWithPrediction

Name the query with your desired model name.
Open the advanced query editor for your dataset and apply the model function.

Tensor input example

The following example scores a model accepting tensor inputs. Tensor inputs should be formatted as described in TensorFlow Serving’s API docs. This example assumes a MODEL_VERSION_URI like https://<databricks-instance>/model/iris-classifier/Production/invocations, where <databricks-instance> is the name of your Databricks instance, and a Databricks REST API token called DATABRICKS_API_TOKEN.

curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
    -H 'Content-Type: application/json' \
    -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'

Supported scoring formats

For custom models, Model Serving supports scoring requests in Pandas DataFrame or Tensor input.

Pandas DataFrame

Requests should be sent by constructing a JSON-serialized Pandas DataFrame with one of the supported keys and a JSON object corresponding to the input format.

(Recommended)dataframe_split format is a JSON-serialized Pandas DataFrame in the split orientation.

{
  "dataframe_split": {
    "index": [0, 1],
    "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
    "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
  }
}

dataframe_records is JSON-serialized Pandas DataFrame in the records orientation.

Note

This format does not guarantee the preservation of column ordering, and the split format is preferred over the records format.

{
  "dataframe_records": [
  {
    "sepal length (cm)": 5.1,
    "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4,
    "petal width (cm)": 0.2
  },
  {
    "sepal length (cm)": 4.9,
    "sepal width (cm)": 3,
    "petal length (cm)": 1.4,
    "petal width (cm)": 0.2
  },
  {
    "sepal length (cm)": 4.7,
    "sepal width (cm)": 3.2,
    "petal length (cm)": 1.3,
    "petal width (cm)": 0.2
  }
  ]
}

The response from the endpoint contains the output from your model, serialized with JSON, wrapped in a predictions key.

{
  "predictions": [0,1,1,1,0]
}

Tensor input

When your model expects tensors, like a TensorFlow or Pytorch model, there are two supported format options for sending requests: instances and inputs.

If you have multiple named tensors per row, then you have to have one of each tensor for every row.

instances is a tensors-based format that accepts tensors in row format. Use this format if all the input tensors have the same 0-th dimension. Conceptually, each tensor in the instances list could be joined with the other tensors of the same name in the rest of the list to construct the full input tensor for the model, which would only be possible if all of the tensors have the same 0-th dimension.
```
{"instances": [ 1, 2, 3 ]}
```
The following example shows how to specify multiple named tensors.
```
{
 "instances": [
  {
   "t1": "a",
   "t2": [1, 2, 3, 4, 5],
   "t3": [[1, 2], [3, 4], [5, 6]]
  },
  {
   "t1": "b",
   "t2": [6, 7, 8, 9, 10],
   "t3": [[7, 8], [9, 10], [11, 12]]
  }
 ]
}
```
inputs send queries with tensors in columnar format. This request is different because there are actually a different number of tensor instances of t2 (3) than t1 and t3, so it is not possible to represent this input in the instances format.
```
{
 "inputs": {
  "t1": ["a", "b"],
  "t2": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
  "t3": [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]
 }
}
```

The response from the endpoint is in the following format.

{
  "predictions": [0,1,1,1,0]
}

Notebook example

See the following notebook for an example of how to test your Model Serving endpoint with a Python model:

Test Model Serving endpoint notebook

Open notebook in new tab