Evaluate large language models with MLflow

This article introduces MLflow LLM Evaluate, MLflow’s large language model (LLM) evaluation functionality packaged in mlflow.evaluate. It also describes what you need to evaluate your LLM and which evaluation metrics are supported.

What is MLflow LLM Evaluate?

Evaluating LLM performance is slightly different from evaluating traditional ML models, because there is often no single ground truth to compare against. MLflow provides an API, mlflow.evaluate(), to help evaluate your LLMs.

MLflow’s LLM evaluation functionality consists of three main components, which come together in a call to mlflow.evaluate() as sketched after this list:

  • A model to evaluate: this can be an MLflow pyfunc model, a DataFrame with a predictions column, a URI that points to a registered MLflow model, or any Python callable that represents your model, such as a HuggingFace text summarization pipeline.

  • Metrics: the metrics to compute. LLM Evaluate uses LLM metrics.

  • Evaluation data: the data your model is evaluated on. It can be a Pandas DataFrame, a Python list, a NumPy array, or an mlflow.data.dataset.Dataset instance.
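
The following minimal sketch shows how these three components come together in a single call; the model URI, question, and reference answer below are placeholders for illustration, not working values.

import mlflow
import pandas as pd

# Evaluation data: model inputs plus (optional) ground truth.
eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "ground_truth": ["MLflow is an open-source platform for managing the ML lifecycle."],
    }
)

results = mlflow.evaluate(
    "runs:/<run-id>/model",  # model to evaluate: a pyfunc model, a model URI, or a Python callable
    eval_data,  # evaluation data
    targets="ground_truth",
    model_type="question-answering",  # selects the default LLM metrics for this model type
)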

Requirements

  • MLflow 2.8 and above.

  • In order to evaluate your LLM with mlflow.evaluate(), your LLM has to be one of the following:

    • An mlflow.pyfunc.PyFuncModel instance or a URI pointing to a logged mlflow.pyfunc.PyFuncModel model.

    • A custom Python function that takes in string inputs and outputs a single string. Your callable must match the signature of mlflow.pyfunc.PyFuncModel.predict without a params argument (a minimal example follows this list). The function should:

      • Have data as the only argument, which can be a pandas.DataFrame, numpy.ndarray, Python list, dictionary, or SciPy matrix.

      • Return one of the following: pandas.DataFrame, pandas.Series, numpy.ndarray or list.

    • A static dataset.
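
For illustration, a minimal callable that satisfies these requirements could look like the following sketch. The function name echo_model and its trivial behavior are placeholders for your own inference logic, and the sketch assumes the evaluation data is a pandas.DataFrame with an inputs column.

import pandas as pd


def echo_model(data: pd.DataFrame) -> list:
    # data is the only argument; return one output string per input row.
    return [f"You asked: {question}" for question in data["inputs"]]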

Evaluate with an MLflow model

You can evaluate your LLM as an MLflow model. For detailed instructions on how to convert your model into an mlflow.pyfunc.PyFuncModel instance, see Create a custom pyfunc model.

To evaluate your model as an MLflow model, Databricks recommends following these steps:

Note

To successfully log a model targeting Azure OpenAI Service, you must specify the following environment variables for authentication and functionality. See the OpenAI with MLflow documentation for more details.

import os

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"
os.environ["OPENAI_API_BASE"] = "https://<>.<>.<>.com/"
os.environ["OPENAI_DEPLOYMENT_NAME"] = "deployment-name"

  1. Package your LLM as an MLflow model and log it to the MLflow server using log_model. Each flavor (openai, pytorch, …) has its own log_model API, such as mlflow.openai.log_model():

    import mlflow
    import openai

    with mlflow.start_run():
        system_prompt = "Answer the following question in two sentences"
        # Wrap "gpt-3.5-turbo" as an MLflow model.
        logged_model_info = mlflow.openai.log_model(
            model="gpt-3.5-turbo",
            task=openai.ChatCompletion,
            artifact_path="model",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "{question}"},
            ],
        )
    
  2. Use the URI of the logged model as the model argument in mlflow.evaluate() (a sample eval_data definition follows these steps):

    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
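
In the snippet above, eval_data is an evaluation dataset as described in Requirements. A minimal definition, mirroring the DataFrame used in later sections of this article (with the reference answers shortened for brevity), might look like the following:

import pandas as pd

# Evaluation questions plus the reference answers used as "ground_truth" targets.
eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?", "What is Spark?"],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.",
            "Apache Spark is an open-source, distributed computing system for big data processing and analytics.",
        ],
    }
)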
    

Evaluate with a custom function

In MLflow 2.8.0 and above, mlflow.evaluate() supports evaluating a Python function without requiring that the model be logged to MLflow. This is useful when you don’t want to log the model and just want to evaluate it. The following example uses mlflow.evaluate() to evaluate a function.

You also need to set up OpenAI authentication to run the following code:

import mlflow
import openai
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, offering improvements in speed and ease of use. Spark provides libraries for various tasks such as data ingestion, processing, and analysis through its components like Spark SQL for structured data, Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)


def openai_qa(inputs):
    answers = []
    system_prompt = "Please answer the following question in formal language."
    for index, row in inputs.iterrows():
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": row["inputs"]},
            ],
        )
        answers.append(completion.choices[0].message.content)

    return answers


with mlflow.start_run() as run:
    results = mlflow.evaluate(
        openai_qa,
        eval_data,
        model_type="question-answering",
    )

Evaluate with a static dataset

For MLflow 2.8.0 and above, mlflow.evaluate() supports evaluating a static dataset without specifying a model. This is useful when you save the model output to a column in a Pandas DataFrame or an MLflow PandasDataset, and want to evaluate the static dataset without re-running the model.

Set model=None, and put model outputs in the data argument. This configuration is only applicable when the data is a Pandas DataFrame.

If you are using a Pandas DataFrame, you must specify the column name that contains the model output using the top-level predictions parameter in mlflow.evaluate():

import mlflow
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. "
            "It was developed by Databricks, a company that specializes in big data and machine learning solutions. "
            "MLflow is designed to address the challenges that data scientists and machine learning engineers "
            "face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and "
            "analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, "
            "offering improvements in speed and ease of use. Spark provides libraries for various tasks such as "
            "data ingestion, processing, and analysis through its components like Spark SQL for structured data, "
            "Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
        "predictions": [
            "MLflow is an open-source platform that provides handy tools to manage Machine Learning workflow "
            "lifecycle in a simple way",
            "Spark is a popular open-source distributed computing system designed for big data processing and analytics.",
        ],
    }
)

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")

LLM evaluation metric types

There are two types of LLM evaluation metrics in MLflow:

  • Metrics that rely on SaaS models, like OpenAI, for scoring, such as mlflow.metrics.genai.answer_relevance. These metrics are created using mlflow.metrics.genai.make_genai_metric() (see the sketch after this list). For each data record, these metrics send a single prompt consisting of the following information to the SaaS model and extract the score from the model response.

    • Metric definition.

    • Metric grading criteria.

    • Reference examples.

    • Input data or context.

    • Model output.

    • [optional] Ground truth.

  • Function-based per-row metrics. These metrics calculate a score for each data record (row, in terms of a Pandas or Spark DataFrame) based on standard functions, like ROUGE (mlflow.metrics.rougeL) or Flesch-Kincaid (mlflow.metrics.flesch_kincaid_grade_level). These metrics are similar to traditional metrics.
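
As an illustration of the first category, the following sketch defines a custom SaaS-judged metric with mlflow.metrics.genai.make_genai_metric(). The metric name, definition, grading criteria, and reference example shown here are invented for illustration and are not built-in MLflow metrics.

from mlflow.metrics.genai import EvaluationExample, make_genai_metric

# A reference example that is included in the judge prompt for calibration.
example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing the ML lifecycle.",
    score=4,
    justification="The answer is accurate and brief.",
)

# The metric definition, grading criteria, and examples are assembled into one
# prompt per data record and sent to the judge model for scoring.
conciseness = make_genai_metric(
    name="conciseness",
    definition="Measures how brief and to the point the model output is.",
    grading_prompt="Score from 1 (very verbose) to 5 (very concise).",
    examples=[example],
    model="openai:/gpt-4",
    aggregations=["mean", "variance"],
    greater_is_better=True,
)

A metric created this way is passed to mlflow.evaluate() through the extra_metrics argument, as described in the following sections.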

Select metrics to evaluate your LLM

You can select which metrics to use to evaluate your model. The full reference for supported evaluation metrics can be found in the MLflow evaluate documentation.

You can either:

  • Use the default metrics that are pre-defined for your model type.

  • Use a custom list of metrics.

To use the default metrics for a pre-defined model type, specify the model_type argument in mlflow.evaluate, as shown in the example below:

results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)

The following table summarizes the supported LLM model types and their associated default metrics.

question-answering              text-summarization              text
exact-match                     ROUGE†                          toxicity*
toxicity*                       toxicity*                       ari_grade_level**
ari_grade_level**               ari_grade_level**               flesch_kincaid_grade_level**
flesch_kincaid_grade_level**    flesch_kincaid_grade_level**

* Requires package evaluate, torch, and transformers.

** Requires package textstat.

† Requires package evaluate, nltk, and rouge-score.

Use a custom list of metrics

You can specify a custom list of metrics in the extra_metrics argument in mlflow.evaluate.

To add metrics to the default metric list of a pre-defined model type, keep the model_type parameter and add your metrics to extra_metrics. The following evaluates your model using all default metrics for the question-answering model type plus mlflow.metrics.latency().

results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[mlflow.metrics.latency()],
)

To disable default metric calculation and only calculate your selected metrics, remove the model_type argument and define the desired metrics.

results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    extra_metrics=[mlflow.metrics.toxicity(), mlflow.metrics.latency()],
)

Metrics with LLM as a judge

You can also add pre-canned metrics that use an LLM as the judge to the extra_metrics argument in mlflow.evaluate(). For a list of these LLM-as-the-judge metrics, see Metrics with LLM as the judge.

import mlflow
import pandas as pd
from mlflow.metrics.genai import answer_relevance

answer_relevance_metric = answer_relevance(model="openai:/gpt-4")

eval_df = pd.DataFrame()  # Index(['inputs', 'predictions', 'context'], dtype='object')

eval_results = mlflow.evaluate(
    data=eval_df,  # evaluation data
    model_type="question-answering",
    predictions="predictions",  # prediction column name in eval_df
    extra_metrics=[answer_relevance_metric],
)

View evaluation results

mlflow.evaluate() returns the evaluation results as an mlflow.models.EvaluationResult instance.

To see the score on selected metrics, you can check the following attributes of the evaluation result:

  • metrics: This stores the aggregated results, such as the average or variance across the evaluation dataset. The following revisits the code example above and focuses on printing the aggregated results.

    with mlflow.start_run() as run:
        results = mlflow.evaluate(
            data=eval_data,
            targets="ground_truth",
            predictions="predictions",
            extra_metrics=[mlflow.metrics.genai.answer_similarity()],
            evaluators="default",
        )
        print(f"See aggregated evaluation results below: \n{results.metrics}")
    
  • tables['eval_results_table']: This stores the per-row evaluation results.

    with mlflow.start_run() as run:
        results = mlflow.evaluate(
            data=eval_data,
            targets="ground_truth",
            predictions="predictions",
            extra_metrics=[mlflow.metrics.genai.answer_similarity()],
            evaluators="default",
        )
        print(
            f"See per-data evaluation results below: \n{results.tables['eval_results_table']}"
        )
    

LLM evaluation with MLflow example notebook

The following LLM evaluation with MLflow example notebook is a use-case oriented example.

LLM evaluation with MLflow example notebook
