Evaluate large language models with MLflow
This article describes how to use MLflow LLM Evaluate, MLflow’s large language model (LLM) evaluation functionality packaged in mlflow.evaluate. This article also describes what is needed to evaluate your LLM and what evaluation metrics are supported.
What is MLflow LLM Evaluate?
Evaluating LLM performance is slightly different from evaluating traditional ML models, because very often there is no single ground truth to compare against. MLflow provides an API, mlflow.evaluate(), to help evaluate your LLMs.
MLflow’s LLM evaluation functionality consists of three main components:
A model to evaluate: It can be an MLflow pyfunc model, a DataFrame with a predictions column, a URI that points to one registered MLflow model, or any Python callable that represents your model, such as a HuggingFace text summarization pipeline.
Metrics: The metrics to compute. LLM Evaluate uses LLM metrics.
Evaluation data: The data your model is evaluated on. It can be a Pandas DataFrame, a Python list, a numpy array, or an mlflow.data.dataset.Dataset instance.
Requirements
MLflow 2.8 and above.
To evaluate your LLM with mlflow.evaluate(), your LLM has to be one of the following:
A mlflow.pyfunc.PyFuncModel instance or a URI pointing to a logged mlflow.pyfunc.PyFuncModel model.
A custom Python function that takes in string inputs and outputs a single string. Your callable must match the signature of mlflow.pyfunc.PyFuncModel.predict without a params argument. The function should:
Have data as the only argument, which can be a pandas.DataFrame, numpy.ndarray, Python list, dictionary, or scipy matrix.
Return one of the following: pandas.DataFrame, pandas.Series, numpy.ndarray, or list.
A static dataset.
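The contract for a custom function can be sketched with a toy callable. This is a minimal illustration rather than a real LLM: the name shout_model is hypothetical, and the "inputs" key is an assumption that follows the evaluation data used elsewhere in this article.

```python
def shout_model(data):
    """Toy stand-in for an LLM: return each input string upper-cased."""
    # `data` is the only argument. Tabular inputs (a pandas DataFrame or a
    # dictionary) are assumed to keep the questions under an "inputs" key;
    # anything else is treated as a plain sequence of strings.
    if isinstance(data, dict) or hasattr(data, "columns"):
        rows = data["inputs"]
    else:
        rows = data
    # Return a list with one output string per input record.
    return [str(text).upper() for text in rows]


print(shout_model(["What is MLflow?", "What is Spark?"]))
```

A function of this shape can be passed to mlflow.evaluate() in place of a logged model.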
Evaluate with an MLflow model
You can evaluate your LLM as an MLflow model. For detailed instructions on how to convert your model into a mlflow.pyfunc.PyFuncModel instance, see how to Create a custom pyfunc model.
To evaluate your model as an MLflow model, Databricks recommends following these steps:
Note
To successfully log a model targeting Azure OpenAI Service, you must specify the following environment variables for authentication and functionality. See the OpenAI with MLflow documentation for more details.
import os

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"
os.environ["OPENAI_API_BASE"] = "https://<>.<>.<>.com/"
os.environ["OPENAI_DEPLOYMENT_NAME"] = "deployment-name"
Package your LLM as an MLflow model and log it to the MLflow server using log_model. Each flavor (openai, pytorch, …) has its own log_model API, such as mlflow.openai.log_model():
import mlflow
import openai

with mlflow.start_run():
    system_prompt = "Answer the following question in two sentences"
    # Wrap "gpt-3.5-turbo" as an MLflow model.
    logged_model_info = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
Use the URI of the logged model as the model instance in mlflow.evaluate():
results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)
Evaluate with a custom function
In MLflow 2.8.0 and above, mlflow.evaluate() supports evaluating a Python function without requiring the model to be logged to MLflow. This is useful when you don’t want to log the model and just want to evaluate it. The following example uses mlflow.evaluate() to evaluate a function.
You also need to set up OpenAI authentication to run the following code:
import mlflow
import openai
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, offering improvements in speed and ease of use. Spark provides libraries for various tasks such as data ingestion, processing, and analysis through its components like Spark SQL for structured data, Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
    }
)
def openai_qa(inputs):
    answers = []
    system_prompt = "Please answer the following question in formal language."
    for index, row in inputs.iterrows():
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_prompt},
                # Send the question text from the "inputs" column.
                {"role": "user", "content": row["inputs"]},
            ],
        )
        answers.append(completion.choices[0].message.content)
    return answers
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        openai_qa,
        eval_data,
        model_type="question-answering",
    )
Evaluate with a static dataset
For MLflow 2.8.0 and above, mlflow.evaluate() supports evaluating a static dataset without specifying a model. This is useful when you save the model output to a column in a Pandas DataFrame or an MLflow PandasDataset, and want to evaluate the static dataset without re-running the model.
Set model=None, and put model outputs in the data argument. This configuration is only applicable when the data is a Pandas DataFrame.
If you are using a Pandas DataFrame, you must specify the column name that contains the model output using the top-level predictions parameter in mlflow.evaluate():
import mlflow
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. "
            "It was developed by Databricks, a company that specializes in big data and machine learning solutions. "
            "MLflow is designed to address the challenges that data scientists and machine learning engineers "
            "face when developing, training, and deploying machine learning models.",
            "Apache Spark is an open-source, distributed computing system designed for big data processing and "
            "analytics. It was developed in response to limitations of the Hadoop MapReduce computing model, "
            "offering improvements in speed and ease of use. Spark provides libraries for various tasks such as "
            "data ingestion, processing, and analysis through its components like Spark SQL for structured data, "
            "Spark Streaming for real-time data processing, and MLlib for machine learning tasks",
        ],
        "predictions": [
            "MLflow is an open-source platform that provides handy tools to manage Machine Learning workflow "
            "lifecycle in a simple way",
            "Spark is a popular open-source distributed computing system designed for big data processing and analytics.",
        ],
    }
)

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")
LLM evaluation metric types
There are two types of LLM evaluation metrics in MLflow:
Metrics that rely on SaaS models, like OpenAI, for scoring, such as mlflow.metrics.genai.answer_relevance. These metrics are created using mlflow.metrics.genai.make_genai_metric(). For each data record, these metrics send one prompt consisting of the following information to the SaaS model, and extract the score from the model response.
Metrics definition.
Metrics grading criteria.
Reference examples.
Input data or context.
Model output.
[optional] Ground truth.
Function-based per-row metrics. These metrics calculate a score for each data record (row in terms of Pandas or Spark DataFrame), based on certain functions, like Rouge, mlflow.metrics.rougeL, or Flesch Kincaid, mlflow.metrics.flesch_kincaid_grade_level. These metrics are similar to traditional metrics.
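The idea behind function-based per-row metrics can be sketched in plain Python: score each record with a function, then aggregate the scores across the dataset. This only illustrates the pattern and is not MLflow's implementation. The function word_count is a hypothetical stand-in for a real scoring function such as ROUGE, and the two prediction strings are illustrative.

```python
from statistics import mean, pvariance


def word_count(prediction: str) -> int:
    """Hypothetical per-row score: number of whitespace-separated words."""
    return len(prediction.split())


# Illustrative model outputs, one per data record.
predictions = [
    "MLflow is an open-source platform for managing the ML lifecycle.",
    "Spark is a distributed computing system for big data processing.",
]

# Per-row scores: one value for each record.
scores = [word_count(p) for p in predictions]

# Aggregated results across the whole dataset.
aggregates = {"mean": mean(scores), "variance": pvariance(scores)}

print(scores)
print(aggregates)
```

The per-row scores correspond to what appears in the evaluation table, and the aggregates correspond to the summary values reported for the run.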
Select metrics to evaluate your LLM
You can select which metrics to use to evaluate your model. The full reference for supported evaluation metrics can be found in the MLflow evaluate documentation.
You can either:
Use the default metrics that are pre-defined for your model type.
Use a custom list of metrics.
To use the default metrics for pre-selected tasks, specify the model_type argument in mlflow.evaluate, as shown in the following example:
results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
)
The table summarizes the supported LLM model types and associated default metrics.
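question-answering | text-summarization | text |
---|---|---|
exact-match | ROUGE† | toxicity* |
toxicity* | toxicity* | ari_grade_level** |
ari_grade_level** | ari_grade_level** | flesch_kincaid_grade_level** |
flesch_kincaid_grade_level** | flesch_kincaid_grade_level** | |
* Requires the evaluate, torch, and transformers packages.
** Requires the textstat package.
† Requires the evaluate, nltk, and rouge-score packages.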
Use a custom list of metrics
You can specify a custom list of metrics in the extra_metrics argument in mlflow.evaluate.
To add metrics to the default metrics list of a pre-defined model type, keep the model_type parameter and add your metrics to extra_metrics. The following example evaluates your model using all default metrics for the question-answering model type plus mlflow.metrics.latency().
results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[mlflow.metrics.latency()],
)
To disable default metric calculation and only calculate your selected metrics, remove the model_type argument and define the desired metrics.
results = mlflow.evaluate(
    model,
    eval_data,
    targets="ground_truth",
    extra_metrics=[mlflow.metrics.toxicity(), mlflow.metrics.latency()],
)
Metrics with LLM as a judge
You can also add pre-canned metrics that use LLM as the judge to the extra_metrics argument in mlflow.evaluate(). For a list of these LLM as the judge metrics, see Metrics with LLM as the judge.
Note
You can also Create custom LLM as the judge and heuristic based evaluation metrics.
import mlflow
import pandas as pd
from mlflow.metrics.genai import answer_relevance

answer_relevance_metric = answer_relevance(model="openai:/gpt-4")

# Placeholder: replace with a DataFrame that has the columns
# Index(['inputs', 'predictions', 'context'], dtype='object')
eval_df = pd.DataFrame()

eval_results = mlflow.evaluate(
    data=eval_df,  # evaluation data
    model_type="question-answering",
    predictions="predictions",  # prediction column name from eval_df
    extra_metrics=[answer_relevance_metric],
)
View evaluation results
mlflow.evaluate() returns the evaluation results as an mlflow.models.EvaluationResult instance.
To see the score on selected metrics, you can check the following attributes of the evaluation result:
metrics: This stores the aggregated results, such as the average or variance across the evaluation dataset. The following takes a second pass on the code example above and focuses on printing out the aggregated results.
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")
tables['eval_results_table']: This stores the per-row evaluation results.
with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators="default",
    )
    print(
        f"See per-data evaluation results below: \n{results.tables['eval_results_table']}"
    )