MLflow Tracing for agents

Important

This feature is in Public Preview.

This article describes MLflow Tracing and the scenarios where it helps you evaluate generative AI applications in your AI system.

In software development, tracing involves recording sequences of events like user sessions or request flows. In the context of AI systems, tracing refers to recording the interactions between your application and an AI system. An example trace of an AI system might instrument the inputs and parameters at each step of a RAG application: the user message and prompt, the vector lookup, and the call to the generative AI model.

What is MLflow Tracing?

Using MLflow Tracing, you can log, analyze, and compare traces across different versions of generative AI applications. It lets you debug your generative AI Python code and keep track of inputs and responses, which can help you discover conditions or parameters that contribute to poor performance of your application. MLflow Tracing is tightly integrated with Databricks tools and infrastructure, allowing you to store and display all your traces in Databricks notebooks or the MLflow experiment UI as you run your code.

When you develop AI systems on Databricks using libraries such as LangChain, LlamaIndex, OpenAI, or custom PyFunc, MLflow Tracing lets you see all the events and intermediate outputs from each step of your agent. You can easily see the prompts, which models and retrievers were used, which documents were retrieved to augment the response, how long each step took, and the final output. For example, if your model hallucinates, you can quickly inspect each step that led to the hallucination.

Why use MLflow Tracing?

MLflow Tracing provides several benefits to help you track your development workflow. For example, you can:

  • Diagnose issues in development with an interactive trace visualization and investigation tool.

  • Verify that prompt templates and guardrails are producing reasonable results.

  • Explore and minimize the latency impact of different frameworks, models, chunk sizes, and software development practices.

  • Measure application costs by tracking token usage by different models.

  • Establish benchmark (“golden”) datasets to evaluate the performance of different versions.

Install MLflow Tracing

MLflow Tracing is available in MLflow versions 2.13.0 and above.

%pip install mlflow>=2.13.0 -qqqU
%restart_python

Alternatively, you can %pip install databricks-agents to install the latest version of databricks-agents that includes a compatible MLflow version.
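
For example:

%pip install databricks-agents -qqqU
%restart_python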

Use MLflow Tracing in development

MLflow Tracing helps you analyze performance issues and accelerate the agent development cycle. The following sections assume you are developing your agent and using MLflow Tracing from a notebook.

Note

In notebook environments, MLflow Tracing can add up to a few seconds of overhead to the agent's run time.

Note

As of Databricks Runtime 15.4 LTS ML, MLflow Tracing is enabled by default within notebooks. To disable tracing, for example with LangChain, execute mlflow.langchain.autolog(log_traces=False) in your notebook.
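
For example, a minimal sketch that toggles LangChain autologging in a notebook cell, assuming the MLflow LangChain integration is installed:

import mlflow

# Explicitly enable trace autologging for LangChain. This is the default
# in Databricks Runtime 15.4 LTS ML and above.
mlflow.langchain.autolog()

# Turn off trace collection while keeping other autologging behavior.
mlflow.langchain.autolog(log_traces=False)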

Add traces to your agent

MLflow Tracing provides three ways to add traces to your generative AI application. See Add traces to your agents for examples of using these methods. For API reference details, see the MLflow documentation.

API: MLflow autologging
Recommended use case: Development with integrated GenAI libraries
Description: Autologging automatically instruments traces for popular open source frameworks such as LangChain, LlamaIndex, and OpenAI. When you add mlflow.<library>.autolog() at the start of your notebook, MLflow automatically records traces for each step of your agent execution.

API: Fluent APIs
Recommended use case: Custom agents built with PyFunc
Description: Low-code APIs for instrumenting AI systems without worrying about the tree structure of the trace. MLflow determines the appropriate parent-child span structure based on the Python stack.

API: MLflow Client APIs
Recommended use case: Advanced use cases that require more control, such as multi-threaded applications or callback-based instrumentation
Description: MlflowClient implements more granular, thread-safe APIs. These APIs do not manage the parent-child relationships of spans, so you must specify them manually to construct the desired trace structure. This requires more code but gives you finer control over the trace lifecycle.
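
For illustration, here is a minimal sketch of the Fluent APIs; the retrieve, generate, and run_agent functions are hypothetical placeholders for your own agent logic:

import mlflow

@mlflow.trace(span_type="RETRIEVER")
def retrieve(question):
    # Placeholder for a vector store lookup.
    return ["snippet one", "snippet two"]

@mlflow.trace(span_type="LLM")
def generate(question, docs):
    # Placeholder for a model call.
    return f"Answer based on {len(docs)} retrieved documents."

@mlflow.trace
def run_agent(question):
    # MLflow nests the retriever and LLM spans under this root span
    # automatically, based on the Python call stack.
    docs = retrieve(question)
    return generate(question, docs)

run_agent("What is MLflow Tracing?")

By contrast, a sketch of the same structure with the MLflow Client APIs manages span relationships explicitly; the span names here are illustrative:

from mlflow import MlflowClient

client = MlflowClient()

# Start the trace; this creates the root span.
root = client.start_trace(name="agent")

# Create a child span, linking it to the root explicitly.
child = client.start_span(
    name="retrieval",
    request_id=root.request_id,
    parent_id=root.span_id,
    inputs={"question": "What is MLflow Tracing?"},
)
client.end_span(
    request_id=root.request_id,
    span_id=child.span_id,
    outputs={"docs": ["snippet one", "snippet two"]},
)

# End the trace by closing the root span.
client.end_trace(request_id=root.request_id, outputs={"answer": "..."})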

Review traces

After you run the instrumented agent, you can review the generated traces in different ways:

  • The trace visualization is rendered inline in the cell output.

  • The traces are logged to your MLflow experiment. You can review the full list of historical traces and search them in the Traces tab of the Experiment page. When the agent runs under an active MLflow Run, you can also find its traces on the Run page.

  • Programmatically retrieve traces using the mlflow.search_traces() API, as shown in the sketch below.
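
For example, a minimal sketch that retrieves recent traces as a pandas DataFrame, assuming a recent MLflow version:

import mlflow

# Fetch the five most recent traces from the current experiment.
traces = mlflow.search_traces(max_results=5)
print(traces.head())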

Limitations

  • MLflow Tracing is available only in Databricks notebooks and notebook jobs.

  • LangChain autologging may not support all LangChain prediction APIs. Please refer to the MLflow documentation for the full list of supported APIs.