LangChain on Databricks for LLM development
Important
These are experimental features and the API definitions might change.
This article describes the LangChain integrations that facilitate the development and deployment of large language models (LLMs) on Databricks.
With these LangChain integrations you can:
Use Databricks-served models as LLMs or embeddings in your LangChain application.
Manage and track your LangChain models and performance in MLflow experiments.
Trace the development and production phases of your LangChain application with MLflow Tracing.
Seamlessly load data from a PySpark DataFrame with the PySpark DataFrame loader.
Interactively query your data using natural language with the Spark DataFrame Agent or Databricks SQL Agent.
What is LangChain?
LangChain is a software framework for building applications that use large language models (LLMs). Its strength lies in its wide array of integrations and capabilities, including API wrappers, web scraping subsystems, code analysis tools, and document summarization tools. It also supports LLMs from providers such as OpenAI, Anthropic, and Hugging Face out of the box, along with various data sources and types.
Leverage MLflow for LangChain development
LangChain is available as an MLflow flavor, which enables users to harness MLflow’s robust tools for experiment tracking and observability in both development and production environments directly within Databricks. For more details and guidance on using MLflow with LangChain, see the MLflow LangChain flavor documentation.
MLflow on Databricks offers additional features that distinguish it from the open-source version, enhancing your development experience with the following capabilities:
Fully managed MLflow Tracking Server: Instantly available within your Databricks workspace, allowing you to start tracking experiments without setup delays.
Seamless integration with Databricks Notebooks: Experiments are automatically linked to notebooks, streamlining the tracking process.
MLflow Tracing on Databricks: Provides production-level monitoring with inference table integration, ensuring end-to-end observability from development to production.
Model lifecycle management with Unity Catalog: Centralized control over access, auditing, lineage, and model discovery across your workspaces.
Together, these features help you develop, monitor, and manage your LangChain-based projects on Databricks.
Requirements
Databricks Runtime 13.3 ML or above.
Install the LangChain Databricks integration package and Databricks SQL connector. Databricks also recommends pip installing the latest version of LangChain to ensure you have the most recent updates.
%pip install --upgrade langchain-databricks langchain-community langchain databricks-sql-connector
Use Databricks-served models as LLMs or embeddings
This integration supports using model serving with a cluster driver proxy application for interactive development.
To wrap a cluster driver proxy application as an LLM in LangChain you need:
An LLM loaded on a Databricks interactive cluster in “single user” or “no isolation shared” mode.
A local HTTP server running on the driver node to serve the model at “/” using HTTP POST with JSON input/output.
An app that uses a port number between [3000, 8000] and listens to the driver IP address or simply 0.0.0.0 instead of localhost only.
The CAN ATTACH TO permission to the cluster.
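The server requirements above can be sketched with Python's standard library alone. This is a minimal illustration, not the setup from the LangChain documentation: generate is a hypothetical stand-in for the model loaded on the driver, port 7777 is an arbitrary choice within the allowed range, and the request and response shapes (a JSON object with a prompt field in, a JSON string out) are assumptions to check against the LangChain wrapper you use.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def generate(prompt: str) -> str:
    """Hypothetical stand-in for the model loaded on the driver."""
    return f"echo: {prompt}"


class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the full request body before deciding how to respond.
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length) or b"{}"
        # Serve the model at "/" only.
        if self.path != "/":
            self.send_error(404)
            return
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            self.send_error(400, "Request body must be JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "Request body must be a JSON object")
            return
        # Assumed schema: {"prompt": "..."} in, JSON string out.
        body = json.dumps(generate(payload.get("prompt", ""))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Suppress per-request logging in notebook output.


def start_server(host: str = "0.0.0.0", port: int = 7777) -> ThreadingHTTPServer:
    """Start the server in a daemon thread and return it.

    Binding to 0.0.0.0 rather than localhost makes the app reachable
    through the driver proxy; the port should fall in [3000, 8000].
    """
    server = ThreadingHTTPServer((host, port), ModelHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling start_server() in a notebook cell on the driver starts the server in a background thread, so the cell returns immediately. Call shutdown() on the returned server to stop it.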
See the LangChain documentation for Wrapping a cluster driver proxy app for an example.
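A model behind a Databricks serving endpoint can also be wrapped as a chat model or an embedding model. The sketch below assumes the langchain-databricks package from the Requirements section is installed and that the code runs where Databricks credentials are available. The helper names are hypothetical, and the embedding endpoint name databricks-bge-large-en is an assumption; substitute an endpoint that exists in your workspace.

```python
def build_chat_model(endpoint: str = "databricks-meta-llama-3-1-70b-instruct"):
    """Wrap a chat-capable serving endpoint as a LangChain chat model."""
    # Imported inside the helper so it can be defined before the
    # integration package is installed.
    from langchain_databricks import ChatDatabricks

    return ChatDatabricks(endpoint=endpoint, temperature=0.1, max_tokens=250)


def build_embeddings(endpoint: str = "databricks-bge-large-en"):
    """Wrap an embedding serving endpoint (the endpoint name is an assumption)."""
    from langchain_databricks import DatabricksEmbeddings

    return DatabricksEmbeddings(endpoint=endpoint)
```

In a notebook, build_chat_model().invoke("What is MLflow?") returns a message whose content attribute holds the reply, and build_embeddings().embed_query("hello") returns the embedding as a list of floats.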
Use Unity Catalog functions as tools
Note
The Unity Catalog function integration is in the langchain-community package. You must install it using %pip install langchain-community to access its functionality. This integration will migrate to the langchain-databricks package in an upcoming release.
You can expose SQL or Python functions in Unity Catalog as tools for your LangChain agent. For full guidance on creating Unity Catalog functions and using them in LangChain, see the Databricks UC Toolkit documentation.
Load data with the PySpark DataFrame loader
The PySpark DataFrame loader in LangChain simplifies loading data from a PySpark DataFrame with a single method.
from langchain_community.document_loaders import PySparkDataFrameLoader
loader = PySparkDataFrameLoader(spark, wikipedia_dataframe, page_content_column="text")
documents = loader.load()
The following notebook shows an example in which the PySpark DataFrame loader is used to create a retrieval-based chatbot that is logged with MLflow, which in turn allows the model to be interpreted as a generic Python function for inference with mlflow.pyfunc.load_model().
Spark DataFrame Agent
The Spark DataFrame Agent in LangChain allows interaction with a Spark DataFrame, optimized for question answering. LangChain’s Spark DataFrame Agent documentation provides a detailed example of how to create and use the Spark DataFrame Agent with a DataFrame.
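# In recent LangChain releases the Spark DataFrame Agent ships in the
# langchain-experimental package: %pip install langchain-experimental
from langchain_experimental.agents import create_spark_dataframe_agent
from langchain_community.llms import OpenAI  # requires the openai package and an OpenAI API key
df = spark.read.csv("/databricks-datasets/COVID/coronavirusdataset/Region.csv", header=True, inferSchema=True)
display(df)
# Newer langchain-experimental versions may also require allow_dangerous_code=True here.
agent = create_spark_dataframe_agent(llm=OpenAI(temperature=0), df=df, verbose=True)
...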
The following notebook demonstrates how to create and use the Spark DataFrame Agent to help you gain insights on your data.
Databricks SQL Agent
With the Databricks SQL Agent, any Databricks user can interact with a specified schema in Unity Catalog and generate insights on their data.
Important
The Databricks SQL Agent can only query tables, and does not create tables.
In the following example, the database instance is created with the SQLDatabase.from_databricks(catalog="...", schema="...") command, and the required tools and the agent are created by SQLDatabaseToolkit(db=db, llm=llm) and create_sql_agent(llm=llm, toolkit=toolkit, **kwargs), respectively.
from langchain_community.agent_toolkits import SQLDatabaseToolkit, create_sql_agent
from langchain_community.utilities import SQLDatabase
from langchain_databricks import ChatDatabricks
# Note: Databricks SQL connections eventually time out. We set pool_pre_ping: True to
# try to ensure connection health is checked before a SQL query is made
db = SQLDatabase.from_databricks(catalog="samples", schema="nyctaxi", engine_args={"pool_pre_ping": True})
llm = ChatDatabricks(
    endpoint="databricks-meta-llama-3-1-70b-instruct",
    temperature=0.1,
    max_tokens=250,
)
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)
agent.run("What is the longest trip distance and how long did it take?")
The following notebook demonstrates how to create and use the Databricks SQL Agent to help you better understand the data in your database.