LangChain on Databricks for LLM development

Important

These are experimental features and the API definitions might change.

This article describes the LangChain integrations that facilitate the development and deployment of large language models (LLMs) on Databricks.

With these LangChain integrations you can:

  • Load data from a PySpark DataFrame with the PySpark DataFrame loader.

  • Query the tables in a Unity Catalog schema using natural language with the Databricks SQL Agent.

  • Wrap your Databricks served model as a large language model (LLM) in LangChain.

What is LangChain?

LangChain is a software framework for building applications that use large language models (LLMs). Its strength lies in its wide array of integrations and capabilities, which include API wrappers, web scraping subsystems, code analysis tools, and document summarization tools. It also supports large language models from providers such as OpenAI, Anthropic, and HuggingFace out of the box, along with various data sources and types.

LangChain is available as an experimental MLflow flavor, which lets LangChain users apply the tooling and experiment tracking capabilities of MLflow directly from the Databricks environment. See the LangChain flavor MLflow documentation.

Requirements

  • Databricks Runtime 13.3 ML and above.

  • Databricks recommends installing the latest version of LangChain with pip so that you have the most recent updates.

    • %pip install --upgrade langchain

Load data with the PySpark DataFrame loader

The PySpark DataFrame loader in LangChain loads the rows of a PySpark DataFrame as LangChain documents with a single method call.

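from langchain.document_loaders import PySparkDataFrameLoader

# `spark` is the notebook's SparkSession; `wikipedia_dataframe` is an existing
# PySpark DataFrame whose "text" column holds the document text.
loader = PySparkDataFrameLoader(spark, wikipedia_dataframe, page_content_column="text")
documents = loader.load()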
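
To make the role of page_content_column concrete, the following is a minimal, library-free sketch of the row-to-document mapping. It is an illustration only, not LangChain's implementation: the Document class here is a stand-in, rows_to_documents is a hypothetical helper, the sample rows are made up, and the assumption is that the chosen column becomes the document text while the remaining columns are kept as metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in for a LangChain document: text plus arbitrary metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def rows_to_documents(rows: List[Dict[str, Any]], page_content_column: str) -> List[Document]:
    """Turn each row into one Document (hypothetical helper, for illustration)."""
    documents = []
    for row in rows:
        # Copy the row so the caller's dictionaries are left untouched.
        metadata = dict(row)
        # The chosen column becomes the document text ...
        text = metadata.pop(page_content_column)
        # ... and every remaining column is kept as metadata.
        documents.append(Document(page_content=str(text), metadata=metadata))
    return documents


# Made-up sample rows standing in for the rows of a DataFrame.
rows = [
    {"id": 1, "title": "Apache Spark", "text": "Spark is a distributed engine."},
    {"id": 2, "title": "MLflow", "text": "MLflow tracks experiments."},
]
documents = rows_to_documents(rows, page_content_column="text")
print(documents[0].page_content)
print(documents[0].metadata)
```

Under this assumption, a retriever built on the loaded documents searches over the text column, while columns such as a title or an ID remain available for filtering or for citing sources.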

The following notebook uses the PySpark DataFrame loader to create a retrieval-based chatbot that is logged with MLflow. Logging the chatbot with MLflow allows the model to be loaded as a generic Python function for inference with mlflow.pyfunc.load_model().

PySpark DataFrame loader and MLflow in LangChain notebook

Databricks SQL Agent

The Databricks SQL Agent is a variant of the standard SQL Database Agent that LangChain provides.

With the Databricks SQL Agent, any Databricks user can interact with a specified schema in Unity Catalog and generate insights on their data.

Important

The Databricks SQL Agent can only query tables; it cannot create tables.

In the following example, the database instance is created by the SQLDatabase.from_databricks(catalog="...", schema="...") command. The required tools and the agent are then created by SQLDatabaseToolkit(db=db, llm=llm) and create_sql_agent(llm=llm, toolkit=toolkit, **kwargs), respectively.

from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain import OpenAI

# Point the agent at the nyctaxi schema in the samples catalog.
db = SQLDatabase.from_databricks(catalog="samples", schema="nyctaxi")
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0.7)
# The toolkit supplies the SQL tools; the agent uses them to answer questions.
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent.run("What is the longest trip distance and how long did it take?")

Note

OpenAI models require a paid subscription if the free subscription hits a rate limit.

The following notebook demonstrates how to create and use the Databricks SQL Agent to help you better understand the data in your database.

Use LangChain to interact with a SQL database notebook

Wrap Databricks served models as LLMs

If you have an LLM that you created on Databricks, you can use it directly within LangChain in place of OpenAI, HuggingFace, or any other LLM provider.

This integration supports serving the model through a cluster driver proxy application, which is recommended for interactive development.

To wrap a cluster driver proxy application as an LLM in LangChain, you need:

  • An LLM loaded on a Databricks interactive cluster in “single user” or “no isolation shared” mode.

  • A local HTTP server running on the driver node to serve the model at “/” using HTTP POST with JSON input/output.

  • An app that uses a port number in the range [3000, 8000] and listens on the driver IP address, or simply on 0.0.0.0, rather than on localhost only.

  • The CAN ATTACH TO permission on the cluster.
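
The server requirements above can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the setup from the LangChain documentation: fake_llm is a hypothetical placeholder for the model loaded on the cluster, the request shape {"prompt": "..."} and the JSON-string response are assumed for illustration, and port 7777 is an arbitrary choice within the required range.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_llm(prompt: str) -> str:
    """Hypothetical placeholder; replace with a call to the loaded model."""
    return f"echo: {prompt}"


def handle_body(raw: bytes) -> bytes:
    """Decode the JSON request, call the model, and encode a JSON response."""
    payload = json.loads(raw or b"{}")
    # Assumed request shape: {"prompt": "..."}; the real shape may differ.
    completion = fake_llm(payload.get("prompt", ""))
    return json.dumps(completion).encode("utf-8")


class LLMHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Serve the model only at "/", using HTTP POST with JSON input/output.
        if self.path != "/":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = handle_body(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging


def build_server(host: str = "0.0.0.0", port: int = 7777) -> HTTPServer:
    """Create the server; 0.0.0.0 listens on all interfaces, not only localhost."""
    if not 3000 <= port <= 8000:
        raise ValueError("port must be in the range [3000, 8000]")
    return HTTPServer((host, port), LLMHandler)


# On the driver node you would start the server with (this call blocks):
# build_server().serve_forever()
```

Keeping the request handling in handle_body, separate from the HTTP plumbing, makes it possible to check the JSON contract without starting a server. Confirm the payload format your LangChain version actually sends before relying on this shape.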

See the LangChain documentation for Wrapping a cluster driver proxy app for an example.