Testing for Databricks Connect for Python

Note

This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.

This article describes how to run tests using pytest for Databricks Connect for Databricks Runtime 13.3 LTS and above. For information about installing Databricks Connect, see Install Databricks Connect for Python.

You can run pytest on local code that does not need a connection to a cluster in a remote Databricks workspace. For example, you might use pytest to test your functions that accept and return PySpark DataFrame objects in local memory. To get started with pytest and run it locally, see Get Started in the pytest documentation.

Note

When you run Databricks Connect from the terminal, pytest works only with the DEFAULT configuration profile. The profile should specify the Databricks compute you want to use, either a cluster or serverless compute. For information about configuring compute, see Compute configuration for Databricks Connect.
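
Before running the tests, it can help to confirm that the DEFAULT profile actually names some compute. The following is a minimal sketch, not part of Databricks Connect: the helper name default_profile_compute is hypothetical, and it assumes the profile lives in ~/.databrickscfg and that compute is declared through a cluster_id or serverless_compute_id field. Check the compute configuration article for the fields that apply to your setup.

```python
import configparser
from pathlib import Path

# Assumption: these are the fields that select compute in a configuration profile.
COMPUTE_KEYS = ("cluster_id", "serverless_compute_id")

def default_profile_compute(cfg_path=None):
  """Return the compute-related settings found in the DEFAULT profile."""
  if cfg_path is None:
    # Assumption: the configuration file is in its usual home-directory location.
    cfg_path = Path.home() / ".databrickscfg"
  parser = configparser.ConfigParser(interpolation=None)
  parser.read(cfg_path)  # a missing file is silently ignored
  # configparser exposes the [DEFAULT] section through defaults().
  defaults = parser.defaults()
  return {key: defaults[key] for key in COMPUTE_KEYS if key in defaults}
```

An empty dictionary means the DEFAULT profile is missing or does not name any compute under these assumed field names, in which case tests that need a remote session are unlikely to connect.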

For example, the following file named nyctaxi_functions.py contains a get_spark function that returns a SparkSession instance and a get_nyctaxi_trips function that returns a DataFrame representing the trips table in the samples catalog’s nyctaxi schema:

nyctaxi_functions.py:

Python
from databricks.connect import DatabricksSession
from pyspark.sql import DataFrame, SparkSession

def get_spark() -> SparkSession:
  spark = DatabricksSession.builder.getOrCreate()
  return spark

def get_nyctaxi_trips() -> DataFrame:
  spark = get_spark()
  df = spark.read.table("samples.nyctaxi.trips")
  return df

The following file named main.py calls get_nyctaxi_trips, which in turn calls get_spark, and displays the first five rows of the result:

main.py:

Python
from nyctaxi_functions import get_nyctaxi_trips

df = get_nyctaxi_trips()
df.show(5)

The following file named test_nyctaxi_functions.py tests whether the get_spark function returns a SparkSession instance and whether the get_nyctaxi_trips function returns a DataFrame that contains at least one row of data:

test_nyctaxi_functions.py:

Python
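import pyspark.sql.connect.session
from nyctaxi_functions import get_nyctaxi_trips, get_spark

def test_get_spark():
  # The session returned by Databricks Connect is a Spark Connect session.
  spark = get_spark()
  assert isinstance(spark, pyspark.sql.connect.session.SparkSession)

def test_get_nyctaxi_trips():
  # count() runs on the remote compute, so this test needs a reachable workspace.
  df = get_nyctaxi_trips()
  assert df.count() > 0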

To run these tests, run the pytest command from the code project’s root, which should produce test results similar to the following:

Bash
$ pytest
=================== test session starts ====================
platform darwin -- Python 3.11.7, pytest-8.1.1, pluggy-1.4.0
rootdir: <project-rootdir>
collected 2 items

test_nyctaxi_functions.py .. [100%]
======================== 2 passed ==========================
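
Both tests above need a live connection to Databricks compute. For logic that only needs to be checked for how it uses the session, one option is to pass the session in as an argument and substitute a mock in the test. The following is a minimal sketch under that assumption: get_trips is a hypothetical variant of get_nyctaxi_trips that accepts the session as a parameter, and the file name test_get_trips_offline.py is illustrative only.

test_get_trips_offline.py:

```python
from unittest.mock import MagicMock

def get_trips(spark, table_name="samples.nyctaxi.trips"):
  """Hypothetical variant of get_nyctaxi_trips that takes the session as an argument."""
  return spark.read.table(table_name)

def test_get_trips_reads_expected_table():
  # A MagicMock stands in for the SparkSession, so no cluster is contacted.
  fake_spark = MagicMock()
  df = get_trips(fake_spark)
  # Verify that the function asked the session for the expected table...
  fake_spark.read.table.assert_called_once_with("samples.nyctaxi.trips")
  # ...and returned whatever the session produced.
  assert df is fake_spark.read.table.return_value
```

A test like this checks only the call made on the session, not the data returned, so it complements the connected tests rather than replacing them.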