User-defined functions in Databricks Connect for Python

Note

This article covers Databricks Connect for Databricks Runtime 13.1 and above.

This article describes how to execute UDFs with Databricks Connect for Python. Databricks Connect enables you to connect popular IDEs, notebook servers, and custom applications to Databricks clusters. For the Scala version of this article, see User-defined functions in Databricks Connect for Scala.

Note

Before you begin to use Databricks Connect, you must set up the Databricks Connect client.

Databricks Connect for Python supports user-defined functions (UDFs). When a DataFrame operation that includes UDFs is executed, Databricks Connect serializes those UDFs and sends them to the server as part of the request.

Note

Because the user-defined function is serialized on the client and deserialized on the cluster, the Python version used by the client must match the Python version on the Databricks cluster. To check the cluster’s Python version, see the “System Environment” section for the cluster’s Databricks Runtime version in Databricks Runtime release notes versions and compatibility.
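The version requirement above can be checked on the client before any UDF is sent. The following is a minimal sketch, not part of Databricks Connect: the helper name `python_minor_matches` and the constant `CLUSTER_PYTHON` are hypothetical, and the value "3.10" is a placeholder for whatever version the release notes list for your cluster. The sketch compares only the major and minor components, which is an assumption about how strict the match needs to be.

```python
import sys


def python_minor_matches(cluster_version, local=None):
    """Return True when the major.minor of the local interpreter equals the cluster's.

    cluster_version is a string such as "3.10" or "3.10.12".
    local is an optional (major, minor) tuple; it defaults to the running interpreter.
    """
    if local is None:
        local = sys.version_info[:2]  # (major, minor) of the running interpreter
    major, minor = cluster_version.split(".")[:2]
    return (int(major), int(minor)) == tuple(local)[:2]


# Hypothetical placeholder: replace with the Python version listed under
# "System Environment" for your cluster's Databricks Runtime version.
CLUSTER_PYTHON = "3.10"

if not python_minor_matches(CLUSTER_PYTHON):
    local_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Local Python {local_version} differs from cluster Python {CLUSTER_PYTHON}")
```

Running this check at the start of a client program surfaces a mismatch early, rather than as a failure when the UDF is deserialized on the cluster.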

The following Python program sets up a simple UDF that squares values in a column.

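from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from databricks.connect import DatabricksSession


# Define a UDF that returns the square of its input as an integer.
@udf(returnType=IntegerType())
def square(x):
    return x * x


# Connect to the cluster using the configured Databricks Connect client.
spark = DatabricksSession.builder.getOrCreate()

# spark.range(1, 2) produces a single row with id = 1.
df = spark.range(1, 2)
df = df.withColumn("squared", square(col("id")))

df.show()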
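Because a UDF body is ordinary Python, one possible arrangement is to keep the logic in a plain function and wrap it with `udf` separately, so the logic can be unit tested on the client without a cluster. The following is a sketch under that assumption; the names `square_value` and `add_squared_column` are hypothetical, and the `None` check is an added safeguard for null column values that the example above does not need, since `spark.range` produces no nulls.

```python
from typing import Optional


def square_value(x: Optional[int]) -> Optional[int]:
    """Plain Python logic: return x squared, passing None (SQL NULL) through."""
    if x is None:
        return None
    return x * x


def add_squared_column(df):
    """Wrap square_value as a UDF and append a "squared" column to df.

    Expects a DataFrame with an integer "id" column, such as spark.range(...).
    """
    # Imported here so that square_value can be tested without PySpark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    square_udf = udf(square_value, IntegerType())
    return df.withColumn("squared", square_udf(col("id")))
```

With a session created as in the program above, `add_squared_column(spark.range(1, 5)).show()` would apply the UDF on the cluster, while `square_value` itself can be called directly in local tests.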