What are user-defined functions (UDFs)?

A user-defined function (UDF) is a function defined by a user, allowing custom logic to be reused in the user environment. Databricks has support for many different types of UDFs to allow for distributing extensible logic. This article introduces some of the general strengths and limitations of UDFs.

See the following articles for more information on UDFs:

When is custom logic not a UDF?

Not all custom functions are UDFs in the strict sense. You can safely define a series of Spark built-in methods using SQL or Spark DataFrames and get fully optimized behavior. For example, the following SQL and Python functions combine Spark built-in methods to define a unit conversion as a reusable function:

CREATE FUNCTION convert_f_to_c(unit STRING, temp DOUBLE)
RETURNS DOUBLE
RETURN CASE
  WHEN unit = "F" THEN (temp - 32) * (5/9)
  ELSE temp
END;

SELECT convert_f_to_c(unit, temp) AS c_temp
FROM tv_temp;
def convertFtoC(unitCol, tempCol):
  from pyspark.sql.functions import when
  return when(unitCol == "F", (tempCol - 32) * (5/9)).otherwise(tempCol)

from pyspark.sql.functions import col

df_query = df.select(convertFtoC(col("unit"), col("temp"))).toDF("c_temp")
display(df_query)

To run the above UDFs, you can create example data.

Which UDFs are most efficient?

UDFs might introduce significant processing bottlenecks into code execution. Databricks uses a number of different optimizers automatically for code written with included Apache Spark, SQL, and Delta Lake syntax. When custom logic is introduced by UDFs, these optimizers do not have the ability to efficiently plan tasks around this custom logic. In addition, logic that executes outside the JVM has additional costs around data serialization.

Some UDFs are more efficient than others. In terms of performance:

  • Built in functions will be fastest because of Databricks optimizers.

  • Code that executes in the JVM (Scala, Java, Hive UDFs) will be faster than Python UDFs.

  • Pandas UDFs use Arrow to reduce serialization costs associated with Python UDFs.

  • Python UDFs should generally be avoided, but Python can be used as glue code without any degradation in performance.

Type

Optimized

Execution environment

Hive UDF

No

JVM

Python UDF

No

Python

Pandas UDF

No

Python (Arrow)

Scala UDF

No

JVM

Spark SQL

Yes

JVM

Spark DataFrame

Yes

JVM

When should you use a UDF?

A major benefit of UDFs is that they allow users to express logic in familiar languages, reducing the human cost associated with refactoring code. For ad hoc queries, manual data cleansing, exploratory data analysis, and most operations on small or medium-sized datasets, latency overhead costs associated with UDFs are unlikely to outweigh costs associated with refactoring code.

For ETL jobs, streaming operations, operations on very large datasets, or other workloads that are executed regularly or continuously, refactoring logic to use native Apache Spark methods quickly pays dividends.

Example data for example UDFs

The code examples in this article use UDFs to convert temperatures between Celcius and Farenheit. If you wish to execute these functions, you can create a sample dataset with the following Python code:

import numpy as np
import pandas as pd

Fdf = pd.DataFrame(np.random.normal(55, 25, 10000000), columns=["temp"])
Fdf["unit"] = "F"

Cdf = pd.DataFrame(np.random.normal(10, 10, 10000000), columns=["temp"])
Cdf["unit"] = "C"

df = spark.createDataFrame(pd.concat([Fdf, Cdf]).sample(frac=1))

df.cache().count()
df.createOrReplaceTempView("tv_temp")