How is variant different than JSON strings?

Preview

This article describes the behavior changes and differences in syntax and semantics when working with the variant data type. This article assumes that you are familiar with working with JSON string data on Databricks. For users new to Databricks, you should use variant over JSON strings whenever storing semi-structured data that requires flexibility for changing or unknown schema. See Model semi-structured data.

In Databricks Runtime 15.3 and above, you can use the variant data type to encode and query semi-structured data. Databricks recommends variant as a replacement for storing semi-structured data using JSON strings. The improved read and write performance for variant allows it to replace native Spark complex types such as structs and arrays in some use cases.

How do you query variant data?

Variant data uses the same operators to query fields, subfields, and array elements.

To query a field, use : . For example, column_name:field_name.

To query a subfield, use . . For example, column_name:field_name.subfield_name.

To query an array element, use [n] where n is the integer index value of the element. For example, to query the first value in an array, column_name:array_name[0].

The following differences might break existing queries when upgrading from JSON strings to variant:

All variant path elements are matched in a case-sensitive way. JSON strings are case-insensitive. This means that for variant, column_name:FIELD_NAME and column_name:field_name look for different fields in the stored data.
The [*] syntax is not support for identifying or unpacking all elements in an array.
Variant encodes NULL values differently than JSON strings. See Variant null rules.
Variant columns have limitations for some operations. See Limitations.

Convert JSON strings to and from variant

In Databricks Runtime 15.3 and above, the to_json function has additional functionality to cast VARIANT types to JSON strings. Options are ignored when converting VARIANT to JSON string. See to_json.

The parse_json function transforms a JSON string to VARIANT type. While parse_json(json_string_column) is the logical inverse of to_json(variant_column), the following conversion rules describe why it is not the exact inverse:

Whitespace is not perfectly preserved.
Ordering of keys is arbitrary.
Trailing zeros in numbers might be truncated.

The parse_json function returns an error if the JSON string is malformed or exceeds the variant size limit. Use the try_parse_json function to instead return a NULL when an error in parsing occurs.

What are the SQL functions for working with variants?

Apache Spark SQL functions available in Databricks Runtime 15.3 and above provide methods for interacting with variant data. The following table includes the new function, the corresponding JSON string function, and notes on differences in behavior.

Note

To use these functions with PySpark DataFrames, import them from pyspark.sql.functions. variant_explode and variant_explode_outer are not supported in PySpark.

Variant function	JSON string function	Notes
variant_get	cast and get_json_object	Takes an expression, path, and type. Follows all rules for variants paths, casting, and nulls.
try_variant_get	try_cast and get_json_object	Takes an expression, path, and type. Follows all rules for variants paths, casting, and nulls.
is_variant_null	is null	Checks whether the expression is storing a `VARIANT` encoded `NULL`. Use `is null` to check if the input expression is `NULL`.
schema_of_variant	schema_of_json	When determining the schema for an `ARRAY<elementType>`, the `elementType` might be inferred as `VARIANT` if there are conflicting types found in the data.
schema_of_variant_agg	schema_of_json_agg	When no least common type is identified, the type is derived as `VARIANT`.
variant_explode	explode	Outputs `pos`, `key`, and `value` columns. When exploding an array, the output key is always null.
variant_explode_outer	explode_outer	Outputs `pos`, `key`, and `value` columns. When exploding an array, the output key is always null.

Variants handle casting and NULLs differently than JSON strings. See Variant type casting rules and Variant null rules.