`from_xml` function

Applies to: check marked yes Databricks SQL Databricks Runtime 14.1 and above

Preview

Returns a struct value parsed from the xmlStr using schema.

Syntax

from_xml(xmlStr, schema [, options])

Arguments

xmlStr: A STRING expression specifying a single XML record
schema: A STRING expression or invocation of schema_of_xml function.
options: An optional MAP<STRING,STRING> literal specifying directives.

Returns

A STRUCT with field names and types matching the schema definition.

xmlStr should be well-formed with respect to schema and options. If xmlStr cannot be parsed NULL is returned.

schema must be defined as comma-separated column name and data type pairs as used in for example CREATE TABLE.

options, if provided, can be any of the following:

excludeAttribute (default false): Whether to exclude attributes in elements.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
- PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, you can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field in an output schema.
- FAILFAST: throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
inferSchema (default true): if true, attempts to infer an appropriate type for each resulting attribute, like a boolean, numeric or date type. If false, all resulting columns are of string type.
prefersDecimal (default false): infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.
attributePrefix (default _): The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Can be an empty string.
valueTag (default _VALUE): The tag used for the character data within elements that also have attribute(s) or child element(s) elements.
encoding (default UTF-8): decodes the XML files by the specified encoding type.
ignoreSurroundingSpaces (default true): Defines whether surrounding whitespaces from values being read should be skipped.
rowValidationXSDPath: Path to an XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred.
ignoreNamespace (default false): If true, namespaces prefixes on XML elements and attributes are ignored. Tags <abc:author> and <def:author> would, for example, be treated as if both are just <author>. Note that namespaces cannot be ignored on the rowTag element, only its children. Note that XML parsing is in general not namespace-aware even if false.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]): sets the string that indicates a timestamp format. Custom date formats follow the formats at Datetime patterns. This applies to timestamp type.
timestampNTZFormat (default yyyy-MM-dd'T'HH:mm:ss[.SSS]): sets the string that indicates a timestamp without timezone format. Custom date formats follow the formats at Datetime patterns. This applies to TimestampNTZType type.
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at Datetime patterns. This applies to date type.
locale (default is en-US): sets a locale as language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps.
nullValue (default is null): Sets the string representation of a null value.

Examples

> SELECT from_xml('<p><a>1</a><b>0.8</b></p>', 'a INT, b DOUBLE');
 {"a":1,"b":0.8}

> SELECT from_xml('<p><time>26/08/2015</time></p>', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
 { "time": "2015-08-26T00:00:00.000+0000"}

> SELECT from_xml('<p><teacher>Alice</teacher><student><name>Bob</name><rank>1</rank></student><student><name>Charlie</name><rank>2</rank></student></p>',
                  'STRUCT<teacher: STRING, student: ARRAY<STRUCT<name: STRING, rank: INT>>>');
 {"teacher":"Alice","student":[{"name":"Bob","rank":1},{"name":"Charlie","rank":2}]}

from_xml function