from_xml
function
Applies to: Databricks SQL Databricks Runtime 14.1 and above
Preview
This feature is in Public Preview.
Returns a struct value parsed from the xmlStr
using schema
.
Arguments
xmlStr
: ASTRING
expression specifying a single XML recordschema
: ASTRING
expression or invocation of schema_of_xml function.options
: An optionalMAP<STRING,STRING>
literal specifying directives.
Returns
A STRUCT
with field names and types matching the schema definition.
xmlStr
should be well-formed with respect to schema
and options
. If xmlStr
cannot be parsed NULL
is returned.
schema
must be defined as comma-separated column name and data type pairs as used in for example CREATE TABLE
.
options
, if provided, can be any of the following:
excludeAttribute
(defaultfalse
): Whether to exclude attributes in elements.mode
(defaultPERMISSIVE
): allows a mode for dealing with corrupt records during parsing.PERMISSIVE
: when it meets a corrupted record, puts the malformed string into a field configured bycolumnNameOfCorruptRecord
, and sets malformed fields to null. To keep corrupt records, you can set a string type field namedcolumnNameOfCorruptRecord
in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds acolumnNameOfCorruptRecord
field in an output schema.FAILFAST
: throws an exception when it meets corrupted records.
columnNameOfCorruptRecord
(default is the value specified inspark.sql.columnNameOfCorruptRecord
): allows renaming the new field having malformed string created byPERMISSIVE
mode. This overridesspark.sql.columnNameOfCorruptRecord
.inferSchema
(defaulttrue
): iftrue
, attempts to infer an appropriate type for each resulting attribute, like a boolean, numeric or date type. Iffalse
, all resulting columns are of string type.prefersDecimal
(defaultfalse
): infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.attributePrefix
(default_
): The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Can be an empty string.valueTag
(default_VALUE
): The tag used for the character data within elements that also have attribute(s) or child element(s) elements.encoding
(default UTF-8): decodes the XML files by the specified encoding type.ignoreSurroundingSpaces
(defaulttrue
): Defines whether surrounding whitespaces from values being read should be skipped.rowValidationXSDPath
: Path to an XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred.ignoreNamespace
(defaultfalse
): Iftrue
, namespaces prefixes on XML elements and attributes are ignored. Tags<abc:author>
and<def:author>
would, for example, be treated as if both are just<author>
. Note that namespaces cannot be ignored on therowTag
element, only its children. Note that XML parsing is in general not namespace-aware even if false.timestampFormat
(defaultyyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
): sets the string that indicates a timestamp format. Custom date formats follow the formats at Datetime patterns. This applies to timestamp type.timestampNTZFormat
(defaultyyyy-MM-dd'T'HH:mm:ss[.SSS]
): sets the string that indicates a timestamp without timezone format. Custom date formats follow the formats at Datetime patterns. This applies to TimestampNTZType type.dateFormat
(defaultyyyy-MM-dd
): sets the string that indicates a date format. Custom date formats follow the formats at Datetime patterns. This applies to date type.locale
(default isen-US
): sets a locale as language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps.nullValue
(default isnull
): Sets the string representation of a null value.
Examples
> SELECT from_xml('<p><a>1</a><b>0.8</b></p>', 'a INT, b DOUBLE');
{"a":1,"b":0.8}
> SELECT from_xml('<p><time>26/08/2015</time></p>', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
{ "time": "2015-08-26T00:00:00.000+0000"}
> SELECT from_xml('<p><teacher>Alice</teacher><student><name>Bob</name><rank>1</rank></student><student><name>Charlie</name><rank>2</rank></student></p>',
'STRUCT<teacher: STRING, student: ARRAY<STRUCT<name: STRING, rank: INT>>>');
{"teacher":"Alice","student":[{"name":"Bob","rank":1},{"name":"Charlie","rank":2}]}