Collation

Preview

This feature is in Public Preview.

Applies to: check marked yes Databricks Runtime 16.1 and later

A collation is a set of rules that determines how string comparisons are performed. Collations are used to compare strings in a case-insensitive or accent-insensitive manner or to sort strings in a specific language-aware order.

Strings in Databricks are represented as UTF-8 encoded Unicode characters. By default Databricks compares strings by their binary UTF8 representation. This is known as UTF8_BINARY collation. UTF8_BINARY comparisons are fast and appropriate in many cases, but may not be suitable for all applications, especially those that require language-aware sorting or comparisons.

Aside from language-aware comparisons, a common use case is to require case-insensitive comparisons. Databricks has the UTF8_LCASE collation specifically for this purpose. It converts strings to lowercase before comparing them using the fast UTF8_BINARY collation.

For language-aware comparisons, Databricks employs the following technologies:

These technologies are encapsulated in a set of named collations that can be used in SQL statements.

Collation names

Because identifying collations by their LDML specification can be complex and challenging to read, Databricks has a set of easier-to-use named system collations.

Syntax

{ UTF8_BINARY |
  UTF8_LCASE |
  { UNICODE | locale } [ _ modifier [...] ] }

locale
  language_code [ _ script_code ] [ _ country_code ]

modifier
  { CS | CI | AS | AI }
  • UTF8_BINARY

    A meta-locale binary collation that compares strings byte by byte based on the UTF-8 byte representation. UTF8_BINARY is the default and most lightweight collation for string comparison in Databricks.

    In this collation ‘A’ (x’65’) < ‘B’ (x’66’) < … < ‘Z’ (x’90’).

    However, ‘Z’ (x’90’) < ‘a’ (x’97’), and ‘A’ (x’65’) <> ‘a’ (x’97’).

    Further, characters such as ‘Ä’ (x’C384’) are greater than ‘Z’ and ‘z’ in this collation.

  • UTF8_LCASE

    A lightweight meta-locale case-insensitive collation that compares strings using their UTF-8 byte representation after converting the strings to lower case.

    UTF8_LCASE is the collation used for identifiers in Databricks.

    For example:

    ORDER BY col COLLATE UTF8_LCASE
    

    is equivalent to

    ORDER BY LOWER(col) COLLATE UTF8_BINARY
    
  • UNICODE

    The ICU root locale.

    This collation, known in CLDR as the ‘root’ locale (LDML specification: ‘und-u’) imposes a language agnostic order, which tries to be intuitive overall. In this collation, like characters are grouped. For example: ‘a’ < ‘A’ < ‘Ä’ < ‘b’. ‘A’ is not considered equivalent to ‘a’. Therefore, the collation is case-sensitive. ‘a’ is not considered equivalent to ‘ä’. Therefore, the collation is accent-sensitive.

  • locale

    A locale-aware collation based on the CLDR tables.

    The locale is specified as a language code, an optional script code, and an optional country code. locale is case-insensitive.

  • modifier

    Specifies the collation behavior regarding case sensitivity and accent sensitivity.

    • CS: Case-sensitive. The default behavior.

    • CI: Case-insensitive.

    • AS: Accent-sensitive. The default behavior.

    • AI: Accent-insensitive.

    You can specify either CS or CI, and either AS or AI at most once and in any order. The modifiers themselves are case-insensitive.

When processing a collation, Databricks normalizes collation names by removing defaults. For example, SR_CYR_SRN_CS_AS is normalized to SR.

For a list of supported collations, see Supported collations.

Examples

-- You can fully qualify collations, and case doesn't matter.
system.builtin.unicode

-- Since all collations are system defined you don't need to qualify them
unicode

-- Using 2-letter language code only for german collation
DE

-- Using 2-letter language code and 3-letter country code for french-canadian collation
-- It is common to use lower case 2-letter language codes and upper case 3-letter country codes
-- But collation names are case insensitive
fr_CAN

-- Using 2-letter language code and 4-letter script code and 3-letter country code for traditional chinese in Macao
zh-Hant-MAC

-- Using a 2 letter german language code and 2 modifiers for case insensitive and accent insensitive
-- So 'Ä', 'A', and 'a' are all considered equal
de_CI_AI

-- Using back ticks is allowed, but unnecessary for builtin collations
`UTF8_BINARY`

Default collation

The default collation applies when using STRING literals, parameter markers, functions without STRING parameters producing strings, and when defining column, field or variable types without a COLLATE clause.

The default collation is derived in UTF8_BINARY.

Collation precedence

To decide which collation to use for a given string Databricks defines collation precedence rules.

The rules assign 4 levels of precedence to collations:

  1. Explicit

    The collation has been explicitly assigned to a string using COLLATE expression.

    Examples

    -- Force fast binary collation to check whether a vin matches a Ferrari
    vin COLLATE UTF8_BINARY LIKE 'ZFF%'
    
    -- Force German collation to order German first names
    ORDER BY vorname COLLATE DE
    
  2. Implicit

    The collation is implicitly assigned by the column, field, column-alias, variable, or routine parameter reference. This includes the result of a subquery as long as the collation is not None.

    Examples

    -- Use the collation of the column as it was defined
    employee.name LIKE 'Mc%'
    
    -- Use the collation of the variable as it was defined.
    translate(session.tempvar, 'Z', ',')
    
  3. Default

    A STRING literal, named or unnamed parameter marker, or a STRING produced by a function from another type.

    Examples

    -- A literal string has the default collation
    'Hello'
    
    -- :parm1 is a parameter marker using session default collation
    EXECUTE IMMEDIATE 'SELECT :parm1' USING 'Hello' AS parm1;
    
    -- ? is a parameter marker using session default collation
    EXECUTE IMMEDIATE 'SELECT ?' USING 'Hello';
    
    -- The result of a cast of a non-STRING to a STRING is a STRING with the default collation
    CAST(5 AS STRING)
    
    -- The date is converted to a string using the default collation
    to_char(DATE'2016-04-08', 'y')
    
    -- The collation of the session_user STRING is the default collation
    session_user()
    

    The assigned collation is the Default Collation.

  4. None

    A STRING result of a function, operator or set operation (e.g. UNION) that takes more than one STRING argument which have different implicit collations.

    Examples

    -- Concatenating two strings with different explicit collations results in no collation
    SELECT fr || de AS freutsch FROM VALUES('Voulez vous ' COLLATE FR), 'Kartoffelsupp...' COLLATE DE) AS T(fr, de)
    
    -- A union of two strings with different excplicit collations results in no collation
    SELECT 'Voulez vous ' COLLATE FR UNION ALL SELECT 'Kartoffelsupp...' COLLATE DE
    

Collation derivation

When deriving the collation for a STRING result, the collation precedence rules are applied in the following ways:

If the expression:

  1. matches the definitions above

    The collation and precedence is as defined.

  2. is a function or operator with a single STRING parameter, returning a STRING

    The collation and precedence is that of the STRING parameter.

  3. is a function or operator with two or more STRING parameters

    1. with the same collations and precedence

      The collation and precedence is that of the STRING parameters.

    2. with different collations or precedence

      Let C1 and C2 be distinct collations and let D be the default collation. The precedence and the collation is determined by the following table:

      Collation and Precedence

      C1 Explicit

      C1 Implicit

      D Default

      None

      C2 Explicit

      Error

      C2 Explicit

      C2 Explicit

      C2 Explicit

      C2 Implicit

      Explicit C1

      None

      C2 Implicit

      None

      D Default

      C1 Explicit

      C1 Implicit

      D Default

      None

      None

      C1 Explicit

      None

      None

      None

Examples

> CREATE TABLE words(fr STRING COLLATE FR, de STRING COLLATE DE, en STRING COLLATE EN);
> INSERT INTO words VALUES ('Salut', 'Hallo', 'Hello');

-- A literal string has the default collation
> SELECT collation('Ciao');
  UTF8_BINARY

-- A function producing a STRING has the default collation
> SELECT collation(user());
  UTF8_BINARY

-- Function modifying a STRING passes the collation through
> SELECT collation(upper('Ciao'));
  UTF8_BINARY

-- Implicit collation (French) wins over default collation
> SELECT collation(fr || 'Ciao') FROM words;
  FR

-- Explicit collation (French) wins over implicit collation (German)
> SELECT collation('Salut' COLLATE FR || de) FROM words;
  FR

-- Implicit collation German collides with implicit collation French
-- The result is no collation
> SELECT collation(de || fr) FROM words;
  null

-- Explicit collation (French) wins over default collation (Italian)
> SELECT collation('Salut' COLLATE FR || 'Ciao');
  FR

-- Explicit collation (French) collides with explicit collation (German)
> SELECT collation('Salut' COLLATE FR || 'Hallo' COLLATE DE);
  COLLATION_MISMATCH.EXPLICIT

-- The explicit collation wins over no collation
> SELECT collation('Ciao' COLLATE IT || (fr || de)) FROM words;
  UTF8_BINARY

-- The implict collation (English) does not win over None
> SELECT collation(en || (fr || de)) FROM words;
  null

-- The explicit collation (English) wins over Implicit collation anywhere in the expression
> SELECT collation((fr || ltrim('H' COLLATE EN, fr)) || fr) FROM words;
  EN