SQL Pipeline Syntax

Applies to: check marked yes Databricks Runtime 16.2 and later

Databricks supports SQL pipeline syntax which allows composing queries from combinations of chained operators.

Any query can have zero or more pipe operators as a suffix, delineated by the pipe character |>.
Each piped operation starts with one or more SQL keywords followed by its own grammar.
Operators can apply in any order, any number of times.
Typically FROM relation_name is used to start a pipeline, but any query can start a pipeline.

Syntax

{ FROM | TABLE } relation_name { |> piped_operation } [ ...]

Parameters

relation_name

Identifies a table or view to be used as the input for the pipeline. If the table or view cannot be found, Databricks raises a TABLE_OR_VIEW_NOT_FOUND error.
piped_operation

An operation consuming the preceding query or pipe operation. See piped_operation for details.

Example

This is query 13 from the TPC-H benchmark written in ANSI SQL:

> SELECT c_count, COUNT(*) AS custdist
    FROM
    (SELECT c_custkey, COUNT(o_orderkey) c_count
      FROM customer
      LEFT OUTER JOIN orders ON c_custkey = o_custkey
                             AND o_comment NOT LIKE '%unusual%packages%'
     GROUP BY c_custkey
  ) AS c_orders
  GROUP BY c_count
  ORDER BY custdist DESC, c_count DESC;

To write the same logic using SQL pipe operators, you can express it like this:

> FROM customer
  |> LEFT OUTER JOIN orders ON c_custkey = o_custkey
                            AND o_comment NOT LIKE '%unusual%packages%'
  |> AGGREGATE COUNT(o_orderkey) c_count
     GROUP BY c_custkey
  |> AGGREGATE COUNT(*) AS custdist
     GROUP BY c_count
  |> ORDER BY custdist DESC, c_count DESC;