SQL Pipeline Syntax
Applies to: Databricks Runtime 16.2 and later
Databricks supports SQL pipeline syntax which allows composing queries from combinations of chained operators.
Any query can have zero or more pipe operators as a suffix, delineated by the pipe character
|>
.Each piped operation starts with one or more SQL keywords followed by its own grammar.
Operators can apply in any order, any number of times.
Typically FROM relation_name is used to start a pipeline, but any query can start a pipeline.
Parameters
-
Identifies a table or view to be used as the input for the pipeline. If the table or view cannot be found, Databricks raises a TABLE_OR_VIEW_NOT_FOUND error.
-
An operation consuming the preceding query or pipe operation. See piped_operation for details.
Example
This is query 13 from the TPC-H benchmark written in ANSI SQL:
> SELECT c_count, COUNT(*) AS custdist
FROM
(SELECT c_custkey, COUNT(o_orderkey) c_count
FROM customer
LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
GROUP BY c_custkey
) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
To write the same logic using SQL pipe operators, you can express it like this:
> FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;