Piped operation
Applies to: Databricks Runtime 16.2 and later
Processes the result of the preceding query using a chained operation.
Syntax
{ SELECT clause |
EXTEND { expression [ [ AS ] column_alias ] } [ , ...] |
SET { column_name = expression } [, ...] |
DROP column_name [, ...] |
AS table_alias |
WHERE clause |
{ LIMIT clause |
OFFSET clause |
LIMIT clause OFFSET clause } |
aggregation |
JOIN clause |
ORDER BY clause |
set_operator |
TABLESAMPLE clause
PIVOT clause
UNPIVOT clause }
aggregation
AGGREGATE aggregate_expr [ [ AS ] column_alias ] [, ...]
[ GROUP BY grouping_expr [AS column_alias ] ]
Parameters
-
Collects the columns to be returned from the query, including the execution of expressions, and deduplication.
The column list must not contain aggregate functions. Use the
AGGREGATE
operation for that purpose. EXTEND
Appends new columns to the query select list.
-
A combination of one or more values, operators, and SQL functions that evaluates to a value.
expression
may contain references to columns in the query select list, as well as precedingcolumn_alias
in thisEXTEND
clause. -
An optional column identifier naming the expression result. If no
column_alias
is provided Databricks derives one.
-
SET
Overrides existing columns in the query select list with new values.
The operation is performed in the order of appearance in the
SET
clause. The result of any expression can observe the columns updated by preceding expressions.-
The name of the column to be updated. If the column does not exist, Databricks raises a UNRESOLVED_COLUMN error.
expression
A combination of one or more values, operators, and SQL functions that evaluates to a value.
-
DROP column_name [, …]
Removes columns from the query select list.
If the column does not exist, Databricks raises a UNRESOLVED_COLUMN error.
AS table_alias
Assigns an name to the result of the query.
-
Filters the result of the query based on the supplied predicates.
-
Limits the maximum number of rows that can be returned by the query. This clause usually follows an ORDER BY to produce a deterministic result.
-
Skips a number of rows returned by the query. This clause is typically used in conjunction with LIMIT to page through a result set, and ORDER BY to produce a deterministic result.
aggregation
Aggregates the result set of the query based on the supplied expressions and optional grouping expressions.
This operation produces a result set where the grouping columns appear before the aggregated columns.
AGGREGATE
Specifies the expressions to aggregate.
-
An expression containing one or more aggregate functions. See GROUP BY for more information.
-
GROUP BY
Specifies by which expressions the rows are grouped. If not specified, all rows are treated as a single group.
-
An optional column identifier naming the expression result. If no
column_alias
is provided Databricks derives one.
-
Combines two or more relations using a join. See JOIN for more information.
-
Orders the rows of the result set of the query. The output rows are ordered across the partitions. This parameter is mutually exclusive with
SORT BY
,CLUSTER BY
, andDISTRIBUTE BY
and cannot be specified together. -
Combines the query with one or more subqueries using
UNION
,EXCEPT
, orINTERSECT
operators. -
Reduces the size of the result set by only sampling a fraction of the rows.
-
Used for data perspective. You can get the aggregated values based on specific column values. See PIVOT for more information.
-
Used for data perspective. You can split multiple column groups into rows. See UNPIVOT for more information.
Examples
-- This query
> FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;
is equivalent to:
> SELECT c_count, COUNT(*) AS custdist
FROM
(SELECT c_custkey, COUNT(o_orderkey) c_count
FROM customer
LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%' GROUP BY c_custkey
) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
-- Using the SELECT clause following a FROM clause
> CREATE TABLE t AS VALUES (0), (1) AS t(col);
> FROM t
|> SELECT col * 2 AS result;
result
------
0
2
-- Adding columns to the result set
> VALUES (0), (1) tab(col)
|> EXTEND col * 2 AS result;
col result
--- ------
0 0
1 2
-- Replacing an expression
> VALUES (0), (1) tab(col)
|> SET col = col * 2;
col
---
0
2
-- Removing a column from the result set
> VALUES (0, 1) tab(col1, col2)
|> DROP col1;
col2
----
1
-- Using a table alias
> VALUES (0, 1) tab(col1, col2)
|> AS new_tab
|> SELECT col1 + col2 FROM new_tab;
col1 + col2
1
-- Filtering the result set
> VALUES (0), (1) tab(col)
|> WHERE col = 1;
col
---
1
-- Using LIMIT to truncate the result
> VALUES (0), (0) tab(col)
|> LIMIT 1;
col
---
0
-- Full-table aggregation
> VALUES (0), (1) tab(col)
|> AGGREGATE COUNT(col) AS count;
count
-----
2
-- Aggregation with grouping
> VALUES (0, 1), (0, 2) tab(col1, col2)
|> AGGREGATE COUNT(col2) AS count GROUP BY col1;
col1 count
---- -----
0 2
-- Using JOINs
> SELECT 0 AS a, 1 AS b
|> AS lhs
|> JOIN VALUES (0, 2) rhs(a, b) ON (lhs.a = rhs.a);
a b c d
--- --- --- ---
0 1 0 2
> VALUES ('apples', 3), ('bananas', 4) t(item, sales)
|> AS produce_sales
|> LEFT JOIN
(SELECT "apples" AS item, 123 AS id) AS produce_data
USING (item)
|> SELECT produce_sales.item, sales, id;
item sales id
--------- ------- ------
apples 3 123
bananas 4 NULL
-- Using ORDER BY
> VALUES (0), (1) tab(col)
|> ORDER BY col DESC;
col
---
1
0
> VALUES (0), (1) tab(a, b)
|> UNION ALL VALUES (2), (3) tab(c, d);
a b
--- ----
0 1
2 3
-- Sampling the result set
> VALUES (0), (0), (0), (0) tab(col)
|> TABLESAMPLE (1 ROWS);
col
---
0
> VALUES (0), (0) tab(col)
|> TABLESAMPLE (100 PERCENT);
col
---
0
0
-- Pivoting a query
> VALUES
("dotNET", 2012, 10000),
("Java", 2012, 20000),
("dotNET", 2012, 5000),
("dotNET", 2013, 48000),
("Java", 2013, 30000)
AS courseSales(course, year, earnings)
|> PIVOT (
SUM(earnings)
FOR COURSE IN ('dotNET', 'Java')
)
year dotNET Java
---- ------ ------
2012 15000 20000
2013 48000 30000
-- Using UNPIVOT
> VALUES
("dotNET", 2012, 10000),
("Java", 2012, 20000),
("dotNET", 2012, 5000),
("dotNET", 2013, 48000),
("Java", 2013, 30000)
AS courseSales(course, year, earnings)
|> UNPIVOT (
earningsYear FOR `year` IN (`2012`, `2013`, `2014`)
course year earnings
-------- ------ --------
Java 2012 20000
Java 2013 30000
dotNET 2012 15000
dotNET 2013 48000
dotNET 2014 22500