Piped operation

Applies to: Databricks Runtime 16.2 and later

Processes the result of the preceding query using a chained operation.

Syntax

{ SELECT clause |
  EXTEND { expression [ [ AS ] column_alias ] } [ , ...] |
  SET { column_name = expression } [, ...] |
  DROP column_name [, ...] |
  AS table_alias |
  WHERE clause |
  { LIMIT clause |
    OFFSET clause |
    LIMIT clause OFFSET clause } |
  aggregation |
  JOIN clause |
  ORDER BY clause |
  set_operator |
  TABLESAMPLE clause |
  PIVOT clause |
  UNPIVOT clause }

 aggregation
   AGGREGATE aggregate_expr [ [ AS ] column_alias ] [, ...]
   [ GROUP BY grouping_expr [ AS column_alias ] ]

Parameters

  • SELECT clause

    Selects the columns to return from the query, including the evaluation of expressions and optional deduplication of rows.

    The column list must not contain aggregate functions. Use the AGGREGATE operation for that purpose.

  • EXTEND

    Appends new columns to the query select list.

    • expression

      A combination of one or more values, operators, and SQL functions that evaluates to a value.

      expression may contain references to columns in the query select list, as well as preceding column_alias in this EXTEND clause.

    • column_alias

      An optional column identifier naming the expression result. If no column_alias is provided, Databricks derives one.

  • SET

    Overrides existing columns in the query select list with new values.

    Assignments are applied in the order they appear in the SET clause. Each expression can observe the columns updated by preceding assignments.

    • column_name

      The name of the column to be updated. If the column does not exist, Databricks raises an UNRESOLVED_COLUMN error.

    • expression

      A combination of one or more values, operators, and SQL functions that evaluates to a value.

  • DROP column_name [, …]

    Removes columns from the query select list.

    If a column does not exist, Databricks raises an UNRESOLVED_COLUMN error.

  • AS table_alias

    Assigns a name to the result of the query.

  • WHERE

    Filters the result of the query based on the supplied predicates.

  • LIMIT

    Limits the maximum number of rows that can be returned by the query. This clause usually follows an ORDER BY to produce a deterministic result.

  • OFFSET

    Skips a number of rows returned by the query. This clause is typically used in conjunction with LIMIT to page through a result set, and ORDER BY to produce a deterministic result.

    Note

    When paging through a result set using LIMIT and OFFSET all rows are processed, including skipped rows. However, only the specified subset of rows is returned in the result set. Pagination with this technique is not advised for resource-intensive queries.

  • aggregation

    Aggregates the result set of the query based on the supplied expressions and optional grouping expressions.

    This operation produces a result set where the grouping columns appear before the aggregated columns.

    • AGGREGATE

      Specifies the expressions to aggregate.

    • GROUP BY

      Specifies by which expressions the rows are grouped. If not specified, all rows are treated as a single group.

      • grouping_expr

        An expression identifying the grouping columns. See GROUP BY for more information.

        Unlike in a generic GROUP BY clause, an integral numeric value identifies a column of the query that supplies the input, not a column of the generated result set.

    • column_alias

      An optional column identifier naming the expression result. If no column_alias is provided, Databricks derives one.

  • JOIN

    Combines two or more relations using a join. See JOIN for more information.

  • ORDER BY

    Orders the rows of the result set of the query. The output rows are ordered across the partitions. This operation is mutually exclusive with SORT BY, CLUSTER BY, and DISTRIBUTE BY.

  • set_operator

    Combines the query with one or more subqueries using UNION, EXCEPT, or INTERSECT operators.

  • TABLESAMPLE

    Reduces the size of the result set by only sampling a fraction of the rows.

  • PIVOT

    Rotates rows into columns, producing aggregated values for specific column values. See PIVOT for more information.

  • UNPIVOT

    Rotates groups of columns into rows. See UNPIVOT for more information.

Examples

-- This query
> FROM customer
  |> LEFT OUTER JOIN orders ON c_custkey = o_custkey
     AND o_comment NOT LIKE '%unusual%packages%'
  |> AGGREGATE COUNT(o_orderkey) c_count
     GROUP BY c_custkey
  |> AGGREGATE COUNT(*) AS custdist
     GROUP BY c_count
  |> ORDER BY custdist DESC, c_count DESC;

is equivalent to:

> SELECT c_count, COUNT(*) AS custdist
  FROM
    (SELECT c_custkey, COUNT(o_orderkey) c_count
     FROM customer
     LEFT OUTER JOIN orders ON c_custkey = o_custkey
       AND o_comment NOT LIKE '%unusual%packages%' GROUP BY c_custkey
    ) AS c_orders
  GROUP BY c_count
  ORDER BY custdist DESC, c_count DESC;

-- Using the SELECT clause following a FROM clause
> CREATE TABLE t AS VALUES (0), (1) AS t(col);

> FROM t
  |> SELECT col * 2 AS result;
 result
 ------
      0
      2

-- Adding columns to the result set
> VALUES (0), (1) tab(col)
  |> EXTEND col * 2 AS result;
 col result
 --- ------
   0      0
   1      2
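
-- An EXTEND expression can also reference a column_alias defined earlier
-- in the same clause; a small sketch following the pattern above
> VALUES (1) tab(col)
  |> EXTEND col * 2 AS doubled, doubled + 1 AS plus_one;
 col doubled plus_one
 --- ------- --------
   1       2        3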

-- Replacing an expression
> VALUES (0), (1) tab(col)
  |> SET col = col * 2;
 col
 ---
   0
   2
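
-- Because SET assignments are applied in order of appearance, a later
-- assignment observes the result of an earlier one; a sketch
> VALUES (1, 1) tab(col1, col2)
  |> SET col1 = col1 * 2, col2 = col1 + 1;
 col1 col2
 ---- ----
    2    3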

-- Removing a column from the result set
> VALUES (0, 1) tab(col1, col2)
  |> DROP col1;
 col2
 ----
    1

-- Using a table alias
> VALUES (0, 1) tab(col1, col2)
  |> AS new_tab
  |> SELECT col1 + col2 FROM new_tab;
  col1 + col2
  -----------
            1

-- Filtering the result set
> VALUES (0), (1) tab(col)
  |> WHERE col = 1;
 col
 ---
   1

-- Using LIMIT to truncate the result
> VALUES (0), (0) tab(col)
  |> LIMIT 1;
 col
 ---
   0
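
-- A sketch of paging with the combined LIMIT ... OFFSET form, with
-- ORDER BY added to keep the result deterministic
> VALUES (0), (1), (2) tab(col)
  |> ORDER BY col
  |> LIMIT 1 OFFSET 1;
 col
 ---
   1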

-- Full-table aggregation
> VALUES (0), (1) tab(col)
  |> AGGREGATE COUNT(col) AS count;
 count
 -----
     2

-- Aggregation with grouping
> VALUES (0, 1), (0, 2) tab(col1, col2)
  |> AGGREGATE COUNT(col2) AS count GROUP BY col1;
 col1 count
 ---- -----
    0     2
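
-- As noted under grouping_expr, an integral value in GROUP BY refers to
-- a column of the operation's input, not of the generated result set;
-- here 1 resolves to input column col1
> VALUES (0, 1), (0, 2) tab(col1, col2)
  |> AGGREGATE COUNT(col2) AS count GROUP BY 1;
 col1 count
 ---- -----
    0     2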

-- Using JOINs
> SELECT 0 AS a, 1 AS b
  |> AS lhs
  |> JOIN VALUES (0, 2) rhs(a, b) ON (lhs.a = rhs.a);
   a   b   a   b
 --- --- --- ---
   0   1   0   2

> VALUES ('apples', 3), ('bananas', 4) t(item, sales)
  |> AS produce_sales
  |> LEFT JOIN
       (SELECT "apples" AS item, 123 AS id) AS produce_data
       USING (item)
  |> SELECT produce_sales.item, sales, id;
   item      sales   id
  --------- ------- ------
  apples    3       123
  bananas   4       NULL

-- Using ORDER BY
> VALUES (0), (1) tab(col)
  |> ORDER BY col DESC;
 col
 ---
   1
   0

-- Combining queries with a set operator
> VALUES (0, 1) tab(a, b)
  |> UNION ALL VALUES (2, 3) tab(c, d);
   a    b
 --- ----
   0    1
   2    3
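
-- The set_operator operation also accepts EXCEPT and INTERSECT; a sketch
-- using INTERSECT, which deduplicates by default
> VALUES (0), (1) tab(col)
  |> INTERSECT VALUES (1), (2) tab(col);
 col
 ---
   1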

-- Sampling the result set
> VALUES (0), (0), (0), (0) tab(col)
  |> TABLESAMPLE (1 ROWS);
 col
 ---
   0

> VALUES (0), (0) tab(col)
  |> TABLESAMPLE (100 PERCENT);
 col
 ---
   0
   0

-- Pivoting a query
> VALUES
    ("dotNET", 2012, 10000),
    ("Java", 2012, 20000),
    ("dotNET", 2012, 5000),
    ("dotNET", 2013, 48000),
    ("Java", 2013, 30000)
    AS courseSales(course, year, earnings)
  |> PIVOT (
       SUM(earnings)
       FOR course IN ('dotNET', 'Java')
     );
 year dotNET   Java
 ---- ------ ------
 2012  15000  20000
 2013  48000  30000

-- Using UNPIVOT
> VALUES
    ("dotNET", 15000, 48000, 22500),
    ("Java", 20000, 30000, NULL)
  AS courseEarnings(course, `2012`, `2013`, `2014`)
  |> UNPIVOT (
    earningsYear FOR `year` IN (`2012`, `2013`, `2014`)
  );
  course   year earningsYear
 ------- ------ ------------
    Java   2012        20000
    Java   2013        30000
  dotNET   2012        15000
  dotNET   2013        48000
  dotNET   2014        22500