approx_top_k
aggregate function
Applies to: Databricks SQL Databricks Runtime 10.4 LTS and above
Returns the top k
most frequently occurring item values in an expr
along with their approximate counts.
Syntax
approx_top_k(expr[, k[, maxItemsTracked]]) [FILTER ( WHERE cond ) ]
This function can also be invoked as a window function using the OVER
clause.
Arguments
expr
: An expression of STRING, BOOLEAN, DATE, TIMESTAMP, or numeric type.k
: An optional INTEGER literal greater than 0. Ifk
is not specified, it defaults to5
.maxItemsTracked
: An optional INTEGER literal greater than or equal tok
. IfmaxItemsTracked
is not specified, it defaults to10000
.cond
: An optional boolean expression filtering the rows used for aggregation.
Returns
Results are returned as an ARRAY of type STRUCT, where each STRUCT contains an item
field for the value (with its
original input type) and a count
field (of type LONG) with the approximate number of occurrences. The array is sorted
by count
descending.
The aggregate function returns the top k
most frequently occurring item values in an expression expr
along with
their approximate counts. The error in each count may be up to 2.0 * numRows / maxItemsTracked
where numRows
is the
total number of rows. Higher values of maxItemsTracked
provide better accuracy at the cost of increased memory usage.
Expressions that have fewer than maxItemsTracked
distinct items will yield exact item counts. Results include NULL
values as their own item in the results.
Examples
> SELECT approx_top_k(expr) FROM VALUES (0), (0), (1), (1), (2), (3), (4), (4) AS tab(expr);
[{'item':4,'count':2},{'item':1,'count':2},{'item':0,'count':2},{'item':3,'count':1},{'item':2,'count':1}]
> SELECT approx_top_k(expr, 2) FROM VALUES 'a', 'b', 'c', 'c', 'c', 'c', 'd', 'd' AS tab(expr);
[{'item':'c','count',4},{'item':'d','count':2}]
> SELECT approx_top_k(expr, 10, 100) FROM VALUES (0), (1), (1), (2), (2), (2) AS tab(expr);
[{'item':2,'count':3},{'item':1,'count':2},{'item':0,'count':1}]