`approx_count_distinct` aggregate function

Applies to: check marked yes Databricks SQL Databricks Runtime

Returns the estimated number of distinct values in expr within the group.

The implementation uses the dense version of the HyperLogLog++ (HLL++) algorithm, a state of the art cardinality estimation algorithm.

Results are accurate within a default value of 5%, which derives from the value of the maximum relative standard deviation, although this is configurable with the relativeSD parameter as mentioned below.

Syntax

approx_count_distinct(expr[, relativeSD]) [FILTER ( WHERE cond ) ]

This function can also be invoked as a window function using the OVER clause.

Arguments

expr: Can be of any type for which equivalence is defined.
relativeSD: Defines the maximum relative standard deviation allowed.
cond: An optional boolean expression filtering the rows used for aggregation.

Returns

A BIGINT.

Examples

> SELECT approx_count_distinct(col1) FROM VALUES (1), (1), (2), (2), (3) tab(col1);
 3

> SELECT approx_count_distinct(col1) FILTER(WHERE col2 = 10)
    FROM VALUES (1, 10), (1, 10), (2, 10), (2, 10), (3, 10), (1, 12) AS tab(col1, col2);
 3

approx_count_distinct aggregate function

Syntax

Arguments

Returns

Examples

Related functions

`approx_count_distinct` aggregate function