approx_count_distinct aggregate function

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime

Returns the estimated number of distinct values in expr within the group.

The implementation uses the dense version of the HyperLogLog++ (HLL++) algorithm, a state of the art cardinality estimation algorithm.

Results are accurate within a default value of 5%, which derives from the value of the maximum relative standard deviation, although this is configurable with the relativeSD parameter as mentioned below.

Syntax

approx_count_distinct(expr[, relativeSD]) [FILTER ( WHERE cond ) ]

This function can also be invoked as a window function using the OVER clause.

Arguments

  • expr: Can be of any type for which equivalence is defined.

  • relativeSD: Defines the maximum relative standard deviation allowed.

  • cond: An optional boolean expression filtering the rows used for aggregation.

Returns

A BIGINT.

Examples

> SELECT approx_count_distinct(col1) FROM VALUES (1), (1), (2), (2), (3) tab(col1);
 3

> SELECT approx_count_distinct(col1) FILTER(WHERE col2 = 10)
    FROM VALUES (1, 10), (1, 10), (2, 10), (2, 10), (3, 10), (1, 12) AS tab(col1, col2);
 3