Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.ocient.com/llms.txt

Use this file to discover all available pages before exploring further.

The HyperLogLog (HLL) sketch family of functions provides an approximate count of the number of unique elements in one or more columns. HLL functionality is similar to running a query using a COUNT(DISTINCT col) clause and uses the same mechanisms as the aggregate function APPROX_COUNT_DISTINCT. The tradeoff for HLL sketches is accuracy. For further explanation of this tradeoff in accuracy, see the DataSketches documentation. Each sketch is an approximate and compact representation of the original data, and it introduces a small margin of error. The HLL implementation is based on the algorithm outlined in the research paper HyperLogLog in Practice. This implementation uses log2k = 11 by default, which provides a percent error of approximately 5% at the 95% confidence interval.

HLL Sketch Functions in the SQL Statement

Sketches are useful on aggregated tables that require estimates of distinct counts. This example creates a table for the new sketch columns ip_address_sketch and user_id_sketch as part of a CREATE TABLE AS SELECT statement. The new table inserts values from the master_agg_table table.
SQL
CREATE TABLE my_sketch_table (
   col_a INT,
   col_b VARCHAR(255),
   ip_address_sketch HLL_SKETCH(11),
   user_id_sketch HLL_SKETCH(11)
 )
AS (
    SELECT
    col_a,
    col_b,
    HLL_SKETCH_CREATE(ip_address) as ip_address_sketch,
    HLL_SKETCH_CREATE(user_id) as user_id_sketch
FROM
    master_agg_table
    GROUP by 1,2);
After the database creates the new table, you can query for the approximate count of distinct values. This example merges the sketch columns by using an HLL_SKETCH_UNION function. After the database merges the columns, the database converts sketch values to a distinct count estimate by using the HLL_SKETCH_GET_ESTIMATE function.
SQL
SELECT
    HLL_SKETCH_GET_ESTIMATE(HLL_SKETCH_UNION(ip_address_sketch)) AS ip_address_sketch,
    HLL_SKETCH_GET_ESTIMATE(HLL_SKETCH_UNION(user_id_sketch)) AS user_id_sketch
FROM
    my_sketch_table;

HLL Accuracy with log2k Parameter

In addition to the default log2k = 11 implementation, you can specify a log2k value in the range of [10, 16] when creating an HLL Sketch. This parameter controls the number of buckets used in the HLL algorithm, which controls the accuracy of the sketch upper bounded by 1.04 / sqrt(k). The tradeoff with larger log2k values is that the sketch size grows exponentially. See this table for specifics on how the precision values affect size and accuracy.
PrecisionUncompressed Size95% CI
101032 B±6.50%
11 (default)2056 B±4.60%
124104 B±3.25%
138200 B±2.30%
1416392 B±1.63%
1532776 B±1.15%
1665544 B±0.81%
By default, log2k values larger than 11 are compressed on disk using ZSTD compression.
When creating or referencing a HLL Sketch column in a CREATE TABLE, CTAS or IAS statement, you can use the type alias HLL_SKETCH(log2k) instead of the internal HASH((2^log2k) + 8).This type alias is supported in EXPORT statements as well.
Example In this example, a CREATE TABLE statement creates two HLL Sketch columns with specified log2k precision values.
  • sketch_a has a precision value of 10, meaning it has low precision and low storage requirements.
  • sketch_b has a precision value of 15, meaning it has high precision and high storage requirements.
Note that the precision values specified in both HLL_SKETCH_CREATE statements match the precision values in the CREATE TABLE columns.
SQL
CREATE TABLE my_sketch_table (
    sketch_a HLL_SKETCH(10),
    sketch_b HLL_SKETCH(15)
 )
AS (
    SELECT
    HLL_SKETCH_CREATE(col_a, 10) as sketch_a,
    HLL_SKETCH_CREATE(col_b, 15) as sketch_b
FROM
    master_agg_table
    GROUP by 1,2);

Supported HLL Sketch Functions

HLL_SKETCH_CREATE

Creates an HLL sketch from the data on a specified aggregated column. Returns a HASH((2^log2k) + 8) data representation of the sketch that you can store in a separate column.
The HLL algorithm depends on the hash value of its input. When merging multiple sketches, it is important that each sketch derives from the same type in order to get accurate results. 1::BIGINT might hash differently than 1::INT, and thus, the HLL algorithm might not view these two hashes as referring to the same value 1.For accurate results, users should cast numeric columns or expressions to a consistent type before they are used in a HLL_SKETCH_CREATE function, especially if they will be merged later on.
Syntax
SQL
HLL_SKETCH_CREATE(agg_col, [ log2k ] )
ArgumentsData TypeDescription
agg_colAll data types are supported.An aggregated column for use when you create a sketch.
log2kIntegral literalOptional.
This parameter controls the number of buckets in the HLL Sketch, which affects the accuracy and storage of the sketch. This value must be an integer literal in the range of [10, 16]. It cannot be a reference to a column.
By default, log2k values larger than 11 are compressed on disk using ZSTD compression.
For details about this parameter, see HLL Accuracy with log2k Parameter.
Example This example uses HLL_SKETCH_CREATE as a window aggregate to generate 100 sketches that contain one value in each sketch.
SQL
SELECT MOD(c1, 9) AS m,
    HLL_SKETCH_CREATE(c1, 12) OVER (PARTITION BY c1) AS sketch
FROM sys.dummy100;

HLL_SKETCH_UNION (aggregate function)

Merges multiple sketches in a single column into a unified sketch. All sketches must have the same precision. This function is an aggregate function and operates on a column. Syntax
SQL
HLL_SKETCH_UNION(agg_sketch_col)
ArgumentsData TypeDescription
agg_sketch_colHLL_SKETCH<log2k>A column that contains multiple sketches to be merged into one sketch.
Example This example performs two merges. First, this code uses HLL_SKETCH_CREATE as a window aggregate to generate 100 sketches that contain one value in each sketch. The code merges the 100 sketches into nine sketches, and then the final SELECT statement merges the nine sketches into a single sketch. Also, the example uses the HLL_SKETCH_GET_ESTIMATE function to retrieve the estimated distinct count from the merged sketches.
SQL
WITH hll_sketch AS (
    SELECT MOD(c1, 9) AS m,
        HLL_SKETCH_CREATE(c1) OVER (PARTITION BY c1) AS sketch
    FROM sys.dummy100
),
hll_merged AS (
    SELECT HLL_SKETCH_UNION(hll_sketch.sketch) AS sketch
    FROM hll_sketch
    GROUP BY m
)
SELECT HLL_SKETCH_GET_ESTIMATE(HLL_SKETCH_UNION(hll_merged.sketch))
FROM hll_merged;
Output: 97

HLL_SKETCH_UNION (scalar function)

Merges two sketches to a new combined sketch. This function is a scalar function and operates row-wise. The scalar function merges two sketch columns with heterogeneous precisions into a sketch with the lower of the two precisions. Syntax
SQL
HLL_SKETCH_UNION(sketch_col1, sketch_col2)
ArgumentsData TypeDescription
sketch_col1HLL_SKETCH<log2k>A column of HLL sketches
sketch_col2HLL_SKETCH<log2k>A column of HLL sketches
Example This example creates two sketches from the dummy table. The first uses the default precision of 11, and the second uses a precision of 14. The second sketch is offset by 1000, so there are 2000 unique values being sketched overall. The query merges the two sketches using the two argument scalar union, which combines them into a new sketch. This unified sketch has a precision of 11, the lower of the two input precisions. The query then decodes the sketch, which has the expected result of 2000 ± 92.
SQL
SELECT HLL_SKETCH_GET_ESTIMATE(
        HLL_SKETCH_UNION(
            HLL_SKETCH_CREATE(c1),
            HLL_SKETCH_CREATE((c1 + 1000)::INT, 14)
        )
    ) as sketch_col
FROM sys.dummy1000;
Output: 2000 ± 92

HLL_SKETCH_GET_ESTIMATE

The HLL_SKETCH_GET_ESTIMATE scalar function converts a sketch into a distinct count estimate of a sketch value. Returns the distinct count estimate as a BIGINT. To return a distinct count across a column of sketch values, you should first merge the sketches by using the HLL_SKETCH_UNION function. Syntax
SQL
HLL_SKETCH_GET_ESTIMATE(sketch)
ArgumentsData TypeDescription
sketchHLL_SKETCH<log2k>A sketch value to convert to an approximate distinct count.
Example
SQL
WITH hll_sketch AS (
    SELECT MOD(c1, 9) as m,
        HLL_SKETCH_CREATE(c1) OVER (PARTITION BY c1) AS sketch
    FROM sys.dummy100
),
hll_merged AS (
    SELECT HLL_SKETCH_UNION(hll_sketch.sketch) AS sketch
    FROM hll_sketch
    GROUP BY m
)
SELECT HLL_SKETCH_GET_ESTIMATE(HLL_SKETCH_UNION(hll_merged.sketch))
FROM hll_merged;
Output: 97

HLL_SKETCH_GET_ESTIMATE_BOUND

The HLL_SKETCH_GET_ESTIMATE_BOUND scalar function takes a HLL_SKETCH column or an integral log2k literal value and returns the resulting bounding 95-percent confidence interval error proportion as a DOUBLE. If you provide an integral log2k value, it must be an integral literal and cannot be a column. Syntax
SQL
HLL_SKETCH_GET_ESTIMATE_BOUND(sketch_or_integral)
ArgumentsData TypeDescription
sketch_or_integralHLL_SKETCH<log2k>
or integral literal
A sketch value, column, or an integral literal.
Example
SQL
SELECT HLL_SKETCH_GET_ESTIMATE_BOUND(HLL_SKETCH_CREATE(c1, 14))
FROM sys.dummy100;
Output: 0.0163

HLL_SKETCH_TO_STRING

The HLL_SKETCH_TO_STRING scalar function takes a HLL_SKETCH column or value and returns a string summary of the sketch. Syntax
SQL
HLL_SKETCH_TO_STRING(sketch)
ArgumentsData TypeDescription
sketchHLL_SKETCH<log2k>A sketch value or column to summarize as a string.
Example
SQL
SELECT HLL_SKETCH_TO_STRING(HLL_SKETCH_CREATE(c1)) AS summary
FROM sys.dummy10000;
Output:
Text
### HLL SKETCH SUMMARY:
Log2k                   : 11
Lower bound             : 9540.000000
Estimate                : 10000
Upper bound             : 10460.000000
Number of values seen   : 10000
Minimum index           : 114
Value at minimum index  : 0
Maximum index           : 1892
Value at maximum index  : 14
This summary contains this information:
  • The log2k value. For more information, see HLL Accuracy with log2k.
  • The lower, estimate, and upper bound of the sketch.
  • The number of values seen in the sketch.
  • The minimum and maximum bucket indexes and their values in the sketch.
The output fields in this summary are for internal purposes, except for the log2k value.

Bibliography

Heule, Stefan, Marc Nunkesser, and Alex Hall. “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.” In Proceedings of the EDBT 2013 Conference. Genoa, Italy, 2013. Aggregate Functions Generate Tables Using sys.dummy
Last modified on May 21, 2026