HyperLogLog Functions
The HyperLogLog (HLL) sketch family of functions provide an approximate count of the number of unique elements in one or more columns.
HLL functionality is similar to running a query using a COUNT(DISTINCT col) clause and uses the same mechanisms as the aggregate function APPROX_COUNT_DISTINCT.
The tradeoff for HLL sketches is accuracy. For further explanation of this tradeoff in accuracy, see the DataSketches documentation. Each sketch is an approximate and compact representation of the original data, and it introduces a small margin of error. The HLL implementation is based on the algorithm outlined in the research paper HyperLogLog in Practice. This implementation uses log2k = 11 by default, which provides a percent error of approximately 5% at the 95% confidence interval.
Sketches are useful on aggregated tables that require estimates of distinct counts.
This example creates a table for the new sketch columns ip_address_sketch and user_id_sketch as part of a CREATE TABLE AS SELECT statement. The new table inserts values from the master_agg_table table.
After the database creates the new table, you can query for the approximate count of distinct values.
This example merges the sketch columns by using an HLL_SKETCH_UNION function. After the database merges the columns, the database converts sketch values to a distinct count estimate by using the HLL_SKETCH_GET_ESTIMATE function.
In addition to the default log2k = 11 implementation, users can specify a log2k value in the range of [10, 16] when creating an HLL Sketch.
This parameter controls the number of buckets used in the HLL algorithm, which controls the accuracy of the sketch upper bounded by 1.04 / sqrt(k). The tradeoff with larger log2k values is that the sketch size grows exponentially. See the table below for specifics on how the precision values affect size and accuracy.
Precision | Uncompressed Size | 95% CI |
10 | 1032 B | ±6.50% |
11 (default) | 2056 B | ±4.60% |
12 | 4104 B | ±3.25% |
13 | 8200 B | ±2.30% |
14 | 16392 B | ±1.63% |
15 | 32776 B | ±1.15% |
16 | 65544 B | ±0.81% |
By default, log2k values larger than 11 are compressed on disk using ZSTD compression.
When creating or referencing a HLL Sketch column in a CREATE TABLE, CTAS or IAS statement, you can use the type alias HLL_SKETCH(log2k) instead of the internal HASH((2^log2k) + 8).
This type alias is supported in EXPORT statements as well.
Example
In this example, a CREATE TABLE statement creates two HLL Sketch columns with specified log2k precision values.
- sketch_a has a precision value of 10, meaning it has low precision and low storage requirements.
- sketch_b has a precision value of 15, meaning it has high precision, but also high storage requirements.
Note that the precision values specified in both HLL_SKETCH_CREATE statements match the precision values in the CREATE TABLE columns.
Creates an HLL sketch from the data on a specified aggregated column. Returns a HASH((2^log2k) + 8) data representation of the sketch that you can store in a separate column.
The HLL algorithm depends on the hash value of its input. When merging multiple sketches, it is important that each sketch derives from the same type in order to get accurate results. 1::BIGINT may hash differently than 1::INT. and thus the HLL algorithm may not view these two hashes as referring to the same value, 1.
For accurate results, users should cast numeric columns or expressions to a consistent type before they are used in a HLL_SKETCH_CREATE function, especially if they will be merged later on.
Syntax
Parameter | Data Type | Description |
agg_col | All data types are supported. | An aggregated column for use when you create a sketch. |
log2k | Integral literal | Optional. This parameter controls the number of buckets in the HLL Sketch, which affects the accuracy and storage of the sketch. This value must be an integer literal in the range of [10, 16]. It may not be a reference to a column. By default, log2k values larger than 11 are compressed on disk using ZSTD compression. For more information on this parameter, see HLL Accuracy with log2k Parameter. |
Example
This example uses HLL_SKETCH_CREATE as a window aggregate to generate 100 sketches that contain one value in each sketch.
Merges multiple sketches in a single column into a unified sketch. All sketches must have the same precision. This function is an aggregate function and operates on a column.
Syntax
Parameter | Data Type | Description |
agg_sketch_col | HLL_SKETCH<log2k> | A column that contains multiple sketches to be merged into one sketch. |
Example
This example performs two merges. First, this code uses HLL_SKETCH_CREATE as a window aggregate to generate 100 sketches that contain one value in each sketch. The code merges the 100 sketches into nine sketches, and then the final SELECT statement merges the nine sketches into a single sketch. Also, the example uses the HLL_SKETCH_GET_ESTIMATE function to retrieve the estimated distinct count from the merged sketches.
Output: 97
Merges two sketches to a new combined sketch. This function is a scalar function and operates row-wise. The scalar function merges two sketch columns with heterogeneous precisions into a sketch with the lower of the two precisions.
Syntax
Parameter | Data Type | Description |
sketch_col1 | HLL_SKETCH<log2k> | A column of HLL sketches |
sketch_col2 | HLL_SKETCH<log2k> | A column of HLL sketches |
Example
This example creates two sketches from the dummy table. The first uses the default precision, 11, and the second uses a precision of 14. The second sketch is offset by 1000, so there are 2000 unique values being sketched overall.
The query merges the two sketches using the two argument scalar union, which combines them into a new sketch. This unified sketch has a precision of 11, the lower of the two input precisions.
The query then decodes the sketch, which has the expected result of 2000 ± 92.
Output: 2000 ± 92
The HLL_SKETCH_GET_ESTIMATE scalar function converts a sketch into a distinct count estimate of a sketch value. Returns the distinct count estimate as a BIGINT.
To return a distinct count across a column of sketch values, you should first merge the sketches by using the HLL_SKETCH_UNION function.
Syntax
Parameter | Data Type | Description |
sketch | HLL_SKETCH<log2k> | A sketch value to convert to an approximate distinct count. |
Example
Output: 97
The HLL_SKETCH_GET_ESTIMATE_BOUND scalar function takes a HLL_SKETCH column or an integral log2k literal value and returns the resulting bounding 95-percent confidence interval error proportion as a DOUBLE. If the user provides an integral log2k value, it must be a integral literal and may not be a column.
Syntax
Parameter | Data Type | Description |
sketch_or_integral | HLL_SKETCH<log2k> or integral literal | A sketch value, column, or an integral literal. |
Example
Output: 0.0163
The HLL_SKETCH_TO_STRING scalar function takes a HLL_SKETCH column or value and returns a string summary of the sketch.
Syntax
Parameter | Data Type | Description |
sketch | HLL_SKETCH<log2k> | A sketch value or column to summarize as a string. |
Example
Output:
This summary contains this information:
- The lower, estimate, and upper bound of the sketch.
- The number of values seen in the sketch.
- The minimum and maximum bucket indexes and their values in the sketch.
The output fields in this summary are for internal purposes expect for the log2k value.
Heule, Stefan, Marc Nunkesser, and Alex Hall. “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.” In Proceedings of the EDBT 2013 Conference. Genoa, Italy, 2013.