The HyperLogLog (HLL) sketch family of functions provides an approximate count of the number of unique elements in one or more columns. HLL functionality is similar to running a query using aDocumentation Index
Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
Use this file to discover all available pages before exploring further.
COUNT(DISTINCT col) clause and uses the same mechanisms as the aggregate function APPROX_COUNT_DISTINCT.
The tradeoff for HLL sketches is accuracy. For further explanation of this tradeoff in accuracy, see the DataSketches documentation. Each sketch is an approximate and compact representation of the original data, and it introduces a small margin of error. The HLL implementation is based on the algorithm outlined in the research paper HyperLogLog in Practice. This implementation uses log2k = 11 by default, which provides a percent error of approximately 5% at the 95% confidence interval.
HLL Sketch Functions in the SQL Statement
Sketches are useful on aggregated tables that require estimates of distinct counts. This example creates a table for the new sketch columnsip_address_sketch and user_id_sketch as part of a CREATE TABLE AS SELECT statement. The new table inserts values from the master_agg_table table.
SQL
HLL_SKETCH_UNION function. After the database merges the columns, the database converts sketch values to a distinct count estimate by using the HLL_SKETCH_GET_ESTIMATE function.
SQL
HLL Accuracy with log2k Parameter
In addition to the defaultlog2k = 11 implementation, you can specify a log2k value in the range of [10, 16] when creating an HLL Sketch.
This parameter controls the number of buckets used in the HLL algorithm, which controls the accuracy of the sketch upper bounded by 1.04 / sqrt(k). The tradeoff with larger log2k values is that the sketch size grows exponentially. See this table for specifics on how the precision values affect size and accuracy.
| Precision | Uncompressed Size | 95% CI |
|---|---|---|
| 10 | 1032 B | ±6.50% |
| 11 (default) | 2056 B | ±4.60% |
| 12 | 4104 B | ±3.25% |
| 13 | 8200 B | ±2.30% |
| 14 | 16392 B | ±1.63% |
| 15 | 32776 B | ±1.15% |
| 16 | 65544 B | ±0.81% |
log2k values larger than 11 are compressed on disk using ZSTD compression.
When creating or referencing a HLL Sketch column in a
CREATE TABLE, CTAS or IAS statement, you can use the type alias HLL_SKETCH(log2k) instead of the internal HASH((2^log2k) + 8).This type alias is supported in EXPORT statements as well.log2k precision values.
sketch_ahas a precision value of10, meaning it has low precision and low storage requirements.sketch_bhas a precision value of15, meaning it has high precision and high storage requirements.
SQL
Supported HLL Sketch Functions
HLL_SKETCH_CREATE
Creates an HLL sketch from the data on a specified aggregated column. Returns aHASH((2^log2k) + 8) data representation of the sketch that you can store in a separate column.
The HLL algorithm depends on the hash value of its input. When merging multiple sketches, it is important that each sketch derives from the same type in order to get accurate results.
1::BIGINT might hash differently than 1::INT, and thus, the HLL algorithm might not view these two hashes as referring to the same value 1.For accurate results, users should cast numeric columns or expressions to a consistent type before they are used in a HLL_SKETCH_CREATE function, especially if they will be merged later on.SQL
| Arguments | Data Type | Description |
|---|---|---|
agg_col | All data types are supported. | An aggregated column for use when you create a sketch. |
log2k | Integral literal | Optional. This parameter controls the number of buckets in the HLL Sketch, which affects the accuracy and storage of the sketch. This value must be an integer literal in the range of [10, 16]. It cannot be a reference to a column. By default, log2k values larger than 11 are compressed on disk using ZSTD compression.For details about this parameter, see HLL Accuracy with log2k Parameter. |
HLL_SKETCH_CREATE as a window aggregate to generate 100 sketches that contain one value in each sketch.
SQL
HLL_SKETCH_UNION (aggregate function)
Merges multiple sketches in a single column into a unified sketch. All sketches must have the same precision. This function is an aggregate function and operates on a column. SyntaxSQL
| Arguments | Data Type | Description |
|---|---|---|
agg_sketch_col | HLL_SKETCH<log2k> | A column that contains multiple sketches to be merged into one sketch. |
HLL_SKETCH_CREATE as a window aggregate to generate 100 sketches that contain one value in each sketch. The code merges the 100 sketches into nine sketches, and then the final SELECT statement merges the nine sketches into a single sketch. Also, the example uses the HLL_SKETCH_GET_ESTIMATE function to retrieve the estimated distinct count from the merged sketches.
SQL
97
HLL_SKETCH_UNION (scalar function)
Merges two sketches to a new combined sketch. This function is a scalar function and operates row-wise. The scalar function merges two sketch columns with heterogeneous precisions into a sketch with the lower of the two precisions. SyntaxSQL
| Arguments | Data Type | Description |
|---|---|---|
sketch_col1 | HLL_SKETCH<log2k> | A column of HLL sketches |
sketch_col2 | HLL_SKETCH<log2k> | A column of HLL sketches |
SQL
2000 ± 92
HLL_SKETCH_GET_ESTIMATE
TheHLL_SKETCH_GET_ESTIMATE scalar function converts a sketch into a distinct count estimate of a sketch value. Returns the distinct count estimate as a BIGINT.
To return a distinct count across a column of sketch values, you should first merge the sketches by using the HLL_SKETCH_UNION function.
Syntax
SQL
| Arguments | Data Type | Description |
|---|---|---|
sketch | HLL_SKETCH<log2k> | A sketch value to convert to an approximate distinct count. |
SQL
97
HLL_SKETCH_GET_ESTIMATE_BOUND
TheHLL_SKETCH_GET_ESTIMATE_BOUND scalar function takes a HLL_SKETCH column or an integral log2k literal value and returns the resulting bounding 95-percent confidence interval error proportion as a DOUBLE. If you provide an integral log2k value, it must be an integral literal and cannot be a column.
Syntax
SQL
| Arguments | Data Type | Description |
|---|---|---|
sketch_or_integral | HLL_SKETCH<log2k>or integral literal | A sketch value, column, or an integral literal. |
SQL
0.0163
HLL_SKETCH_TO_STRING
TheHLL_SKETCH_TO_STRING scalar function takes a HLL_SKETCH column or value and returns a string summary of the sketch.
Syntax
SQL
| Arguments | Data Type | Description |
|---|---|---|
sketch | HLL_SKETCH<log2k> | A sketch value or column to summarize as a string. |
SQL
Text
- The
log2kvalue. For more information, see HLL Accuracy with log2k. - The lower, estimate, and upper bound of the sketch.
- The number of values seen in the sketch.
- The minimum and maximum bucket indexes and their values in the sketch.
The output fields in this summary are for internal purposes, except for the
log2k value.
