SQL Reference

HyperLogLog Functions

the hyperloglog (hll) sketch family of functions provide an approximate count of the number of unique elements in one or more columns hll functionality is similar to running a query using a count(distinct col) clause and uses the same mechanisms as the aggregate functions docid 72ic0blcf uqzlzeq1 j5 approx count distinct the tradeoff for hll sketches is accuracy for further explanation of this tradeoff in accuracy, see the {{apache}} datasketches documentation each sketch is an approximate and compact representation of the original data, and it introduces a small margin of error the {{ocient}} hll implementation is based on the algorithm outlined in the research paper hyperloglog in practice this implementation uses log2k = 11 by default, which provides a percent error of approximately 5% at the 95% confidence interval hll sketch functions in the sql statement sketches are useful on aggregated tables that require estimates of distinct counts this example creates a table for the new sketch columns ip address sketch and user id sketch as part of a create table as select statement the new table inserts values from the master agg table table create table my sketch table ( col a int, col b varchar(255), ip address sketch hll sketch(11), user id sketch hll sketch(11) ) as ( select col a, col b, hll sketch create(ip address) as ip address sketch, hll sketch create(user id) as user id sketch from master agg table group by 1,2); after the database creates the new table, you can query for the approximate count of distinct values this example merges the sketch columns by using an hll sketch union function after the database merges the columns, the database converts sketch values to a distinct count estimate by using the hll sketch get estimate function select hll sketch get estimate(hll sketch union(ip address sketch)) as ip address sketch, hll sketch get estimate(hll sketch union(user id sketch)) as user id sketch from my sketch table; hll accuracy with log2k parameter in addition to the default log2k = 11 implementation, you can specify a log2k value in the range of \[10, 16] when creating an hll sketch this parameter controls the number of buckets used in the hll algorithm, which controls the accuracy of the sketch upper bounded by 1 04 / sqrt(k) the tradeoff with larger log2k values is that the sketch size grows exponentially see this table for specifics on how the precision values affect size and accuracy precision uncompressed size 95% ci 10 1032 b ±6 50% 11 (default) 2056 b ±4 60% 12 4104 b ±3 25% 13 8200 b ±2 30% 14 16392 b ±1 63% 15 32776 b ±1 15% 16 65544 b ±0 81% by default, log2k values larger than 11 are compressed on disk using zstd compression when creating or referencing a hll sketch column in a create table , ctas or ias statement, you can use the type alias hll sketch(log2k) instead of the internal hash((2^log2k) + 8) this type alias is supported in export statements as well example in this example, a create table statement creates two hll sketch columns with specified log2k precision values sketch a has a precision value of 10 , meaning it has low precision and low storage requirements sketch b has a precision value of 15 , meaning it has high precision and high storage requirements note that the precision values specified in both hll sketch create statements match the precision values in the create table columns create table my sketch table ( sketch a hll sketch(10), sketch b hll sketch(15) ) as ( select hll sketch create(col a, 10) as sketch a, hll sketch create(col b, 15) as sketch b from master agg table group by 1,2); supported hll sketch functions hll sketch create creates an hll sketch from the data on a specified aggregated column returns a hash((2^log2k) + 8) data representation of the sketch that you can store in a separate column the hll algorithm depends on the hash value of its input when merging multiple sketches, it is important that each sketch derives from the same type in order to get accurate results 1 bigint might hash differently than 1 int , and thus, the hll algorithm might not view these two hashes as referring to the same value 1 for accurate results, users should cast numeric columns or expressions to a consistent type before they are used in a hll sketch create function, especially if they will be merged later on syntax hll sketch create(agg col, \[ log2k ] ) arguments data type description agg col all data types are supported an aggregated column for use when you create a sketch log2k integral literal optional this parameter controls the number of buckets in the hll sketch, which affects the accuracy and storage of the sketch this value must be an integer literal in the range of \[10, 16] it cannot be a reference to a column by default, log2k values larger than 11 are compressed on disk using zstd compression for details about this parameter, see hyperloglog functions docid\ zbvl 3nxtbsgnxvinaufl example this example uses hll sketch create as a window aggregate to generate 100 sketches that contain one value in each sketch select mod(c1, 9) as m, hll sketch create(c1, 12) over (partition by c1) as sketch from sys dummy100; hll sketch union (aggregate function) merges multiple sketches in a single column into a unified sketch all sketches must have the same precision this function is an aggregate function and operates on a column syntax hll sketch union(agg sketch col) arguments data type description agg sketch col hll sketch\<log2k> a column that contains multiple sketches to be merged into one sketch example this example performs two merges first, this code uses hll sketch create as a window aggregate to generate 100 sketches that contain one value in each sketch the code merges the 100 sketches into nine sketches, and then the final select statement merges the nine sketches into a single sketch also, the example uses the hll sketch get estimate function to retrieve the estimated distinct count from the merged sketches with hll sketch as ( select mod(c1, 9) as m, hll sketch create(c1) over (partition by c1) as sketch from sys dummy100 ), hll merged as ( select hll sketch union(hll sketch sketch) as sketch from hll sketch group by m ) select hll sketch get estimate(hll sketch union(hll merged sketch)) from hll merged; output 97 hll sketch union (scalar function) merges two sketches to a new combined sketch this function is a scalar function and operates row wise the scalar function merges two sketch columns with heterogeneous precisions into a sketch with the lower of the two precisions syntax hll sketch union(sketch col1, sketch col2) arguments data type description sketch col1 hll sketch\<log2k> a column of hll sketches sketch col2 hll sketch\<log2k> a column of hll sketches example this example creates two sketches from the dummy table the first uses the default precision of 11, and the second uses a precision of 14 the second sketch is offset by 1000, so there are 2000 unique values being sketched overall the query merges the two sketches using the two argument scalar union, which combines them into a new sketch this unified sketch has a precision of 11, the lower of the two input precisions the query then decodes the sketch, which has the expected result of 2000 ± 92 select hll sketch get estimate( hll sketch union( hll sketch create(c1), hll sketch create((c1 + 1000) int, 14) ) ) as sketch col from sys dummy1000; output 2000 ± 92 hll sketch get estimate the hll sketch get estimate scalar function converts a sketch into a distinct count estimate of a sketch value returns the distinct count estimate as a bigint to return a distinct count across a column of sketch values, you should first merge the sketches by using the hll sketch union function syntax hll sketch get estimate(sketch) arguments data type description sketch hll sketch\<log2k> a sketch value to convert to an approximate distinct count example with hll sketch as ( select mod(c1, 9) as m, hll sketch create(c1) over (partition by c1) as sketch from sys dummy100 ), hll merged as ( select hll sketch union(hll sketch sketch) as sketch from hll sketch group by m ) select hll sketch get estimate(hll sketch union(hll merged sketch)) from hll merged; output 97 hll sketch get estimate bound the hll sketch get estimate bound scalar function takes a hll sketch column or an integral log2k literal value and returns the resulting bounding 95 percent confidence interval error proportion as a double if you provide an integral log2k value, it must be an integral literal and cannot be a column syntax hll sketch get estimate bound(sketch or integral) arguments data type description sketch or integral hll sketch\<log2k> or integral literal a sketch value, column, or an integral literal example select hll sketch get estimate bound(hll sketch create(c1, 14)) from sys dummy100; output 0 0163 hll sketch to string the hll sketch to string scalar function takes a hll sketch column or value and returns a string summary of the sketch syntax hll sketch to string(sketch) arguments data type description sketch hll sketch\<log2k> a sketch value or column to summarize as a string example select hll sketch to string(hll sketch create(c1)) as summary from sys dummy10000; output \### hll sketch summary log2k 11 lower bound 9540 000000 estimate 10000 upper bound 10460 000000 number of values seen 10000 minimum index 114 value at minimum index 0 maximum index 1892 value at maximum index 14 this summary contains this information the log2k value for more information, see hyperloglog functions docid\ zbvl 3nxtbsgnxvinaufl the lower, estimate, and upper bound of the sketch the number of values seen in the sketch the minimum and maximum bucket indexes and their values in the sketch the output fields in this summary are for internal purposes, except for the log2k value bibliography heule, stefan, marc nunkesser, and alex hall “hyperloglog in practice algorithmic engineering of a state of the art cardinality estimation algorithm ” in proceedings of the edbt 2013 conference genoa, italy, 2013 related links aggregate functions docid 72ic0blcf uqzlzeq1 j5 generate tables using sys dummy docid\ udbs dxonkysghxlhtbbr