- The search space for the minimization problem is infinite. This means that any algorithm must eventually converge to some answer but do so in a reasonable amount of time. It should not take 100 seconds to shave 1 second off of a 2-second query.
-
Plans are compared to each other based on some metric like execution time, but this metric is not trivial to build. Having a metric function
M(p)that returns a useful number and is cheap enough to compute is a challenge in and of itself. Adding to this challenge is the additional considerations beyond measuring the computational cost of plan execution, namely disk and network I/O costs as well.
-
Multi-dimensional probability density functions. If a table has
jcolumns, then it can make use of ajdimensional “table space” functionf(v0, v1, v2, ..., vj)that maps a particular set of column values to the probability of those values. The Ocient System can then compute approximate row counts in the table by taking integrals offin some hyper region of the table space. -
HyperLogLog. In addition to row counts, the ability to estimate the number of distinct values is important as well. The basic idea behind HyperLogLog is that if a set has a uniformly randomized set of values, and
nis the maximum number of leading zeros in the binary representations of all values in the set, then a good estimate of the set’s cardinality is2^n. Here, each column is considered a set, and this statistical approach can estimate the number of distinct values for each column.
T -> P0 -> P1 -> P2 ..., then the raw column cardinality estimates of the columns in T might be of limited use in calculating a metric on P2 because P0 and P1 can drastically alter the statistics of the original table.
In order to account for this, the optimizer must have mathematical models of each operator that can be applied to input statistics to produce output statistics. Then, successive application of operators can be used to build up plan metrics comprised of per-operator metrics. Clearly, successive approximation incurs increasing error. But an interesting property of using a metric in this manner is that the exact error is not important. Instead, all that is necessary is that the accumulated errors in any estimation must not invalidate the relative comparison between two plans.
The topic of plan optimization is large and complicated, and the Ocient optimizer is a key and integral component of its approach. It incorporates not only detailed implementations but also an integrated view of the distributed nature of the query execution against NVMe drives. Each iterative version of the optimizer also includes an ever-increasing set of optimizations, giving it a sophisticated tool chest of transforms to apply to plans as well.

