# Plan Optimization

A foundational element of any high-performance database is its plan optimization engine. Recall that a *query plan* is in many ways similar to a generic computer program. A key distinction, however, is that its machine code is in the language of relational algebra, set operations, I/O, network, and other database primitives. When a database converts SQL queries into a query plan, it must always generate a plan that produces the correct query results. An over-used joke at Ocient is that if we were allowed to return incorrect results, we could make a blazingly fast database by simply returning random numbers for every query. By way of analogy, this correctness requirement is the same as saying that a C++ compiler must always produce x86 machine code that correctly executes the originally provided instructions. But just as a given C++ program can have many different possible representations in x86 machine code (different compilers, optimization levels, etc.), a given SQL query can have many (infinitely many, in fact) different query plans.

In a database, the optimizer's job is to take an already-correct plan that is naively generated from a SQL statement and produce an alternative plan that will emit the same results. This is normally viewed as a minimization problem in which the alternative plan is "less than" the original plan under some metric (usually estimated run time), and database engines have evolved widely varied approaches to solving it. Some challenges to consider:

- The search space for the minimization problem is infinite. This means that any algorithm must eventually converge to some answer, and do so in a reasonable amount of time: it should not spend 100 seconds to shave 1 second off of a 2-second query.
- Plans are compared to each other using some metric like execution time, but this metric is not trivial to build. Designing a metric function M(p) that returns a useful number and is cheap enough to compute is a challenge in and of itself. Adding to this challenge are considerations beyond the computational cost of plan execution, namely disk and network I/O costs as well.
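To make the metric idea concrete, here is a minimal sketch of a function M(p) that folds per-operator CPU, disk, and network estimates into one comparable scalar. The operator fields and resource weights are illustrative assumptions, not Ocient's actual cost model:

```python
from dataclasses import dataclass

@dataclass
class OpEstimate:
    """Estimated per-operator resource usage (illustrative units)."""
    cpu_rows: float    # rows this operator must process
    disk_bytes: float  # bytes read from disk
    net_bytes: float   # bytes shipped over the network

# Hypothetical weights: relative cost of one unit of each resource.
CPU_WEIGHT, DISK_WEIGHT, NET_WEIGHT = 1.0, 50.0, 20.0

def metric(plan: list[OpEstimate]) -> float:
    """M(p): a single scalar used only to *compare* candidate plans."""
    return sum(
        CPU_WEIGHT * op.cpu_rows
        + DISK_WEIGHT * op.disk_bytes
        + NET_WEIGHT * op.net_bytes
        for op in plan
    )

# Two candidate plans for the same query: the second filters early,
# so downstream operators process fewer rows and move fewer bytes.
plan_a = [OpEstimate(1e6, 8e3, 4e3), OpEstimate(1e6, 0, 2e3)]
plan_b = [OpEstimate(1e6, 8e3, 1e2), OpEstimate(1e4, 0, 5e1)]
assert metric(plan_b) < metric(plan_a)  # optimizer would prefer plan_b
```

The important property is not the absolute numbers but that the ordering between plans is trustworthy, which is exactly the point made below about estimation error.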

Most database engines incorporate some sort of statistical approach to the above. Statistics on the tables being queried (calculated either at data load time or periodically refreshed) are used to bound execution and I/O estimates. Ocient's optimizer does this as well. While basic statistics like row counts are certainly considered, more advanced mathematical approaches are used too:

- *Multi-dimensional probability density functions*. If a table has j columns, then we can consider a j-dimensional "table space" function f(v1, v2, ..., vj) that maps a particular set of column values to the probability of those values occurring. We can then compute approximate row counts by integrating f over some hyper-region of the table space.
- *HyperLogLog*. In addition to row counts, the ability to estimate the number of *distinct* values is important as well. The basic idea behind HyperLogLog is that if a set contains uniformly randomized values, and n is the maximum number of leading zeros in the binary representations of all values in the set, then a good estimate of the set's cardinality is 2^n. Here, each column is treated as a set, and this statistical approach estimates the number of distinct values in each column.
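The leading-zeros idea can be sketched with a single-register estimator. Real HyperLogLog hashes each value, splits the hash stream across many registers, and averages their estimates to cut variance; the helper names below are illustrative, not a production implementation:

```python
import hashlib

def _hash64(value: str) -> int:
    """Map a value to a uniformly distributed 64-bit integer."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def leading_zeros(x: int, bits: int = 64) -> int:
    """Count leading zero bits in a bits-wide representation of x."""
    return bits - x.bit_length()

def crude_cardinality_estimate(values) -> int:
    """Flajolet-Martin style single-register estimate: 2^n, where n is
    the maximum leading-zero count observed across all hashed values."""
    n = max(leading_zeros(_hash64(v)) for v in values)
    return 2 ** n

# Duplicates do not change the estimate -- only distinct values matter,
# which is exactly why this works for per-column distinct counts.
vals = [f"user-{i}" for i in range(1000)]
assert crude_cardinality_estimate(vals) == crude_cardinality_estimate(vals * 10)
```

Because hashing makes the estimate insensitive to duplicates, the sketch can be maintained incrementally at load time without ever storing the column's values.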

With such mathematical machinery available, we can see how the beginnings of a metric function for a plan can be built, but a few important details remain. Firstly, while the statistical approaches can make for good estimates, we must be able to map each supported SQL type into the numeric domain. This is straightforward for integral and floating point types, but becomes arbitrary and less obvious for things like variable-length strings and arrays. Secondly, and more importantly, we must also be able to calculate our metric on *sequences* of operators, each of which can be considered a separate transformation of input to output. Put another way, given a sequence of plan operators each successively operating on rows, T -> P0 -> P1 -> P2 ..., the raw column cardinality estimates of the columns in T may be of limited use in calculating a metric on P2, because P0 and P1 can drastically alter the statistics of the original table.
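One common trick for the string case is to treat a fixed-length byte prefix as a big-endian integer, so that lexicographic order maps to numeric order and range-selectivity math can be reused. This is a sketch under that assumption, not necessarily Ocient's encoding:

```python
def string_to_number(s: str, prefix_len: int = 8) -> int:
    """Map a string's first prefix_len bytes to an integer such that
    byte-wise lexicographic order on prefixes matches numeric order.
    Strings sharing the same prefix collapse to the same number."""
    b = s.encode("utf-8")[:prefix_len].ljust(prefix_len, b"\x00")
    return int.from_bytes(b, "big")

# Order is preserved, so an estimate like "fraction of values in
# [lo, hi]" can be computed with the same numeric machinery used
# for integers and floats.
assert string_to_number("apple") < string_to_number("banana") < string_to_number("cherry")
```

The arbitrariness the text mentions shows up in the choices here: the prefix length, the padding byte, and the handling of collation are all judgment calls rather than anything canonical.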

To account for this, the optimizer must have a mathematical model of each operator that can be applied to *input statistics* to produce *output statistics*. Successive application of these models can then build up a plan metric composed of per-operator metrics. Clearly, successive approximation will incur increasing error. But an interesting property of using a metric in this manner is that the exact error is not important. Instead, all that is necessary is that the accumulated errors in any estimation must not invalidate the relative comparison between two plans.
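A minimal sketch of such per-operator models follows, propagating row and distinct-value estimates through the T -> P0 -> P1 chain described above. The uniformity assumption and the two operators shown are illustrative simplifications:

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    rows: float      # estimated row count flowing past this point
    distinct: float  # estimated distinct values (e.g. from HyperLogLog)

def apply_range_filter(stats: ColumnStats, selectivity: float) -> ColumnStats:
    """Model of a range predicate whose selectivity comes from a density
    function; distinct count shrinks proportionally, but never below 1."""
    return ColumnStats(
        rows=stats.rows * selectivity,
        distinct=max(1.0, stats.distinct * selectivity),
    )

def apply_equality_filter(stats: ColumnStats) -> ColumnStats:
    """Model of 'col = constant': assuming values are uniformly
    distributed, each distinct value accounts for rows/distinct rows,
    and exactly one distinct value survives the filter."""
    return ColumnStats(rows=stats.rows / stats.distinct, distinct=1.0)

# T -> P0 (range filter, 10% selective) -> P1 (equality filter).
t = ColumnStats(rows=1_000_000, distinct=50_000)
after_p0 = apply_range_filter(t, 0.10)      # 100,000 rows, 5,000 distinct
after_p1 = apply_equality_filter(after_p0)  # 20 rows, 1 distinct
assert after_p1.rows == 20.0
```

Note how the raw statistics of T (a million rows) would badly overestimate P1's cost; feeding P1 the *output* statistics of P0 is what makes the chained estimate usable.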

The topic of plan optimization is large and complicated, and the Ocient optimizer is a key and integral component of its approach. It incorporates not only detailed implementations of the above but also an integrated view of the distributed nature of query execution against NVMe drives. Each successive version of the optimizer includes an ever-growing set of optimizations, giving it a sophisticated tool chest of transforms to apply to plans.