Exabyte Scalability Design
Various system components in have each been designed to scale not only with hardware capabilities as technology progresses, but also in number of nodes storing and processing data. Like any distributed system, issues of fault tolerance, availability, and communication all become increasingly challenging as the number of units increases.
The Ocient scalability approach can be broadly described as deriving from clusters of clusters: smaller, locally connected clusters that manage their own availability and local state independently and without consideration to any other clusters in the system. The design goal of this is to minimize global state and global synchronization in any aspect of query processing, fault management, data loading, or data ownership.
There are different kinds of clusters: metadata clusters, storage clusters, VM clusters, and loading clusters. An Ocient deployment can contain zero of more of each. With respect to scalability, each cluster runs its own Raft-based consensus protocol and maintains its own local shared state with it. Individual nodes can belong to multiple clusters and participate in multiple disjoint consensus protocols at the same time, but system scalability derives from the fact that no node has any global view of system state, and indeed, no such global state exists.
This allows Ocient to quantify the upper bound on the N parameter described in the erasure coding section. Recall that N is the number of segments in a segment group. In a storage cluster, segment ownership and availability is confined to and maintained in the cluster-local consensus protocol. A necessary implication of data management rules for atomic (but still local) state changes is that a particular segment group must be stored entirely within one storage cluster. And because the Ocient system cannot place two segments from the same group onto the same node, the value N cannot exceed the number of nodes in a target storage cluster. This finally means that the upper bound on N is the practical limit on the consensus protocol implementation, coupled with the design goal of not allowing clusters to be "too big": somewhere in the range of 20 - 40, depending on deployment and schema details.
Query execution necessarily spans multiple VM and storage clusters. A key phase of query processing, prior to any actual computation is the query probe, which establishes the query tree to be used for that particular query. This tree fixes, for the particular execution with which it is associated, the set of nodes that will provide which segments across all storage clusters, as well as the set of nodes that will co-participate in the distributed execution of the query. The former is important because as nodes and drives come and go, and as new data is loaded and old data is removed, it is imperative that all segments that should be included in a query actually are included exactly once. The latter is important because each node might need to know exactly which subset of peers in the network are performing which aspects of query execution. In both cases, the probe phase enables query execution to proceed without any global coordination or synchronization.
This graphic shows a query execution tree, which can scale up by increasing the number of clusters (represented horizontally in the graphic).
At the top level of the query execution tree is the root operator, which has zero or more additional plan levels below it, depending on the complexity of the query. Each circle represents individual nodes, which provide segments across the storage clusters or specific tasks for the execution of the query.
Vertically, the graphic shows different plan levels that fan out among the nodes for the stages of the query plan, actions such as transformations, sorting, grouping or other operations. The leaf level at the bottom of the graphic represents the I/O operation that reads the queried data from storage.
When the system initiates query execution, each cluster executes its operations in parallel. For the most part, clusters operate in isolation unless they need to communicate for operations such as a join or shuffle.
When referring to the query plan, each level is delineated by the presence of a special gather operator that represents the network boundary between levels. All the plan operators between root and the first gather operator are said to execute at level 1. Any plan operators that exist between the first gather operator and the second gather operator execute at level 2, and so on, until the final gather operator and the set of operators below it until the leaf I/O operators that represent the leaf level.
When referring to the query tree, each gather operator represents a fan-out from one level N gather operator to some fixed number of N+1 nodes. During execution a given level N node is responsible for coordinating and amalgamating partial query results from the level N+1 nodes below it. By definition there is exactly one level 1 node in a query, fanning out to some number of level 2 nodes, each of which itself fans out to some number of level 3 nodes, and so on. The fan-out factor is closely aligned with the cluster sizes in use.
In both cases, the tree structure promotes scalable query execution in a fault tolerant way.
Query Ocient