Compute-Adjacent I/O on Large Working Sets
The storage of the data warehouse and I/O layer is built using a (CASA). When contrasted against modern cloud-based database engines, this means that the majority of the data to be analyzed is stored on NVMe SSDs directly attached to a significant fraction of the computational capacity of the system using the local PCIe busses of the nodes that make up the foundation level of the database engine.
This approach, which is similar to a traditional MPP architecture, is designed to capitalize on the performance characteristics of NVMe SSDs when executing queries against a large working set of data. It is important to note that when executing a query, a given record could be included as part of the results but I/O and computational effort must be undertaken to make that determination. A working set, therefore, is defined as the set of records stored in the database that are considered as part of query execution, even if those records are eventually filtered out or not returned. A "large" working set is one that is substantially larger, even by multiple orders of magnitude, than the available DRAM on the set of nodes performing the computations. When this is the case, remote-object database engines that store the majority of their data in a separate storage layer are substantially affected by their inability to keep all the potentially-addressed records near and accessible to computation at the same time. Depending on the query load and their exact engine design and scheduling, query execution times negatively suffer from both simple bottlenecks due to network link capacity and "cache thrashing" that produces expensive extra drive I/O.
Large working sets can be realized in two major ways:
- Queries that regularly consider a substantial fraction of the entire data set. For example, regular querying or analysis of all data spanning years can mean a single query addresses most of the records in the database.
- A large number of smaller queries that all consider a disjoint subset of the whole data set. When these queries are executed in parallel the sum total of the disjoint sets of addressed records can also approach most of the records in the database.
The database engine in the system is a world-class relational database engine that can go toe-to-toe with other modern engines on small working sets but won’t necessarily out-perform them. After all, well known algorithms like sorting and hashing can only be made so fast when everything is in memory. But Ocient’s performance distinctions derive from its ability to scale CASA to keep all of the data close to the computation.
One of the biggest drivers for the development of the was the ongoing shift from spinning HDDs to NVMe SSDs for data storage. SSD prices continue to fall, relative to HDDs, and there remains little doubt that they will eventually supplant HDDs for all but esoteric or extreme bulk storage applications.
To ride this storage wave, the Ocient engine has been built to minimize the cost of using these drives as the primary storage tier of the engine while maximizing their performance. There are two major aspects to cost management:
- Minimization of the total number of stored bytes needed to achieve reliable access to data in a performant manner. This is achieved using a novel and tunable erasure coding scheme in conjunction with a variety of compression techniques. This is detailed more thoroughly in the following sections.
- Consideration for endurance management in I/O operations against the NVMe SSDs. The Ocient storage and execution engine is designed to minimize the number of I/O write operations undertaken against the storage media in order to maximize its lifetime and reduce drive replacement rate due to endurance wear. At a high level, this usually means minimizing small or random writes to the drives, and instead favoring larger and sequential ones. To achieve this, the storage layer uses a specially-written and database-aware file system underneath storage structures designed to capitalize on the large volume of low-mutability data being stored. Using these mechanisms, the storage layer is able to collate writes in a manner that produces relatively large (from the drive’s point of view) ranges of contiguous blocks, ordered in the hundreds of KiB to hundreds of MiB in size.
Of particular note in the storage and I/O layer of the Ocient Hyperscale Data Warehouse is the novel use of a tunable erasure coding scheme for data storage.
You can think of erasure coding as a generalization of simple XOR parity calculations. In a simple error correction scheme, you can apply the XOR operation on a sequence of bits: B0, B1, B2, ..., Bn such that: B0 XOR B1 XOR ... XOR Bn = Bx. An easily provable property of the XOR operation is that for any single Bi with 0 ≤ i ≤ n, that B0 XOR ... XOR Bi-1 XOR Bi+1 XOR ... XOR Bn XOR Bx = Bi. You can determine any single bit by performing the XOR operation on the other original bits and the extra bit together. In the context of error correction, this means that for the cost of an extra bit of storage (or memory) a system is able to detect, and even correct for, the loss of a single bit.
There are a variety of error correction schemes described in computer science and found in the industry. In the application of this concept to database storage, the Ocient storage layer extends it from individual bits, and beyond bytes, to an error correction unit of the 4KiB block (which is notably aligned with the I/O resolution of standard NVMe SSD drives). It also further extends from enabling single loss or failure to a fully tunable approach that uses XOR, P+Q, or Cauchy-Reed-Solomon coding that allows the user to specify exactly how many faults: one, two, or K, respectively. Like the simple XOR formulation, these schemes utilize mathematical computations (in this case, polynomials over a finite/Galois field) to compute additional blocks of data that enable reconstructing a configurable number of missing blocks.
In database and storage systems that utilize replication to provide data reliability and availability, it is common to describe storage in terms of the number of replicas (e.g., 3x replication allows for up to three failures). The obvious implication of replication is that each replica is a full copy of the original data and represents a multiple of the minimal data storage required. In an erasure coding scheme, the overhead for K tolerable faults is calculated as (N + K)/N, where N is the total width and K is parity width. If N is 10 (take that as a given for the moment) then to support a maximum of 2 faults (the equivalent of 3x replication, which includes the original data and two copies) the system needs only 12 / 10 = 1.2x times the original storage size, contrasted with the 3x of the replication scheme.
The utilization of space-efficient erasure coding can be found in a variety of storage and transmission systems but, in a classic space-time trade off, the space savings come at a computational cost. The approach taken in the Ocient storage layer is novel, at least for databases, and is designed to mitigate the computational overhead of erasure coding.
If one considers a "file" of some size as a unit of storage, a basic approach for applying an erasure coding scheme to provide loss tolerance to this file would be to break it into N chunks and perform the requisite computation to produce an additional K chunks. At read time, the system would select an available subset of the N+K chunks (possibly preferring the original N) and perform the rebuild computation to produce the original file for access. In this framing the unit of fault tolerance is the individual file, and rebuilding computational overhead can be found in the read path.
However, in a database designed for OLAP-style queries, the potential for the overhead of any rebuilding computation in the read path even when no drives or nodes are offline is troublesome. Additionally, when the system is in a degraded state, the potential need to rebuild an entire file only to retrieve a small subset of its data is expensive as well. The Ocient storage layer, by directly implementing the erasure coding scheme and integrating with it, avoids these issues in two major ways.
The first advantage is the unit of storage. Ocient uses storage units called segments, which are not chunked into N pieces. Instead, N separate and similarly sized segments are considered together as a unit called a segment group. The erasure coding scheme is then applied across the segments to produce K additional "segments" worth of redundancy information which, in a very loose application of the term, we call parity data.
This graphic shows a high-level illustration of how Ocient stores each data segment. Segments contain all the data needed for a subset of table rows, including the actual column data as well as indexes, metadata, and statistical data such as probability density functions (PDFs) and count distinct estimates (CDEs). In addition, the erasure coding scheme also includes parity data for other segments in the same segment group.
The newly generated parity data is then broken up and evenly distributed amongst the original segments (in a particular and calculated way) such that each segment is now approximately (N+K)/N times larger than it originally was, and any missing segment from a particular segment group can be reconstructed when one has access to any set of N segments from the same group. The Ocient storage layer ensures that each segment from a group is stored on a separate physical node, thereby providing data availability in the event of up to K node losses.
Another important feature of this scheme is due to the fact that a given record is always stored in a single segment. Because the segments themselves are not chunked, any given record is guaranteed to exist intact on some node/drive, and readable with no overhead incurred when that drive is online. Put another way, in the non-degraded state, the reliability scheme implemented incurs no computational cost.
The second advantage of the Ocient system is each segment acts as a self-contained unit of information, containing not only database records but also metadata and indexes on those records. These sub-units within the segment, called segment parts are individually addressable and independently rebuildable with 4KiB resolution. The implication of this is that in a degraded state (where a node or drive is missing) it is not strictly necessary to rebuild all missing segments in order to service a given query. Instead the storage layer is capable of addressing and rebuilding only the parts and even single blocks of parts required for it. This often greatly reduces the overhead and performance impact of a degraded state.
Earlier, it was taken as a given that N is equal to 10. The exact choice of this value is somewhat arbitrary but is generally bounded by system-level consensus constraints as well as the calculated cost of the network transit during rebuild operations. As N increases, the storage overhead for the data reliability for some given K decreases because (N+K)/N approaches 1. However, the total amount of information that must be considered and moved over the network for, as well as the actual computational cost of, a rebuild increases. Real-world usage has found that an N somewhere between 8 and 12 provides a good tradeoff between these two opposing considerations.
A final important aspect of the Ocient storage and I/O layer is the database engine’s direct interaction with the SSDs storing segments. The term "direct" in this context can be contrasted against how a more standard approach might look. In many database systems, record storage is achieved using a set of files stored on a disk in some file system (such as EXT4 or NTFS or ZFS). And generally speaking, reading and writing of those files will be performed using kernel-provided system calls/services like read() , write(), and mmap(). Obviously this is a fine approach in general, but when performance is of utmost importance, the overhead of the system calls, memory copies to and from kernel memory, as well as the implementation of the underlying file systems all have detrimental, and sometimes unpredictable, effects. For example, it is well known that the kernel’s EXT4 implementation takes global locks during certain metadata operations which can have a significant detrimental effect on parallelism and throughput.
Some databases eschew standard files and perform some form of I/O directly against kernel block devices in order to avoid file system overhead. This is surely an improvement, but the kernel’s block scheduler, extra memory copies to/from kernel memory, and the overhead of system calls still exist.
In order to overcome these performance impacts, the Ocient storage layer directly interacts with the NVMe SSDs that store segments in two important ways:
- Each NVMe drive is decoupled from the standard Linux block device driver and instead attached to a device driver that enables low-level interaction with the device using the PCIe protocol. The implications of this approach are that the storage layer and database engine can directly communicate the NVMe protocol to the devices and do so using user memory. This completely eliminates all system calls and memory copies required to do I/O with the NVMe drives. The database provides direct pointers to its own virtual memory for I/O operations and the drives directly write and read to and from that memory. An interesting result from this approach is that the Ocient storage layer and engine were entirely unaffected by the Linux kernel’s mitigations for the Spectre and Meltdown classes of attacks. This is because the primary mechanism of the mitigation resulted in a substantial increase in the overhead of the user-to-kernel transition associated with system calls. Other database engines with more standard approaches saw in many cases appreciable performance degradation.
- Modern NVMe SSDs are capable of a million 4KiB random reads per second. This is strikingly and substantially more than HDDs (where even the best drives can produce only hundreds). However, for most economical SSDs, these rates can only be achieved when large numbers of parallel in-flight requests are maintained over a substantial fraction of the drive’s address space. The Ocient storage layer implementation, in conjunction with specially tailored data and metadata structures, not only promote this parallelism to be evoked during query execution, but also are written to minimize "latency bubbles" by always having a replacement I/O operation teed up for a drive when it completes a previous one.
As a column-oriented data warehouse, Ocient is able to leverage per-column data locality and similarity to achieve excellent data compression.
- For fixed-length columns, a combination of delta-delta compression, run-length encoding, and null elimination provide significant storage reduction for many datatypes.
- For variable-length columns, the Ocient system can apply per-row compression to reduce the size of large values.
- Both fixed and variable-length columns can benefit from whole-column compression using a shared dictionary trained on a per-column and per-segment basis.
Column compression works best when combined with secondary indexes on frequently filtered columns.
Key Concepts