Power Safety Recommendations
As with any distributed data processing system, power safety in Ocient is important for ensuring data integrity. Data integrity in Ocient concerns both newly loaded data that is still being written to storage and metadata files that describe the state of the system.
Ensure that your NVMe drives include Power Loss Protection. The Ocient system provides some protection against power loss, but that protection has limits, so follow these best practices for power management.
The Ocient Hyperscale Data Warehouse uses NVMe Solid State Drives (SSD) to store data.
To improve I/O performance, SSDs cache data in a temporary buffer in volatile memory before flushing it to non-volatile flash memory. During a normal shutdown, the host system notifies the SSD that the system is shutting down so that the SSD can flush this data to non-volatile storage. However, an unexpected power loss can cause loss or corruption of the cached data if the contents of volatile memory are not preserved. To protect against this data loss, NVMe drives validated for use with Ocient systems must include Power Loss Protection (PLP) mechanisms. PLP mechanisms are standard in enterprise and datacenter grade NVMe drives. They are typically implemented using supercapacitors that hold enough power to write cached data safely to non-volatile memory when power is lost.
While data is being loaded, it temporarily resides in memory (RAM) or disk caches. Data stored in either location is safely persisted to disk during a planned shutdown of nodes. During an unplanned power loss or system crash, the operating system or software might not be able to persist this data to disk, so Ocient is designed to handle this scenario gracefully.
Ocient uses clustering to provide redundancy that prevents metadata loss when a single node suddenly loses power. The Administrator cluster consists of multiple nodes that perform the Administrator role and maintain system metadata, so metadata survives when a single node is lost or damaged.
Foundation storage clusters use a parity configuration to protect data when a Foundation Node suddenly loses power. In such an event, in-flight queries might fail, but no data is permanently lost. Because missing data is rebuilt dynamically, Ocient can continue to serve queries while a configurable number of Foundation Nodes are without power. If power loss damages a Foundation Node or corrupts its data, the lost data can be fully rebuilt from the parity bits stored on other nodes.
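The rebuild idea can be illustrated with the simplest possible parity scheme, single XOR parity over equal-sized blocks. This is only a minimal sketch: the actual encoding Ocient uses is not described here, and the function names and sample blocks are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])


# Three hypothetical data blocks, each imagined on a different node.
data = [b"node-A..", b"node-B..", b"node-C.."]
parity = xor_blocks(data)

# Simulate losing the second block, then rebuild it from what remains.
recovered = rebuild_missing([data[0], data[2]], parity)
```

With one parity block, exactly one missing block can be rebuilt. Tolerating more simultaneous losses requires more parity, which is why the number of nodes that can lose power is tied to the parity configuration.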
These protections might not hold if more than one Administrator Node loses power at the same time, or if more Foundation Nodes lose power simultaneously than the number of parity bits in the storage cluster. Such coordinated power loss could leave an Ocient system in a state where Administrator metadata is corrupted or damaged data on Foundation Nodes cannot be fully rebuilt. To prevent simultaneous power loss, follow the best practices below.
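The two limits stated above can be written as a small check. This is a sketch for reasoning about failure scenarios, not an Ocient API; the function name and its parameters are hypothetical, and `parity_width` stands for the number of parity bits configured for the storage cluster.

```python
def within_documented_limits(admin_nodes_lost, foundation_nodes_lost, parity_width):
    """Return True when a simultaneous power loss stays inside the limits
    described above: at most one Administrator Node, and no more Foundation
    Nodes than the parity width of the storage cluster."""
    admin_ok = admin_nodes_lost <= 1
    foundation_ok = foundation_nodes_lost <= parity_width
    return admin_ok and foundation_ok


# Hypothetical scenarios for a storage cluster with parity width 2.
single_rack_outage = within_documented_limits(1, 2, parity_width=2)
site_wide_outage = within_documented_limits(3, 5, parity_width=2)
```

A result of `False` marks a scenario in which metadata or data could be unrecoverable, which is the case the power practices are meant to prevent.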
The risk of complete system loss can be greatly reduced by avoiding sudden loss of power to an Ocient System. Best practices to reduce this risk are:
- NVMe Drives must include PLP.
- Each Ocient node should be configured with redundant power supplies, with each power supply serviced by an independent electrical circuit.
- A rack or Data Center (DC) level Uninterruptible Power System (UPS) should be available to provide constant power to Ocient nodes during a sudden power outage.
- When power loss occurs and the UPS cannot provide power for an extended period of time, you must properly shut down Ocient nodes by using the Startup and Shutdown Procedure.
- Ocient Systems should be hosted in Tier III or Tier IV Data Centers. These Data Centers provide redundant power and cooling systems and guarantee minimal downtime.
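The redundant power supply recommendation lends itself to a simple inventory audit. The sketch below assumes a hypothetical inventory format, a dictionary that maps each node name to the circuit feeding each of its power supplies. Neither the format nor the function comes from Ocient tooling.

```python
def nodes_without_independent_power(inventory):
    """Return the sorted names of nodes that lack at least two power
    supplies, or whose supplies share an electrical circuit."""
    flagged = []
    for node, circuits in inventory.items():
        too_few_supplies = len(circuits) < 2
        shared_circuit = len(set(circuits)) != len(circuits)
        if too_few_supplies or shared_circuit:
            flagged.append(node)
    return sorted(flagged)


# Hypothetical inventory: one circuit label per power supply.
inventory = {
    "foundation-1": ["circuit-A", "circuit-B"],  # redundant and independent
    "foundation-2": ["circuit-A", "circuit-A"],  # both supplies on one circuit
    "admin-1": ["circuit-B"],                    # only one power supply
}

flagged = nodes_without_independent_power(inventory)
```

Here `flagged` contains `admin-1` and `foundation-2`, the two nodes that a single circuit failure would take down.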