# Guide to Rebuilding Segments
Data in an {{ocient}} system is erasure coded to provide resilience to disk and foundation node failures. The resilience of a system depends on the parity width of the storage space, which represents the number of associated disks (or nodes) that can fail without interruption of service or data loss. For details about storage spaces and parity width, see Configure Storage Spaces.

## Data Segment Statuses

You can check the status of your system segment groups by querying the `sys.segment_groups` system catalog table. For details, see System Catalog.

This table describes the states for data segments.

| Status | Description | Recovery Process |
| --- | --- | --- |
| intact | The normal and operable status. | No recovery needed. |
| damaged | The segment failed a checksum, meaning the segment data is corrupted and unusable. | If you have sufficient parity width, you can recover a damaged segment by invoking a rebuild segment task. |
| missing | The segment is on a node or disk that is currently offline. | If the node or disk rejoins the cluster, the segment can transition to the intact status. When a disk or foundation node is permanently removed, you can perform a rebuild task to recover the data. This requires sufficient parity width. |
| rebuilding | The segment is in recovery after damage or missing segment data. | Recovery is already in progress. |

## Recovery Considerations

If a segment has the damaged or missing status, queries can proceed by reconstructing the data on demand using the remaining erasure-coded data in the segment group.

A non-intact status means that I/O performance is significantly reduced. To restore full performance, you need to run a segment rebuild task.

Segment rebuilding is not automatic. The system administrator must manually invoke it.

A segment rebuild fails, and data is lost completely, if the number of segments with the damaged status in a segment group exceeds the parity width of the storage space.

To avoid data loss:

- Provision the parity width of the storage space at or above the number of expected concurrent node failures.
- Rebuild damaged or permanently missing segments as soon as possible.

## Checking for Abnormal Segments

You can find any segments that need a rebuild by querying for segment groups with an abnormal status.

### Examples

#### Finding Faulty Segment Groups

This example query finds any segment groups that have the damaged or missing statuses.

```sql
SELECT * FROM sys.segment_groups WHERE status IN ('damaged', 'missing');
```

Output

| id | cluster_id | segment_type | status | primary_owner | loader_id | table_id | scope_id | block_size | begin_time | end_time | coding_algorithm | coding_block_size | coding_threshold | coding_width | replication | parity_cycle | created_time | rolehostd_version | commit_hash | timestamp | build_user | depth | removal_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 720576045199095878 | 074c32ec-4f92-4718-ada1-5eb55eafdcb3 | tkt segment | damaged | | 9a8f7ea7-613c-499f-ab2d-937ab0ce992e | bd18d33b-aae9-4be2-a2f4-76ecdc34c2ba | 8bc119d2-875a-47e4-b016-87fde83e77d6 | 4096 | 0 | 1 | xor parity | 4096 | 2 | 3 | 1 | 1 | 2025-02-26 20:23:34.274 | 25.0.0 | 91ab76ae57491ade0122bc7b594d5c3c6e0bf40c | 20250109 221519 | | 0 | not removed |
| 716072445571725367 | 074c32ec-4f92-4718-ada1-5eb55eafdcb3 | tkt segment | damaged | | 9a8f7ea7-613c-499f-ab2d-937ab0ce992e | 9fe79009-ee75-4429-b29b-3a614a166751 | a8c805d0-2144-46a7-8374-52cb88d64244 | 4096 | 0 | 1 | xor parity | 4096 | 2 | 3 | 1 | 1 | 2025-02-26 20:23:19.447 | 25.0.0 | 91ab76ae57491ade0122bc7b594d5c3c6e0bf40c | 20250109 221519 | | 0 | not removed |

#### Finding Clusters and Nodes with Faulty Segment Groups

Inspect the count of damaged groups by cluster and node.

```sql
SELECT
    c.name AS cluster_name,
    n.name AS node_name,
    g.status AS segment_group_status,
    seg.status AS segment_status,
    seg.kind,
    COUNT(*) AS segment_count
FROM sys.segment_groups g
LEFT JOIN sys.clusters c ON c.id = g.cluster_id
LEFT JOIN sys.stored_segments seg ON seg.segment_group_id = g.id
LEFT JOIN sys.nodes n ON n.id = seg.node_id
WHERE g.status <> 'intact'
  AND (seg.status <> 'intact' OR seg.status IS NULL)
GROUP BY 1, 2, 3, 4, 5
ORDER BY 1, 2, 3, 4, 5;
```

Output

| cluster_name | node_name | segment_group_status | segment_status | kind | segment_count |
| --- | --- | --- | --- | --- | --- |
| foundation cluster | foundation0 | damaged | | virtual | 2 |

## Starting a Segment Rebuild Task

A user with system administrator privileges can start a segment rebuild using the `CREATE TASK TYPE REBUILD` SQL statement. You cannot cancel a segment rebuild task after it is started.

Most commonly, a rebuild task repairs all damaged or missing segments across the system.

### Example

Create a rebuild task.

```sql
CREATE TASK TYPE REBUILD;
```

The system can continue to perform queries while rebuilding segments, but the process can impact I/O performance.

### Advanced Rebuild Commands

Rebuild tasks can also execute on specific foundation nodes or clusters. For information on fine-tuning rebuild tasks, see Distributed Tasks.

## Checking Rebuild Task Status

Monitor the status of current and past segment rebuild tasks from the `sys.subtasks` system catalog table. For details, see System Catalog.

```sql
SELECT * FROM sys.subtasks WHERE task_type = 'rebuild';
```

This table describes the statuses for a rebuild task.

| Status | Status Detail | Description | Next Steps |
| --- | --- | --- | --- |
| complete | no work | The segment groups were already available and healthy. No rebuilding was needed. | None |
| complete | complete | The rebuild completed successfully. | None |
| running | rebuild in progress | The rebuild is in progress. | You can monitor the progress of the rebuild task by checking the `percent_complete` and `elapsed_seconds` columns in the `sys.rebuild_tasks` system catalog table. |
| failed | rebuild not possible | The number of missing or damaged segments exceeds the parity width, and the data cannot be recovered currently. | If missing segments are available on an offline drive, you can attempt another rebuild task when that drive is made available to the system. Otherwise, you cannot recover the data. |
| failed | rebuild no space | There is not enough space available to rebuild the segment. | You can complete the rebuild by truncating other data to free up space. |
| failed | failed on node | The cluster lost its connection to the node before the rebuild was completed. | This is a transient error. You can retry the rebuild task. |
| failed | error | An unexpected internal error occurred. | Review error message details using the `error_messages` column in the `sys.rebuild_tasks` system catalog table. |

## Related Links

- Distributed Tasks
- Compute Adjacent Input and Output on Large Working Sets
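As a worked illustration of the parity-width rule stated in this guide (a rebuild cannot succeed when the number of damaged or missing segments in a segment group exceeds the parity width), here is a minimal Python sketch. It is not part of any Ocient API: the `SegmentGroup` class, its fields, and the helper functions are hypothetical names introduced only for this example, and the lowercase status strings simply mirror how the statuses are written in this guide.

```python
from dataclasses import dataclass

# Statuses that make a segment unavailable, as named in this guide.
UNAVAILABLE = {"damaged", "missing"}


@dataclass
class SegmentGroup:
    """Hypothetical in-memory view of one segment group (illustration only)."""
    group_id: str
    parity_width: int            # failures the storage space can tolerate
    segment_statuses: list[str]  # one status string per segment in the group


def unavailable_count(group: SegmentGroup) -> int:
    """Count segments that are damaged or missing."""
    return sum(1 for status in group.segment_statuses if status in UNAVAILABLE)


def needs_rebuild(group: SegmentGroup) -> bool:
    """A group needs a rebuild when any segment is damaged or missing."""
    return unavailable_count(group) > 0


def can_rebuild(group: SegmentGroup) -> bool:
    """A rebuild is possible only while the number of unavailable
    segments does not exceed the parity width."""
    return unavailable_count(group) <= group.parity_width


if __name__ == "__main__":
    groups = [
        SegmentGroup("g1", 1, ["intact", "intact", "damaged"]),
        SegmentGroup("g2", 1, ["intact", "damaged", "missing"]),
    ]
    for g in groups:
        print(g.group_id, "needs rebuild:", needs_rebuild(g),
              "| rebuild possible:", can_rebuild(g))
```

In this sketch, `g1` has one unavailable segment against a parity width of 1, so a rebuild is possible. `g2` has two, which exceeds the parity width. As the status table notes, a missing segment may still return if its drive comes back online, so a group like `g2` is not necessarily lost for good.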