Guide to Rebuilding Segments

As described in the Key Concepts section of this user guide, data in the system is erasure coded to provide resilience against disk and Foundation node failures. The parity width of a storage space is the number of associated disks (or nodes) that can fail without interrupting service or losing data. A data segment can be in one of these states:
INTACT - the normal state.

DAMAGED - the segment failed a checksum, so its data is considered corrupted and unusable. Damaged segments can be recovered by running a segment rebuild task.

MISSING - the segment is on a node or disk that is currently offline. The segment becomes INTACT again if the node or disk rejoins the cluster. If a disk or Foundation node is permanently removed, however, a rebuild task is needed to recover the data.

REBUILDING - damaged or missing segment data is being recovered.
If a segment is DAMAGED or MISSING, queries can proceed by reconstructing the data on demand using the remaining erasure-coded data in the segment group. In such instances, I/O performance will be significantly reduced. A segment rebuild task should be run to bring the system back into the normal, full-performance state.
A segment rebuild will fail, and the data will be lost, if the number of damaged or permanently missing segments in a segment group exceeds the parity width of the storage space. To avoid data loss:
Provision the storage space’s parity width at or above the number of expected concurrent node failures.

Rebuild damaged or permanently missing segments as soon as possible.
Segment rebuilding is not automatic; the system administrator must invoke it manually.
Checking for abnormal segments
Segments needing a rebuild can be found by querying for segment groups with abnormal status:
SQL
SELECT * FROM sys.segment_groups WHERE status IN ('DAMAGED', 'MISSING', 'REBUILDING');
To count the non-intact segments by cluster and node, query the system tables:
SQL
SELECT
  c.name as cluster_name,
  n.name as node_name,
  g.status as segment_group_status,
  seg.status as segment_status,
  seg.kind,
  count(*) as segment_count
FROM
  sys.segment_groups g
LEFT JOIN
  sys.clusters c
  ON c.id = g.cluster_id
LEFT JOIN
  sys.stored_segments seg
  ON seg.segment_group_id = g.id
LEFT JOIN
  sys.nodes n
  ON n.id = seg.node_id
WHERE
  g.status <> 'INTACT'
  AND (seg.status <> 'INTACT' OR seg.status IS NULL)
  GROUP BY 1,2,3,4,5
  ORDER BY 1,2,3,4,5;
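To judge how close each segment group is to exceeding the parity width, it can help to count the non-intact segments per group. The following is a minimal sketch that uses only the tables and columns that appear in the queries above. It assumes that every non-intact segment has a row in sys.stored_segments, which may not hold for all missing segments. The source does not name a column that holds the parity width, so the resulting counts must be compared against the storage space's configured parity width by hand.

```sql
-- Sketch: count non-intact segments per segment group.
-- Assumption: each damaged, missing, or rebuilding segment has a row in sys.stored_segments.
SELECT
  g.id AS segment_group_id,
  g.status AS segment_group_status,
  count(*) AS non_intact_segments
FROM
  sys.segment_groups g
JOIN
  sys.stored_segments seg
  ON seg.segment_group_id = g.id
WHERE
  seg.status <> 'INTACT'
GROUP BY 1, 2
ORDER BY 3 DESC;
```

Groups at the top of the result have the most non-intact segments and are the most urgent candidates for a rebuild.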
Starting a segment rebuild task
A user with System Administrator privileges can start a segment rebuild using the CREATE TASK TYPE rebuild command. A segment rebuild task cannot be cancelled after it is started. Most commonly, a rebuild is issued as follows to rebuild all damaged or missing segments:
SQL
CREATE TASK TYPE rebuild;
Queries can proceed before the missing/damaged segments are rebuilt and during rebuild, but I/O performance will be reduced.
Checking rebuild task status
The status of current and past segment rebuild tasks can be monitored from the subtasks virtual table in the system catalog:
SQL
SELECT * FROM sys.subtasks WHERE task_type = 'rebuild';
Possible statuses are:
Status	Status detail	Description	Next Steps
complete	no_work	The segment groups were already available and healthy. No rebuilding was needed.	None
complete	complete	Rebuild completed successfully.	None
running	rebuild_in_progress	Rebuild is in progress.	The percent_complete and elapsed_seconds columns in the sys.rebuild_tasks virtual table can be used to monitor progress.
failed	rebuild_not_possible	The number of missing or damaged segments exceeds the parity width and data cannot currently be recovered.	If missing segments are available on an offline drive, rebuild can be attempted when that drive is made available to the system. Otherwise, data cannot be recovered.
failed	rebuild_no_space	There is not enough space available to rebuild the segment.	Truncation of other data is required to free space in order to complete a rebuild.
failed	failed_on_node	The cluster lost its connection to the node before the rebuild was completed.	This is a transient error. You can retry the rebuild task.
failed	error	An unexpected internal error occurred.	Review error message details using the error_messages column in the sys.rebuild_tasks virtual table.
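The progress and error columns mentioned in the table can be read directly from the sys.rebuild_tasks virtual table. This is a minimal sketch that selects only the three columns named above. Any other columns of that table, such as a task identifier to filter on, are not documented here and are therefore left out.

```sql
-- Sketch: show progress and error details for rebuild tasks.
-- Only columns named in this guide are selected; the rest of the table layout is not assumed.
SELECT
  percent_complete,
  elapsed_seconds,
  error_messages
FROM
  sys.rebuild_tasks;
```

Rerunning the query while a rebuild is running shows how percent_complete and elapsed_seconds advance.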
Advanced Rebuild Commands
Rebuild tasks can also be run on a specific Foundation cluster or node as seen in the following examples:
SQL
-- Rebuild segments on a single cluster
CREATE TASK TYPE rebuild OPTIONS ('CLUSTER', 'my_lts_cluster1');

-- Rebuild the whole system
CREATE TASK TYPE rebuild;

-- Rebuild segments on a specific node
CREATE TASK TYPE rebuild OPTIONS ('CLUSTER', 'my_lts_cluster1', '{"node": "my_node1"}');

Updated 17 Oct 2023
