Guide to Rebuilding Segments

As described in the Key Concepts section of this user guide, data in the system is erasure coded to provide resilience to disk and Foundation node failures. The parity width of the storage space is the number of associated disks (or nodes) that can fail without interruption of service or loss of data. Data segments can exist in these states:

  • INTACT - the normal state.
  • DAMAGED - the segment failed a checksum and thus its data is considered corrupted and unusable. Damaged segments can be recovered by invoking a rebuild segment task.
  • MISSING - the segment is on a node or disk that is currently offline. The segment will become INTACT if the node or disk rejoins the cluster. However, if a disk or Foundation node is permanently removed, a rebuild task is needed to recover the data.
  • REBUILDING - damaged or missing segment data is being recovered.

If a segment is DAMAGED or MISSING, queries can proceed by reconstructing the data on demand using the remaining erasure-coded data in the segment group. In such instances, I/O performance will be significantly reduced. A segment rebuild task should be run to bring the system back into the normal, full-performance state.
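To make on-demand reconstruction concrete, the sketch below uses a toy single-parity XOR scheme (parity width 1). This is an illustration only: the erasure code the system actually uses is not described in this guide, and the names `xor_blocks` and `reconstruct_missing` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_missing(blocks, parity):
    """Rebuild the single unavailable block (marked None) from survivors plus parity."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        # With one parity block, exactly one lost block can be recovered.
        raise ValueError("single parity can rebuild exactly one missing block")
    survivors = [block for block in blocks if block is not None]
    return xor_blocks(survivors + [parity])


# Toy segment group: three data blocks and one parity block (assumed layout).
data = [b"seg0", b"seg1", b"seg2"]
parity = xor_blocks(data)

# Block 1 is unavailable; a read reconstructs it from the rest of the group.
degraded = [data[0], None, data[2]]
recovered = reconstruct_missing(degraded, parity)
print(recovered)  # b'seg1'
```

Every read of the unavailable block has to touch all surviving blocks in the group, which is why I/O performance drops until a rebuild writes the recovered data back.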

A segment rebuild will fail, and data will be lost completely, if the number of DAMAGED segments in a segment group exceeds the parity width of the storage space. To avoid data loss:

  • Provision the storage space’s parity width at or above the number of expected concurrent node failures.
  • Rebuild damaged or permanently missing segments as soon as possible.
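The relationship between parity width and recoverability can be sketched as a small decision function. This is a minimal model of the rules stated in this guide, not the system's implementation; the function name `assess_group` and the returned labels are hypothetical.

```python
def assess_group(damaged, missing, parity_width):
    """Classify one segment group from its damaged/missing counts (illustrative labels)."""
    if damaged > parity_width:
        # More corrupted segments than parity can cover: a rebuild will fail.
        return "data_lost"
    if damaged + missing > parity_width:
        # Recoverable only if enough offline nodes or disks rejoin the cluster.
        return "blocked_until_missing_return"
    if damaged + missing > 0:
        # Queries still work in degraded mode; a rebuild restores full performance.
        return "rebuild_recommended"
    return "healthy"


# Example with an assumed parity width of 2.
print(assess_group(damaged=0, missing=0, parity_width=2))  # healthy
print(assess_group(damaged=1, missing=1, parity_width=2))  # rebuild_recommended
print(assess_group(damaged=1, missing=2, parity_width=2))  # blocked_until_missing_return
print(assess_group(damaged=3, missing=0, parity_width=2))  # data_lost
```

The second case shows why prompt rebuilds matter: the group is still readable, but one more failure before the rebuild would leave it unrecoverable until a missing segment returns.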

Segment rebuilding is not automatic; the system administrator must invoke it manually.

Checking for abnormal segments

Segments needing a rebuild can be found by querying for segment groups with abnormal status:

SQL


To inspect the count of damaged groups by cluster and node, the system tables can be queried with:

SQL


Starting a segment rebuild task

A user with System Administrator privileges can start a segment rebuild using the CREATE TASK TYPE rebuild command. A segment rebuild task cannot be cancelled after it is started. Most commonly, a rebuild is issued as follows to rebuild all damaged or missing segments:

SQL


Queries can run both before and during a rebuild of missing or damaged segments, but I/O performance will be reduced.

Checking rebuild task status

The status of current and past segment rebuild tasks can be monitored from the subtasks virtual table in the system catalog:

SQL


Possible statuses are:

| Status | Status detail | Description | Next steps |
|---|---|---|---|
| complete | no_work | The segment groups were already available and healthy. No rebuilding was needed. | None |
| complete | complete | The rebuild completed successfully. | None |
| running | rebuild_in_progress | The rebuild is in progress. | Monitor progress with the percent_complete and elapsed_seconds columns in the sys.rebuild_tasks virtual table. |
| failed | rebuild_not_possible | The number of missing or damaged segments exceeds the parity width, and data cannot currently be recovered. | If the missing segments are on an offline drive, attempt the rebuild again once that drive is available to the system. Otherwise, the data cannot be recovered. |
| failed | rebuild_no_space | There is not enough space available to rebuild the segment. | Truncate other data to free enough space to complete the rebuild. |
| failed | failed_on_node | The cluster lost its connection to the node before the rebuild was completed. | This is a transient error. Retry the rebuild task. |
| failed | error | An unexpected internal error occurred. | Review the error message details in the error_messages column of the sys.rebuild_tasks virtual table. |
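For scripted monitoring, the status and status detail pairs listed above could be mapped to follow-up actions. The sketch below assumes the two values have already been read from the system catalog by some client code that is not shown; the function name `next_step` and the action labels are hypothetical, not part of the product.

```python
# Map (status, status detail) pairs from the list above to illustrative action labels.
NEXT_STEPS = {
    ("complete", "no_work"): "none",
    ("complete", "complete"): "none",
    ("running", "rebuild_in_progress"): "monitor_progress",
    ("failed", "rebuild_not_possible"): "bring_offline_drives_back",
    ("failed", "rebuild_no_space"): "free_space",
    ("failed", "failed_on_node"): "retry",
    ("failed", "error"): "inspect_error_messages",
}


def next_step(status, status_detail):
    """Return the suggested follow-up, or 'unknown' for an unlisted combination."""
    return NEXT_STEPS.get((status, status_detail), "unknown")


print(next_step("failed", "failed_on_node"))  # retry
print(next_step("running", "rebuild_in_progress"))  # monitor_progress
```

Only `failed_on_node` is described as transient, so an automated retry policy would reasonably be limited to that case and leave the other failures to an administrator.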

Advanced Rebuild Commands

Rebuild tasks can also be run on a specific Foundation cluster or node as seen in the following examples:

SQL