Replace Nodes

An Ocient System requires a full node replacement in some scenarios. For example, the OS drive on a node might fail, which requires you to reinstall the node and add it back to the system. Other events, such as a complete node loss due to a hardware failure of the motherboard or damage to NVMe storage drives, can also require you to replace a node.

An administrator should be familiar with the different types of nodes described in the Key Concepts. There are many similarities between the procedures for different node types, and the differences are clearly noted in the replacement procedures.

This procedure begins with the Prerequisites and then guides a System Administrator through replacing any of the Ocient node types.

This procedure also includes instructions about the replacement of an OS drive (also called a boot drive) on an Ocient Node because the procedure for the replacement is identical to the replacement of the entire node. Both require reinstallation of the operating system and restoration of the configuration backup files to reestablish the identity of the Ocient system.

You must work carefully with nodes that also perform the Administrator Role, because you must maintain Administrator consensus.
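
To illustrate why the number of nodes with the Administrator Role matters, the following minimal sketch works through majority-quorum arithmetic. It assumes the Administrator Role follows the usual Raft majority rule (the SQL Node procedure mentions the RAFT protocol, but the exact Ocient behavior is not described here). The function names are illustrative and are not part of any Ocient tool.

```python
def quorum_size(admin_nodes: int) -> int:
    """Smallest majority of a group of `admin_nodes` voters (assumed Raft rule)."""
    if admin_nodes < 1:
        raise ValueError("at least one Administrator node is required")
    return admin_nodes // 2 + 1


def failures_tolerated(admin_nodes: int) -> int:
    """Number of voters that can be lost while a majority still remains."""
    return admin_nodes - quorum_size(admin_nodes)


def has_consensus(admin_nodes: int, healthy_nodes: int) -> bool:
    """True if the healthy voters still form a majority of the configured group."""
    return healthy_nodes >= quorum_size(admin_nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, "admin nodes: quorum", quorum_size(n),
              "tolerates", failures_tolerated(n), "failure(s)")
```

Under this assumption, a group of three Administrator nodes tolerates one loss. While the failed node is out, a second loss would break the majority, which is why the role should be restored on the replacement node.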

Execution of this process requires System Administrator access to the Ocient System Database and sudo privilege on the node being replaced.

Detection and Alerting

You must monitor all nodes so that the loss of any node in the system triggers a node availability alert in the monitoring system.
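
As a rough illustration of a node availability check, the sketch below compares the nodes you expect against the nodes that are currently reporting. How node status is collected depends on your monitoring system and is not shown. The function names and the node name foundation0 are hypothetical.

```python
def missing_nodes(expected, reporting):
    """Return the expected node names that are not currently reporting, sorted."""
    return sorted(set(expected) - set(reporting))


def availability_alerts(expected, reporting):
    """Build one alert message per missing node."""
    return [f"ALERT: node {name} is unavailable"
            for name in missing_nodes(expected, reporting)]


if __name__ == "__main__":
    expected = ["sql0", "loader0", "foundation0"]   # hypothetical inventory
    reporting = ["sql0", "foundation0"]             # hypothetical status feed
    for message in availability_alerts(expected, reporting):
        print(message)
```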

This table shows the potential consequences of the loss of each type of node.

SQL Node Loss

Node Type: SQL

Impact of Node Loss:

  • Might impact query performance in an Ocient System.
  • If the only remaining SQL Node in a system is lost, you cannot query that system.
  • If the SQL Node performs the Administrator Role, the resiliency of the system to hardware failure can be reduced. You must maintain the number of nodes that perform the Administrator Role at all times to maintain the resiliency of the system.

Foundation Node Loss

Node Type: Foundation

Impact of Node Loss:

  • Can impact query performance or the load of new data.
  • The parity configuration of the Storage Space in the Ocient System determines if new data can be stored while a Foundation Node is disabled.

Loader Node Loss

Node Type: Loader

Impact of Node Loss:

  • Can impact the throughput of the data load in an Ocient System.
  • If the remaining Loader Nodes are insufficient to handle the incoming data, a backlog of data can build up.
  • If the only remaining Loader Node in a system is lost, data loading stops.

Troubleshooting

You must replace a node in the event of a hardware failure of certain components. You can replace an OS drive, data drives, and RAM on most nodes. A node replacement might be necessary in the case of a CPU, motherboard, or Network Interface Card (NIC) replacement. The troubleshooting steps to recover an existing node are specific to the hardware model of the node.

Please contact Ocient Support for troubleshooting issues with a node.

Recovery

Prerequisites

  • The SQL Nodes on the Ocient System are active and running Ocient (except for any SQL Nodes that are being replaced as part of this procedure).
  • The OS has been installed and configured on the replacement node or OS drive for the node to be replaced.
  • The replacement node has been installed with the same version of the Ocient Software as the rest of the Ocient System according to the Ocient Application Installation.
  • The bootstrap.conf file for the node has been created in the /var/opt/ocient directory on the replacement node. For details, see Ocient System Bootstrapping.
  • For Loader Nodes only:
    • The replacement node has been installed with the same version of the Loading and Transformation software as the rest of the Loader Nodes as described in LAT Packaging and Installation.
    • The following files are available from the node before the failure. If these files are not available, please contact Ocient Support for assistance.
      • /etc/lat/lat.conf
      • /etc/lat/log4j2.xml
      • /opt/lat/.lat-data/*
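
Before starting a Loader Node replacement, you could confirm that the LAT files listed above were actually saved. The sketch below assumes a hypothetical backup layout in which the files were copied under a backup root directory with their original paths preserved (for example, <backup_root>/etc/lat/lat.conf). The function name and the layout are assumptions, not Ocient conventions.

```python
from pathlib import Path

# Paths from the prerequisites, written relative to the backup root.
REQUIRED_LAT_FILES = ("etc/lat/lat.conf", "etc/lat/log4j2.xml")
LAT_DATA_DIR = "opt/lat/.lat-data"


def missing_lat_backup_items(backup_root):
    """Return the original paths of required LAT items absent from the backup."""
    root = Path(backup_root)
    missing = ["/" + rel for rel in REQUIRED_LAT_FILES
               if not (root / rel).is_file()]
    data_dir = root / LAT_DATA_DIR
    # The prerequisite lists /opt/lat/.lat-data/*, so treat an empty or
    # absent directory as missing (an assumption of this sketch).
    if not data_dir.is_dir() or not any(data_dir.iterdir()):
        missing.append("/" + LAT_DATA_DIR + "/*")
    return missing


if __name__ == "__main__":
    for item in missing_lat_backup_items("/tmp/lat-backup"):  # hypothetical path
        print("Missing from backup:", item)
```

If the function reports any missing items, the prerequisites direct you to contact Ocient Support before you continue.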

Replacement Procedure Overview

The replacement procedure is similar for all node types, but there are differences due to the way that different node types coordinate with one another.

The general procedure is:

  1. Remove the failed node from the Ocient System.
  2. Register the new node with the system and Accept Node.
  3. Assign roles and add to clusters where appropriate.
  4. Start the new node.
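
The node-specific procedures in this document differ mainly in step order. The sketch below records the step titles from those procedures as a simple lookup and highlights one difference: for a Foundation Node, the failed node is removed after the new node is running, whereas the other node types remove the failed node first. The data structure and function names are illustrative only.

```python
# Step titles condensed from the node-specific procedures in this document.
REPLACEMENT_STEPS = {
    "SQL": [
        "Check the SQL Node for the Administrator Role",
        "Remove the failed node",
        "Register the new node",
        "Add the SQL Role",
        "Add the Administrator Role (only if the old node had it)",
        "Restart the node",
    ],
    "Loader": [
        "Remove the failed node",
        "Register the new node",
        "Add the streamloader Role",
        "Configure the LAT service",
        "Restart the node",
        "Update the LAT pipeline if required",
    ],
    "Foundation": [
        "Register the new node",
        "Add the new node to a storage cluster",
        "Restart the new node",
        "Remove the failed node",
        "Rebuild segments",
    ],
    "Metadata": [
        "Remove the failed node",
        "Register the new node",
        "Add the Administrator Role",
        "Restart the node",
    ],
}


def step_index(node_type, phrase):
    """Position of the first step whose title contains `phrase`."""
    for i, step in enumerate(REPLACEMENT_STEPS[node_type]):
        if phrase in step.lower():
            return i
    raise ValueError(f"no step containing {phrase!r} for {node_type}")


def removes_failed_node_first(node_type):
    """True if the failed node is removed before the new node is registered."""
    return (step_index(node_type, "remove the failed")
            < step_index(node_type, "register the new"))


if __name__ == "__main__":
    for node_type in REPLACEMENT_STEPS:
        print(node_type, "removes failed node first:",
              removes_failed_node_first(node_type))
```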

SQL Node Replacement Procedure

Step 1: Check SQL Node for the Administrator Role.

Before you replace a SQL Node, check whether the node performs the Administrator Role in addition to the SQL Role. If the node performs the Administrator Role, then follow the additional step (Step 5) to add this role to the new node.

Most SQL Nodes are configured to also perform the Administrator Role. When you remove a node that performs the Administrator Role, you must replace it with another node that performs the Administrator Role so that you maintain the level of redundancy in the Raft protocol that the Administrator Role uses.

Check whether the node performs the Administrator Role by using this query.

SQL

Text


Step 2: Remove the failed SQL Node from the Ocient System.

Remove the node role using the ALTER NODE command by replacing sql0 with the node name of your node:

SQL


Delete the node using the DROP NODE command by replacing sql0 with the node name of your node:

SQL


If the failed node still exists in the Ocient System, the replacement node will not be added.

Step 3: Register the new node with the Ocient System.

Follow the Bootstrapping Node Procedure to register the new node.

When bootstrapping the replacement node, you can reuse the original node name or assign a new one. The node name of the newly registered node will be used in subsequent steps.

Step 4: Add the SQL Role to the new node.

To add a role, you must be a system-level user. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:

SQL


Step 5: Add the Administrator Role. (Only for SQL Nodes with the Administrator Role)

If the node had the Administrator Role in Step 1, then add the Administrator Role back to the node. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:

SQL


Step 6: Restart the Node.

Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:

Shell



Loader Node Replacement Procedure

Step 1: Remove the failed Loader Node from the Ocient System.

Remove the node role using the ALTER NODE command by replacing loader0 with the name of your node:

SQL


Delete the node using the DROP NODE command, replacing loader0 with the name of your node:

SQL


If the failed node still exists in the Ocient System, the replacement node will not be added.

Step 2: Register the new node with the Ocient System.

Follow the Bootstrapping Node Procedure to register the new node.

When bootstrapping the replacement node, you can reuse the original node name or assign a new one. The node name of the newly registered node will be used in subsequent steps.

Step 3: Add the streamloader Role to the new node.

To add a role, you must be a system-level user. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:

SQL


After you add the streamloader role, you must restart the node if it is already running for the change to take effect.

Step 4: Configure the LAT service.

Stop the LAT service by executing the following command at the shell prompt:

Shell


Restore the LAT server configuration from the backup of the previous configuration. If a backup is unavailable, contact Ocient Support for assistance, or use the LAT configuration of another Loader Node on the cluster as a reference to modify the default files.

  • Copy the lat.conf backup file to /etc/lat/lat.conf.
  • Copy the log4j2.xml backup file to /etc/lat/log4j2.xml.

Ensure that these two files have the appropriate ownership and permissions by executing the following commands:

Shell


Step 5: Restart the Node.

Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:

Shell


Start the LAT service:

Shell


Step 6: Update the LAT pipeline if required.

If a continuously streaming pipeline, such as one loading from Kafka, is operating on the Ocient System, you must update the restored LAT to run the same pipeline to restore full loading throughput. This process is different if a long-running File Load is in process.

Kafka Pipeline Procedure

Pipeline Type: Kafka Pipeline

Procedure:

  1. From the LAT Client, create the pipeline on the new LAT node. This ensures the pipeline_id matches the existing LAT processes.
  2. Start the pipeline on the new LAT node. The new Loader Node joins the existing Loader Nodes, and the load automatically rebalances across all participating Loader Nodes with the same pipeline_id.

File Load Pipeline Procedure

Pipeline Type: File Load Pipeline

Procedure:

File loads from sources such as S3, a local file system, or NFS require rebalancing of partitions when a node is lost or added back to an Ocient System.

If a node is lost, perform a rebalance and let the file load complete using the remaining nodes. The newly added node is not expected to participate in a file load that is already active.



Foundation Node Replacement Procedure

Step 1: Register the new node with the Ocient System.

Follow the Bootstrapping Node Procedure to register the new node. When bootstrapping the new node, you can reuse the original node name or assign a new one. The node name of the newly registered node will be used in subsequent steps. These examples use foundation_new for the new node name.

To reuse the original node name (the name of the node that is being removed), you must first rename the original node to prevent a conflict. For example, if the original node is named foundation_original, rename it with the command ALTER NODE "foundation_original" RENAME TO "foundation_to_drop";

Step 2: Add the new node to a specific storage cluster.

To operate as a Foundation Node, the node must be added to a storage cluster. To add a node to a storage cluster, you must be a system-level user. Run the following command from the SQL prompt by replacing storage_cluster_1 and foundation_new with your storage cluster name and node name respectively:

SQL


Step 3: Restart the new node.

Restart the rolehostd process on the new node by running this command at the shell terminal on the new node:

Shell


The node should now be operational.

Step 4: Remove the failed Foundation Node from the Ocient System.

Remove the original node from all clusters to which it belongs. For example, if the existing node is foundation_original and your storage cluster name is storage_cluster_1, run the following command:

SQL


Delete the node using the DROP NODE command, replacing foundation_original with the name of your node:

SQL


If you renamed the failed original node in Step 1, be sure to use the correct name when dropping from the storage cluster and dropping the node.

Step 5: Rebuild segments.

In most cases, after removing the original node, administrators follow the Guide to Rebuilding Segments to rebuild segments that were on the replaced Foundation Node.



Metadata Node Replacement Procedure

A Metadata Node is a node that only performs the Administrator Role.

Step 1: Remove the failed Administrator node from the Ocient System.

Delete the node using the DROP NODE command by replacing standaloneAdmin00 with the name of your node:

SQL


If the failed node still exists in the Ocient System, the replacement node will not be added.

Step 2: Register the new node with the Ocient System.

Follow the Bootstrapping Node Procedure to register the new node.

When bootstrapping the replacement node, you can reuse the original node name or assign a new one. The node name of the newly registered node will be used in subsequent steps.

Step 3: Add the Administrator Role to the new node.

To add a role, you must be a system-level user. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:

SQL


Step 4: Restart the Node.

Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:

Shell



Bootstrapping Node Procedure

To add a node to the system, follow the Bootstrap the Remaining Nodes process after meeting the prerequisites described in Ocient System Bootstrapping. These steps summarize the bootstrapping process.

All prerequisites should be validated prior to bootstrapping. Ensure that you have set the correct adminUserName and adminPassword in the bootstrap.conf file. Any advanced configuration parameters should also be specified in the bootstrap.conf file.
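
As one way to validate the file before bootstrapping, the sketch below checks that adminUserName and adminPassword are present and non-empty. It assumes these appear as flat top-level "key: value" lines in bootstrap.conf, which is an assumption about the file layout. The function name is hypothetical, and a full YAML parser would be more robust for nested or advanced settings.

```python
REQUIRED_KEYS = ("adminUserName", "adminPassword")


def missing_bootstrap_keys(conf_text):
    """Return required keys that are absent or empty in the given file text.

    Assumes flat top-level `key: value` lines; nested YAML is not handled.
    """
    found = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        found[key.strip()] = value.strip().strip("\"'")
    return [key for key in REQUIRED_KEYS if not found.get(key)]


if __name__ == "__main__":
    # Hypothetical file content for illustration only.
    sample = "adminUserName: admin\nadminPassword:\n"
    print("Missing or empty keys:", missing_bootstrap_keys(sample))
```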

Step 1: Create a /var/opt/ocient/bootstrap.conf file on the node. Replace <FIRST_NODE_ADDRESS> with the IP address or hostname of a node running the Administrator Role, and specify the correct user name and password for your system:

YAML


Step 2: Start the new node.

Shell


This will accept the node into the system and make it ready for role assignment.