Replace Nodes
An Ocient System requires a full node replacement in some scenarios. For example, the OS drive on a node might fail, which requires you to reinstall the operating system and add the node to the system again. Other events, such as a complete node loss due to a hardware failure of the motherboard or damage to NVMe storage drives, can also require you to replace a node.
An administrator should be familiar with the different types of nodes described in the Core Elements of an Ocient System. There are many similarities between the procedures for different node types, and the differences are clearly noted in the replacement procedures.
This procedure begins with the prerequisites and then guides a System Administrator through replacing any of the Ocient node types.
This procedure also includes instructions about the replacement of an OS drive (also called a boot drive) on an Ocient node because the procedure for the replacement is identical to the replacement of the entire node. Both require reinstallation of the operating system and restoration of the configuration backup files to reestablish the identity of the Ocient system.
You must work carefully with nodes that also perform the Administrator Role, because you must maintain Administrator consensus.
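As a rough illustration of why Administrator consensus matters, the sketch below assumes that the Administrator Role follows the standard Raft majority rule. The SQL Node procedure refers to the RAFT protocol, but this page does not state the exact quorum behavior of an Ocient System, so treat the rule as an assumption. The function names are hypothetical.

```python
def raft_quorum(admin_nodes: int) -> int:
    """Smallest majority of admin_nodes voters (standard Raft rule, assumed here)."""
    if admin_nodes < 1:
        raise ValueError("admin_nodes must be at least 1")
    return admin_nodes // 2 + 1


def tolerated_failures(admin_nodes: int) -> int:
    """Number of voters that can be lost while a majority remains."""
    return admin_nodes - raft_quorum(admin_nodes)


for count in (3, 2):
    print(count, "Administrator nodes tolerate", tolerated_failures(count), "failure(s)")
```

Under this assumption, a system that drops from three Administrator nodes to two has no remaining margin for another failure, which is why the role is added back to the replacement node.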
Execution of this process requires System Administrator access to the Ocient System Database and sudo privilege on the node being replaced.
You must monitor all nodes so that the loss of a node on the system triggers a node availability alert in the monitoring system.
This table shows the potential consequences of the loss of each type of node.
SQL Node Loss
Node Type | Impact of Node Loss |
---|---|
SQL | Might impact query performance in an Ocient System. |
| If the only remaining SQL Node is lost in a system, you cannot query that system. |
| If the SQL Node performs the Administrator Role, then the resiliency of the system to hardware failure could be reduced. You must maintain the number of nodes that perform the Administrator Role at all times to maintain the resiliency of the system. |
Foundation Node Loss
Node Type | Impact of Node Loss |
---|---|
Foundation | Can impact query performance or the load of new data. |
| The parity configuration of the storage space in the Ocient system determines if new data can be stored while a Foundation Node is disabled. |
Loader Node Loss
Node Type | Impact of Node Loss |
---|---|
Loader | Can impact the throughput of the data load in an Ocient System. |
| If the remaining Loader Nodes are insufficient to handle the incoming data, the unprocessed data can accumulate as a backlog. |
| If the only remaining Loader Node is lost in a system, data loading stops. |
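
To make the backlog case concrete, here is a small arithmetic sketch. The throughput figures and the function name are hypothetical, and the model assumes that every Loader Node sustains the same rate, which a real system might not.

```python
def backlog_growth_per_second(incoming_rate, per_node_rate, loader_nodes):
    """Records per second that accumulate when load capacity is below the incoming rate."""
    if incoming_rate < 0 or per_node_rate < 0 or loader_nodes < 0:
        raise ValueError("rates and node counts must be non-negative")
    capacity = per_node_rate * loader_nodes
    return max(0, incoming_rate - capacity)


# Hypothetical figures: 900,000 records/s arriving, 250,000 records/s per Loader Node.
print(backlog_growth_per_second(900_000, 250_000, 4))  # 0: four nodes keep up
print(backlog_growth_per_second(900_000, 250_000, 3))  # 150000: one node lost
```

With these assumed numbers, losing one of four Loader Nodes turns a system that keeps up into one that falls behind by 150,000 records every second until the node is replaced.
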
You must replace a node in the event of a hardware failure of certain components. You can replace an OS drive, data drives, and RAM on most nodes. A node replacement might be necessary in the case of a CPU, motherboard, or Network Interface Card (NIC) replacement. The troubleshooting steps to recover an existing node are specific to the hardware model of the node.
Please contact Ocient Support for troubleshooting issues with a node.
- The SQL Nodes on the Ocient System are active and running Ocient (except for any SQL Nodes that are being replaced as part of this procedure).
- The OS has been installed and configured on the replacement node or OS drive for the node to be replaced.
- The replacement node has been installed with the same version of the Ocient Software as the rest of the Ocient System according to the Ocient Application Installation.
- The bootstrap.conf file for the node has been created in the /var/opt/ocient directory on the replacement node. For details, see Ocient System Bootstrapping.
- For Loader Nodes only:
- The replacement node has been installed with the same version of the Loading and Transformation software as the rest of the Loader Nodes as described in LAT Packaging and Installation.
- The following files are available from the node before the failure. If these files are not available, please contact Ocient Support for assistance.
- /etc/lat/lat.conf
- /etc/lat/log4j2.xml
- /opt/lat/.lat-data/*
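The Loader Node file prerequisites above can be checked before you start. The following is a minimal sketch that only tests whether the listed paths exist. The function name and the `root` parameter (useful for checking a mounted backup instead of the live file system) are assumptions for illustration, not part of any Ocient tooling.

```python
from pathlib import Path

# Paths from the prerequisites list, written relative to the file system root.
REQUIRED_FILES = ["etc/lat/lat.conf", "etc/lat/log4j2.xml"]
REQUIRED_NONEMPTY_DIRS = ["opt/lat/.lat-data"]


def missing_lat_backups(root="/"):
    """Return the prerequisite paths that are absent (or empty) under root."""
    base = Path(root)
    missing = [p for p in REQUIRED_FILES if not (base / p).is_file()]
    for d in REQUIRED_NONEMPTY_DIRS:
        directory = base / d
        # A directory with no entries cannot satisfy the ".lat-data/*" requirement.
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(d + "/*")
    return missing
```

For example, `missing_lat_backups("/mnt/old_node")` (a hypothetical mount point) returns an empty list when all three prerequisites are present. If the list is not empty, contact Ocient Support as described above.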
The replacement procedure is similar for all node types, but there are differences due to the way that different node types coordinate with one another.
The general procedure is:
- Remove the failed node from the Ocient System.
- Register the new node with the system and accept it (see Accept Node).
- Assign roles and add to clusters where appropriate.
- Start the new node.
Before you replace a SQL Node, check whether the node performs the Administrator Role in addition to the SQL Role. If the node performs the Administrator Role, then follow the additional step in this procedure to add this role to the new node.
Most SQL Nodes are configured to also perform the Administrator Role. When you remove a node that performs the Administrator Role, you must replace it with another node that performs the Administrator Role so that you maintain the level of redundancy in the RAFT protocol of the administrator.
Check the results by using this query.
Remove the node role using the ALTER NODE command by replacing sql0 with the node name of your node:
Delete the node using the DROP NODE command by replacing sql0 with the node name of your node:
If the failed node still exists in the Ocient System, the replacement node will not be added.
Follow the Bootstrapping Node Procedure to register the new node.
When bootstrapping the replacement node, you can reuse the original node name or assign a new one. The node name of the newly registered node will be used in subsequent steps.
To add a role, you must be a system-level user. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:
If the node had the Administrator Role in Step 1, then add the Administrator Role back to the node. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:
Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:
Remove the node role using the ALTER NODE command by replacing loader0 with the name of your node:
Delete the node using the DROP NODE command, replacing loader0 with the name of your node:
If the failed node still exists in the Ocient System, the replacement node will not be added.
Follow the Bootstrapping Node Procedure to register the new node.
When bootstrapping the replacement node, you can reuse the original node name or assign a new one. The node name of the newly registered node will be used in subsequent steps.
To add a role, you must be a system-level user. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:
When adding the streamloader role, the node must be restarted if it is already running in order for the change to take effect.
Stop the LAT service by executing the following command at the shell prompt:
Restore the LAT server configuration from the backup of the previous configuration. If a backup is unavailable, please contact Ocient Support for assistance, or use the LAT configuration of another Loader Node in the cluster as a reference to modify the default files.
- Copy the lat.conf backup file to /etc/lat/lat.conf.
- Copy the log4j2.xml backup file to /etc/lat/log4j2.xml.
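The two copy steps above could be scripted along these lines. This is a sketch only: the function name and the `backup_dir` location are hypothetical, and the code does not set ownership or permissions, which this procedure handles as a separate step.

```python
import shutil
from pathlib import Path

# File names from the copy steps above.
LAT_CONFIG_FILES = ("lat.conf", "log4j2.xml")


def restore_lat_config(backup_dir, target_dir="/etc/lat"):
    """Copy the LAT configuration backups into target_dir and return the new paths."""
    backup, target = Path(backup_dir), Path(target_dir)
    missing = [name for name in LAT_CONFIG_FILES if not (backup / name).is_file()]
    if missing:
        # Refuse to restore a partial configuration.
        raise FileNotFoundError("missing backup file(s): " + ", ".join(missing))
    target.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(backup / name, target / name)) for name in LAT_CONFIG_FILES]
```

Checking for both files before copying either one avoids leaving the node with a mix of restored and default configuration. Writing to /etc/lat typically requires elevated privileges.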
Ensure that these two files have the appropriate ownership and permissions by executing the following commands:
Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:
Start the LAT service:
If a continuously streaming pipeline, such as one loading from Kafka, is operating on the Ocient System, you must update the restored LAT to run the same pipeline to restore full loading throughput. This process is different if a long-running File Load is in progress.
Kafka pipeline procedure
Pipeline Type | Procedure |
---|---|
Kafka Pipeline | 1. From the LAT Client, create the pipeline on the new LAT Node. This ensures the pipeline_id matches the existing LAT processes. |
| 2. Start the pipeline on the new LAT Node. The new Loader Node joins the existing Loader Nodes, and the load automatically rebalances across all participating Loaders with the same pipeline_id. |
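
To picture what rebalancing across Loaders means, the sketch below spreads partitions over the participating nodes in round-robin order. This is an illustration only: it is not the assignment strategy that Kafka or the LAT actually uses, and the function and node names are hypothetical.

```python
def assign_partitions(partitions, loaders):
    """Spread partitions across loaders round-robin (illustrative only)."""
    if not loaders:
        raise ValueError("at least one loader is required")
    assignment = {name: [] for name in loaders}
    for index, partition in enumerate(partitions):
        assignment[loaders[index % len(loaders)]].append(partition)
    return assignment


# Six partitions shared by two Loaders, then by three after the replacement joins.
before = assign_partitions(range(6), ["loader1", "loader2"])
after = assign_partitions(range(6), ["loader1", "loader2", "loader0_new"])
print(before)
print(after)
```

In this toy model, each Loader handles three partitions before the replacement node joins and two afterward, which is the sense in which adding the node back restores loading throughput.
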
File load pipeline
Pipeline Type | Procedure |
---|---|
File Load Pipeline | File loads from sources such as S3, a local file system, or NFS require rebalancing of partitions when a node is lost or added back to an Ocient System. If a node is lost, the recommended approach is to perform a rebalance and let the file load complete using the remaining nodes. The newly added node is not expected to participate in an already active file load. |
Follow the Bootstrapping Node Procedure to register the new node. When bootstrapping the new node, you can reuse the original node name or assign a new one. The node name of the newly registered node will be used in subsequent steps. These examples use foundation_new for the new node name.
If you would like to reuse the original node name (the name of the node that is being removed), you must first rename the original node to prevent conflict. For example, if the original node was named foundation_original, this can be performed with the command ALTER NODE "foundation_original" RENAME TO "foundation_to_drop";
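If you script this step, a small helper can build the rename statement in the form shown above. The helper name is hypothetical, and rejecting names that contain a double quote is a conservative assumption, because this page does not describe how identifiers are escaped.

```python
def rename_node_sql(current_name: str, new_name: str) -> str:
    """Build an ALTER NODE ... RENAME TO statement in the form shown in this procedure."""
    for name in (current_name, new_name):
        # Escaping rules are not documented here, so refuse anything ambiguous.
        if not name or '"' in name:
            raise ValueError(f"unsupported node name: {name!r}")
    return f'ALTER NODE "{current_name}" RENAME TO "{new_name}";'


print(rename_node_sql("foundation_original", "foundation_to_drop"))
```

The printed statement matches the example in the text and can be run from the SQL prompt by a system-level user.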
To operate as a Foundation Node, the node must be added to a storage cluster. To add a node to a storage cluster, you must be a system-level user. Run the following command from the SQL prompt by replacing storage_cluster_1 and foundation_new with your storage cluster name and node name respectively:
Restart the rolehostd process on the new node by running this command at the shell terminal on the new node:
The node should now be operational.
Remove the original node from all clusters to which it belongs. For example, if the existing node is foundation_original and your storage cluster name is storage_cluster_1, run the following command:
Delete the node using the DROP NODE command, replacing foundation_original with the name of your node:
If you renamed the failed original node in Step 1, be sure to use the new name when you remove the node from the storage cluster and when you drop the node.
In most cases, after removing the original node, administrators follow the Guide to Rebuilding Segments to rebuild segments that were on the replaced Foundation Node.
A Metadata Node is a node that only performs the Administrator Role.
Delete the node using the DROP NODE command by replacing standaloneAdmin00 with the name of your node:
If the failed node still exists in the Ocient System, the replacement node will not be added.
Follow the Bootstrapping Node Procedure to register the new node.
When bootstrapping the replacement node, you can reuse the original node name or assign a new one. The node name of the newly registered node will be used in subsequent steps.
To add a role, you must be a system-level user. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:
Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:
To add a node to the system, follow the Bootstrap the Remaining Nodes process after meeting the prerequisites described in Ocient System Bootstrapping. These steps summarize the bootstrapping process.
All prerequisites should be validated prior to bootstrapping. Ensure that you have set the correct adminUserName and adminPassword in the bootstrap.conf file. Any advanced configuration parameters should also be specified in the bootstrap.conf file.
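A quick sanity check for the two settings named above might look like the following sketch. It only tests that the key names appear somewhere in the file text. It does not parse the file or validate the values, because the bootstrap.conf syntax is not described on this page. The function name is hypothetical.

```python
# Setting names taken from the paragraph above.
REQUIRED_KEYS = ("adminUserName", "adminPassword")


def missing_bootstrap_keys(conf_text: str) -> list:
    """Return the required setting names that do not appear in conf_text."""
    return [key for key in REQUIRED_KEYS if key not in conf_text]


# Example use on the node (path from this procedure):
#   from pathlib import Path
#   print(missing_bootstrap_keys(Path("/var/opt/ocient/bootstrap.conf").read_text()))
```

An empty list means both names are present. Any name that is returned should be added to the file before you start the node.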
Step 1: Create a /var/opt/ocient/bootstrap.conf file on the node by replacing the <FIRST_NODE_ADDRESS> with the IP or Hostname of a node running the Administrator Role and the correct user name and password for your system:
Step 2: Start the new node.
This will accept the node into the system and make it ready for role assignment.
ALTER NODE
DROP NODE