Replace Nodes

An Ocient System requires a full node replacement in some scenarios. For example, the OS drive on a node might fail, which requires you to reinstall the operating system and add the node back to the system. Other events, such as the complete loss of a node due to a motherboard hardware failure or damage to the NVMe storage drives, can also require a node replacement.
Administrators should be familiar with the different node types described in Core Elements of an Ocient System. The procedures for the different node types have many similarities, and the replacement procedures note the differences.
This procedure begins with Prerequisites and then guides a System Administrator through replacing any of the Ocient node types.
This procedure also includes instructions about the replacement of an OS drive (also called a boot drive) on an Ocient node because the procedure for the replacement is identical to the replacement of the entire node. Both require reinstallation of the operating system and restoration of the configuration backup files to reestablish the identity of the Ocient system.
You must work carefully with nodes that also perform the Administrator Role, because you must maintain Administrator consensus.
Execution of this process requires System Administrator access to the Ocient System Database and sudo privilege on the node being replaced.
Detection and Alerting
Monitor all nodes so that the loss of any node on the system triggers a node availability alert in the monitoring system.
This table shows the potential consequences of the loss of each type of node.
SQL Node Loss
Node Type	Impact of Node Loss
SQL	Might impact query performance in an Ocient System.
	If the only remaining SQL Node is lost in a system, you cannot query that system.
	If the SQL Node performs the Administrator Role, the resiliency of the system to hardware failure might be reduced. You must maintain the number of nodes that perform the Administrator Role at all times to maintain the resiliency of the system.
Foundation Node Loss
Node Type	Impact of Node Loss
Foundation	Can impact query performance or the load of new data.
	The parity configuration of the storage space in the Ocient System determines whether new data can be stored while a Foundation Node is disabled.
Loader Node Loss
Node Type	Impact of Node Loss
Loader	Can impact the throughput of the data load in an Ocient System.
	If the remaining Loader Nodes are insufficient to handle the incoming data, the incoming data can accumulate into a backlog.
	If the only remaining Loader Node is lost in a system, data loading stops.
Troubleshooting
You must replace a node in the event of a hardware failure of certain components. You can replace an OS drive, data drives, and RAM on most nodes. A node replacement might be necessary in the case of a CPU, motherboard, or Network Interface Card (NIC) replacement. The troubleshooting steps to recover an existing node are specific to the hardware model of the node.
Please contact Ocient Support for troubleshooting issues with a node.
Recovery
Prerequisites
The SQL Nodes on the Ocient System are active and running Ocient (except for any SQL Nodes that are being replaced as part of this procedure).

The OS has been installed and configured on the replacement node or OS drive for the node to be replaced.

The replacement node has been installed with the same version of the Ocient software as the rest of the Ocient System, as described in Ocient Application Installation.

The bootstrap.conf file for the node has been created in the /var/opt/ocient directory on the replacement node. For details, see Ocient System Bootstrapping.

For Loader Nodes only:

The replacement node has been installed with the same version of the Loading and Transformation (LAT) software as the rest of the Loader Nodes, as described in LAT Packaging and Installation.

The following files are available from the node before the failure. If these files are not available, please contact Ocient Support for assistance.

/etc/lat/lat.conf

/etc/lat/log4j2.xml

/opt/lat/.lat-data/*
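Before starting a Loader Node replacement, it can help to confirm that the saved LAT files are actually present. The following is a minimal sketch, not part of the Ocient tooling. It assumes a hypothetical backup directory that holds lat.conf, log4j2.xml, and a lat-data directory containing a copy of /opt/lat/.lat-data; adjust the names to match however your backups are stored.

```shell
# Report which LAT backup files are missing from a backup directory.
# The directory layout (lat.conf, log4j2.xml, lat-data/) is an assumption
# made for this sketch, not an Ocient convention.
check_lat_backup() {
  backup_dir="$1"
  missing=0
  for f in lat.conf log4j2.xml; do
    if [ ! -f "$backup_dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  # The saved .lat-data copy must exist and contain at least one entry.
  if [ ! -d "$backup_dir/lat-data" ] || [ -z "$(ls -A "$backup_dir/lat-data")" ]; then
    echo "missing: lat-data"
    missing=1
  fi
  return "$missing"
}

# Example (hypothetical path):
#   check_lat_backup /backups/loader0 || echo "Contact Ocient Support"
```

The function returns a nonzero status when anything is missing, so it can gate the rest of a replacement script.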
Replacement Procedure Overview
The replacement procedure is similar for all node types, but there are differences due to the way that different node types coordinate with one another.
The general procedure is:
1. Remove the failed node from the Ocient System.
2. Register the new node with the system and accept the node.
3. Assign roles and add the node to clusters where appropriate.
4. Start the new node.
SQL Node Replacement Procedure
Step 1: Check SQL Node for the Administrator Role.
Before you replace a SQL Node, check whether the node performs the Administrator Role in addition to the SQL role. If it does, you must also follow Step 5 to add the Administrator Role to the new node.
Most SQL Nodes are configured to also perform the Administrator Role. When you remove a node that performs the Administrator Role, you must replace it with another node that performs the Administrator Role so that you maintain the level of redundancy in the Raft protocol of the administrator.
List the nodes that perform the Administrator Role by using this query.
SQL
SELECT
  n.name AS node_name,
  sr.service_role_type
FROM sys.nodes n
LEFT JOIN sys.service_roles sr ON sr.node_id = n.id
WHERE sr.service_role_type = 'admin';
Text
node_name  service_role_type
---------  -----------------
sql0       admin
sql1       admin
sql2       admin
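To see why the count of Administrator nodes matters, the arithmetic of a majority-based consensus protocol such as Raft can be sketched as follows. This assumes that standard Raft majority rules apply to the Administrator Role; the function name is illustrative only.

```shell
# Print the majority (quorum) size and the number of node failures that a
# group of N consensus participants can tolerate under majority voting.
admin_fault_tolerance() {
  n="$1"
  quorum=$(( n / 2 + 1 ))        # strict majority of N voters
  tolerated=$(( (n - 1) / 2 ))   # failures that still leave a majority
  echo "admins=$n quorum=$quorum tolerated_failures=$tolerated"
}

admin_fault_tolerance 3   # prints: admins=3 quorum=2 tolerated_failures=1
admin_fault_tolerance 2   # prints: admins=2 quorum=2 tolerated_failures=0
```

With the three Administrator nodes in the sample output, losing one node still leaves a majority, but leaving the system at two Administrator nodes means no further failure can be tolerated.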
Step 2: Remove the Failed SQL Node from the Ocient System.
Remove the SQL role with the ALTER NODE command, replacing sql0 with the name of your node:
SQL
ALTER NODE "sql0" REMOVE ROLE sql;
Delete the node with the DROP NODE command, replacing sql0 with the name of your node:
SQL
DROP NODE "sql0";
If the failed node still exists in the Ocient System, the replacement node will not be added.
Step 3: Register the New Node with the Ocient System.
Follow the Bootstrapping Node Procedure to register the new node.
When you bootstrap the replacement node, you can reuse the original node name or assign a new one. Subsequent steps use the node name of the newly registered node.
Step 4: Add the SQL Role to the New Node.
To add a role, you must be a system-level user. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:
SQL
ALTER NODE "<NODE_NAME_OF_THE_NEW_NODE>" ADD ROLE sql;
Step 5: Add the Administrator Role. (Only for SQL Nodes with the Administrator Role)
If the node had the Administrator Role in Step 1, then add the Administrator Role back to the node. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:
SQL
ALTER NODE "<NODE_NAME_OF_THE_NEW_NODE>" ADD ROLE admin;
Step 6: Restart the Node.
Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:
Shell
sudo systemctl restart rolehostd
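The SQL statements from Steps 2 through 5 can be collected in order with a small helper. This is a sketch for review purposes only: the function name and its arguments are hypothetical, and it only prints the statements shown above rather than running them against the system.

```shell
# Print, in order, the SQL statements for replacing a SQL Node.
# Arguments: old node name, new node name, and "yes" if the old node
# also performed the Administrator Role (as found in Step 1).
sql_node_replacement_sql() {
  old="$1"; new="$2"; had_admin="$3"
  printf 'ALTER NODE "%s" REMOVE ROLE sql;\n' "$old"
  printf 'DROP NODE "%s";\n' "$old"
  # Registration happens outside SQL, so emit a reminder as a SQL comment.
  printf '%s\n' "-- Bootstrap \"$new\" before running the remaining statements."
  printf 'ALTER NODE "%s" ADD ROLE sql;\n' "$new"
  if [ "$had_admin" = "yes" ]; then
    printf 'ALTER NODE "%s" ADD ROLE admin;\n' "$new"
  fi
}

sql_node_replacement_sql sql0 sql0_new yes
```

Printing the statements first makes it easy to check the node names before you run anything, which matters because DROP NODE must complete before the replacement node can be added.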
Loader Node Replacement Procedure
Step 1: Remove the Failed Loader Node from the Ocient System.
Remove the streamloader role with the ALTER NODE command, replacing loader0 with the name of your node:
SQL
ALTER NODE "loader0" REMOVE ROLE streamloader;
Delete the node using the DROP NODE command, replacing loader0 with the name of your node:
SQL
DROP NODE "loader0";
If the failed node still exists in the Ocient System, the replacement node will not be added.
Step 2: Register the New Node with the Ocient System.
Follow the Bootstrapping Node Procedure to register the new node.
When you bootstrap the replacement node, you can reuse the original node name or assign a new one. Subsequent steps use the node name of the newly registered node.
Step 3: Add the streamloader Role to the New Node.
To add a role, you must be a system-level user. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:
SQL
ALTER NODE "<NODE_NAME_OF_THE_NEW_NODE>" ADD ROLE streamloader;
If the node is already running when you add the streamloader role, you must restart the node for the change to take effect.
Step 4: Configure the LAT Service.
Stop the LAT service by executing the following command at the shell prompt:
Shell
sudo systemctl stop lat
Restore the LAT server configuration from the backup of the previous configuration. If a backup is unavailable, contact Ocient Support for assistance, or use the LAT configuration of another Loader Node in the cluster as a reference for modifying the default files.
Copy the lat.conf backup file to /etc/lat/lat.conf.

Copy the log4j2.xml backup file to /etc/lat/log4j2.xml.
Ensure that these two files have the appropriate ownership and permissions by executing the following commands:
Shell
sudo chown lat:lat /etc/lat/lat.conf /etc/lat/log4j2.xml
sudo chmod 644 /etc/lat/lat.conf /etc/lat/log4j2.xml
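To confirm that the ownership and permissions were applied, a check along these lines can be used. It is a sketch that relies on GNU stat from coreutils, which is an assumption about the host; the function name is illustrative.

```shell
# Verify that a file has the expected owner:group and octal mode.
# Requires GNU stat (coreutils), as found on typical Linux hosts.
check_owner_mode() {
  file="$1"; want_owner="$2"; want_mode="$3"
  actual=$(stat -c '%U:%G %a' "$file") || return 2
  if [ "$actual" = "$want_owner $want_mode" ]; then
    echo "ok: $file"
  else
    echo "mismatch: $file is '$actual', expected '$want_owner $want_mode'"
    return 1
  fi
}

# Usage on a Loader Node:
#   check_owner_mode /etc/lat/lat.conf lat:lat 644
#   check_owner_mode /etc/lat/log4j2.xml lat:lat 644
```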
Step 5: Restart the Node.
Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:
Shell
sudo systemctl restart rolehostd
Start the LAT service:
Shell
sudo systemctl start lat
Step 6: Update the LAT Pipeline if Required.
If a continuously streaming pipeline, such as a load from Kafka, is operating on the Ocient System, you must update the restored LAT to run the same pipeline in order to restore full loading throughput. This process is different if a long-running File Load is in progress.
Kafka Pipeline
Pipeline Type	Procedure
Kafka Pipeline	1. From the LAT Client, create the pipeline on the new LAT Node. This ensures the pipeline_id matches the existing LAT processes.
	2. Start the pipeline on the new LAT Node. The new Loader Node joins the existing Loader Nodes, and the load automatically rebalances across all participating Loader Nodes with the same pipeline_id.
File Load Pipeline
Pipeline Type	Procedure
File Load Pipeline	File loads from sources such as S3, a local file system, or NFS require rebalancing of partitions when a node is lost or added back to an Ocient System.
	If a node is lost, perform a rebalance and let the file load complete using the remaining nodes. The newly added node is not expected to participate in an already active file load.
Foundation Node Replacement Procedure
Step 1: Register the New Node with the Ocient System.
Follow the Bootstrapping Node Procedure to register the new node. When you bootstrap the new node, you can reuse the original node name or assign a new one. Subsequent steps use the node name of the newly registered node. These examples use foundation_new for the new node name.
To reuse the original node name (the name of the node that is being removed), you must first rename the original node to prevent a conflict. For example, if the original node is named foundation_original, rename it with the command ALTER NODE "foundation_original" RENAME TO "foundation_to_drop";
Step 2: Add the New Node to a Specific Storage Cluster.
To operate as a Foundation Node, the node must be added to a storage cluster. To add a node to a storage cluster, you must be a system-level user. Run the following command from the SQL prompt by replacing storage_cluster_1 and foundation_new with your storage cluster name and node name respectively:
SQL
ALTER CLUSTER "storage_cluster_1" ADD PARTICIPANTS "foundation_new";
Step 3: Restart the New Node.
Restart the rolehostd process on the new node by running this command at the shell terminal on the new node:
Shell
sudo systemctl restart rolehostd
The node should now be operational.
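One way to confirm that the service came back after the restart is to poll it for a short time. The helper below is a generic retry sketch with a hypothetical name. Note the assumption: systemd reporting the rolehostd unit as active only shows that the process is running, not that the node has fully rejoined the storage cluster.

```shell
# Retry a command until it succeeds or the attempts run out.
# Arguments: number of attempts, delay in seconds, then the command to run.
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Usage on the new node: poll for up to about 60 seconds.
#   wait_until 30 2 systemctl is-active --quiet rolehostd
```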
Step 4: Remove the Failed Foundation Node from the Ocient System.
Remove the original node from all clusters to which it belongs. For example, if the existing node is foundation_original and your storage cluster name is storage_cluster_1, run the following command:
SQL
ALTER CLUSTER "storage_cluster_1" DROP PARTICIPANTS "foundation_original";
Delete the node using the DROP NODE command, replacing foundation_original with the name of your node:
SQL
DROP NODE "foundation_original";
If you renamed the failed original node in Step 1, be sure to use the correct name when dropping from the storage cluster and dropping the node.
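The ordering of Steps 1 through 4, including the rename that is needed when the original name is reused, can be sketched as a helper that prints the statements. The function name and the _to_drop suffix are illustrative choices for this sketch; it does not execute anything.

```shell
# Print, in order, the SQL statements for replacing a Foundation Node.
# Arguments: storage cluster name, old node name, new node name.
# If the new node reuses the old name, the old node is renamed first.
foundation_replacement_sql() {
  cluster="$1"; old="$2"; new="$3"
  drop_name="$old"
  if [ "$old" = "$new" ]; then
    drop_name="${old}_to_drop"   # hypothetical suffix to avoid a name conflict
    printf 'ALTER NODE "%s" RENAME TO "%s";\n' "$old" "$drop_name"
  fi
  printf '%s\n' "-- Bootstrap \"$new\" before running the remaining statements."
  printf 'ALTER CLUSTER "%s" ADD PARTICIPANTS "%s";\n' "$cluster" "$new"
  printf '%s\n' "-- Restart rolehostd on \"$new\" before removing the failed node."
  printf 'ALTER CLUSTER "%s" DROP PARTICIPANTS "%s";\n' "$cluster" "$drop_name"
  printf 'DROP NODE "%s";\n' "$drop_name"
}

foundation_replacement_sql storage_cluster_1 foundation_original foundation_new
```

Unlike the SQL and Loader procedures, the failed node is removed last here, so the helper tracks the name to drop separately from the name of the new node.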
Step 5: Rebuild Segments.
In most cases, after you remove the original node, follow the Guide to Rebuilding Segments to rebuild the segments that were on the replaced Foundation Node.
Metadata Node Replacement Procedure
A Metadata Node is a node that only performs the Administrator Role.
Step 1: Remove the Failed Administrator Node from the Ocient System.
Delete the node with the DROP NODE command, replacing standaloneAdmin00 with the name of your node:
SQL
DROP NODE "standaloneAdmin00";
If the failed node still exists in the Ocient System, the replacement node will not be added.
Step 2: Register the New Node with the Ocient System.
Follow the Bootstrapping Node Procedure to register the new node.
When you bootstrap the replacement node, you can reuse the original node name or assign a new one. Subsequent steps use the node name of the newly registered node.
Step 3: Add the Administrator Role to the New Node.
To add a role, you must be a system-level user. Run the following command from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name:
SQL
ALTER NODE "<NODE_NAME_OF_THE_NEW_NODE>" ADD ROLE admin;
Step 4: Restart the Node.
Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:
Shell
sudo systemctl restart rolehostd
Bootstrapping Node Procedure
To add a node to the system, follow the Bootstrap the Remaining Nodes process after you meet the prerequisites described in Ocient System Bootstrapping. These steps summarize the bootstrapping process.
Validate all prerequisites before bootstrapping. Ensure that you have set the correct adminUserName and adminPassword in the bootstrap.conf file. Specify any advanced configuration parameters in the bootstrap.conf file as well.
Step 1: Create a /var/opt/ocient/bootstrap.conf file on the node, replacing <FIRST_NODE_ADDRESS> with the IP address or hostname of a node running the Administrator Role and supplying the correct user name and password for your system:
YAML
adminHost: <FIRST_NODE_ADDRESS>
adminUserName: my_admin 
adminPassword: example_password
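Before starting the node, a quick check that the three keys shown above are present can catch an incomplete file. This is a minimal sketch with a hypothetical function name. It only tests that each key has a non-empty value and does not validate the values themselves or any advanced parameters.

```shell
# Check that a bootstrap.conf file sets adminHost, adminUserName, and
# adminPassword to non-empty values. Prints each missing key.
validate_bootstrap_conf() {
  conf="$1"
  status=0
  for key in adminHost adminUserName adminPassword; do
    # Require "key: value" at the start of a line with a non-empty value.
    if ! grep -Eq "^${key}:[[:space:]]*[^[:space:]]" "$conf"; then
      echo "missing or empty: $key"
      status=1
    fi
  done
  return "$status"
}

# Usage on the replacement node:
#   validate_bootstrap_conf /var/opt/ocient/bootstrap.conf
```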
Step 2: Start the new node.
Shell
sudo systemctl start rolehostd
Starting the node accepts it into the system and makes it ready for role assignment.
Related Links
ALTER NODE
DROP NODE
ALTER CLUSTER
System Catalog

Updated 18 Jun 2024
