An Ocient System requires a full node replacement in some scenarios. For example, the OS drive on a node might fail, which requires you to reinstall the node and add it back to the system. Other events, such as a complete node loss due to a hardware failure of the motherboard or damage to NVMe storage drives, can also require you to replace a node. An administrator should be familiar with the different types of nodes described in the Core Elements of an Ocient System. The procedures for the different node types are similar, and the replacement procedures note the differences.
This procedure begins with Prerequisites and then guides a System Administrator through replacing any of the Ocient node types. It also covers the replacement of an OS drive (also called a boot drive) on an Ocient node, because that procedure is identical to the replacement of the entire node. Both require reinstallation of the operating system and restoration of the configuration backup files to reestablish the identity of the Ocient System.
You must work carefully with nodes that also perform the Administrator Role, because you must maintain Administrator consensus.
This process requires System Administrator access to the Ocient System Database and sudo privileges on the node being replaced.

Detection and Alerting

You must monitor all nodes so that the loss of any node triggers a node availability alert in the monitoring system. These tables show the potential consequences of the loss of each type of node.
SQL Node Loss
Node Type: SQL
Impact of Node Loss:
  • Might impact query performance in an Ocient System.
  • If the only remaining SQL Node in a system is lost, you cannot query that system.
  • If the SQL Node performs the Administrator Role, the resiliency of the system to hardware failure could be reduced. You must maintain the number of nodes that perform the Administrator Role at all times to maintain the resiliency of the system.
Foundation Node Loss
Node Type: Foundation
Impact of Node Loss:
  • Can impact query performance or the load of new data.
  • The parity configuration of the storage space in the Ocient System determines whether new data can be stored while a Foundation Node is disabled.
Loader Node Loss
Node Type: Loader
Impact of Node Loss:
  • Can impact the throughput of the data load in an Ocient System.
  • If the remaining Loader Nodes are insufficient to handle the incoming data, a backlog of data can build up.
  • If the only remaining Loader Node in a system is lost, data loading stops.
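The impact rules above can be sketched as a small check that a monitoring script might run over a node inventory. This is a minimal illustration only: the function name `assess_node_loss`, the `(node_type, is_available)` input format, and the message strings are hypothetical and are not part of any Ocient API.

```python
from collections import Counter


def assess_node_loss(nodes):
    """Return warning strings for a list of (node_type, is_available) pairs.

    node_type is one of "sql", "foundation", or "loader" (hypothetical labels).
    """
    available = Counter(node_type for node_type, up in nodes if up)
    lost = Counter(node_type for node_type, up in nodes if not up)
    warnings = []

    if lost["sql"]:
        if available["sql"] == 0:
            warnings.append("No SQL Node available: the system cannot be queried.")
        else:
            warnings.append("SQL Node lost: query performance might be impacted.")

    if lost["foundation"]:
        warnings.append(
            "Foundation Node lost: query performance or data loading can be impacted."
        )

    if lost["loader"]:
        if available["loader"] == 0:
            warnings.append("No Loader Node available: data loading stops.")
        else:
            warnings.append("Loader Node lost: load throughput can be impacted.")

    return warnings
```

For example, passing `[("sql", False), ("sql", True)]` yields only the query performance warning, because one SQL Node is still available.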

Troubleshooting

You must replace a node in the event of a hardware failure of certain components. You can replace an OS drive, data drives, and RAM on most nodes. A node replacement might be necessary in the case of a CPU, motherboard, or Network Interface Card (NIC) replacement. The troubleshooting steps to recover an existing node are specific to the hardware model of the node.
Please contact Ocient Support for troubleshooting issues with a node.

Recovery

Prerequisites

  • The SQL Nodes on the Ocient System are active and running Ocient (except for any SQL Nodes that are being replaced as part of this procedure).
  • The OS has been installed and configured on the replacement node or OS drive for the node to be replaced.
  • The replacement node has been installed with the same version of the Ocient Software as the rest of the Ocient System, as described in Ocient Application Installation.
  • The bootstrap.conf file for the node has been created in the /var/opt/ocient directory on the replacement node. For details, see Ocient System Bootstrapping.
  • For Loader Nodes only:
    • The replacement node has been installed with the same version of the Loading and Transformation software as the rest of the Loader Nodes as described in LAT Packaging and Installation.
    • Copies of the following files, as they existed on the node before the failure, are available. If these files are not available, contact Ocient Support for assistance.
      • /etc/lat/lat.conf
      • /etc/lat/log4j2.xml
      • /opt/lat/.lat-data/*

Replacement Procedure Overview

The replacement procedure is similar for all node types, but there are differences due to the way that different node types coordinate with one another. The general procedure is:
  1. Remove the failed node from the Ocient System.
  2. Register the new node with the system and accept the node.
  3. Assign roles and add to clusters where appropriate.
  4. Start the new node.
If you want to reuse the name of a node, you must rename the failed node and set the IP address. Use these statements to perform these actions.
Rename the node from nodename to newnodename.
SQL
ALTER NODE "nodename" RENAME TO "newnodename";
Set the IP address for the new node.
SQL
ALTER NODE "newnodename" SET ADDRESS 'newnodename';
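The two statements above can be generated from variables when you script the rename. The following is a minimal sketch: the helper names are hypothetical, and the quoting rules (doubling embedded quotation marks) follow the common SQL convention, which is assumed here rather than confirmed for Ocient.

```python
def quote_ident(name):
    """Wrap an identifier in double quotes, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Wrap a string literal in single quotes, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def rename_statements(current_name, new_name, address):
    """Build the RENAME and SET ADDRESS statements in the order shown above."""
    return [
        f"ALTER NODE {quote_ident(current_name)} RENAME TO {quote_ident(new_name)};",
        f"ALTER NODE {quote_ident(new_name)} SET ADDRESS {quote_literal(address)};",
    ]
```

Calling `rename_statements("nodename", "newnodename", "newnodename")` reproduces the two statements from this section. Review the generated text before you run it at the SQL prompt.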

SQL Node Replacement Procedure

Step 1: Check SQL Node for the Administrator Role.

Before you replace a SQL Node, check whether the node performs the Administrator Role in addition to the SQL role. If it does, follow the additional step (Step 5) to add the Administrator Role to the new node.
Most SQL Nodes are configured to also perform the Administrator Role. When you remove a node that performs the Administrator Role, you must replace it with another node that performs the Administrator Role so that you maintain the level of redundancy in the Raft protocol of the administrator.
List the nodes that perform the Administrator Role by using this query.
SQL
SELECT
    n.name AS node_name,
    sr.service_role_type
FROM sys.nodes n
LEFT JOIN sys.service_roles sr ON sr.node_id = n.id
WHERE sr.service_role_type = 'admin';
Text
node_name service_role_type
------------------------------------------------------------------------------------------
sql0 admin
sql1 admin
sql2 admin
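The number of rows in this output indicates how much redundancy the Administrator Role has. As a rough sketch, and assuming the administrator consensus uses a standard Raft majority quorum (the source states only that the RAFT protocol is used), the tolerated number of failed Administrator nodes can be computed as follows. The function name is hypothetical.

```python
def admin_fault_tolerance(admin_count):
    """Return (quorum, tolerated_failures) for a majority-quorum group.

    Assumes a standard Raft majority: more than half of the members must be up.
    """
    if admin_count < 1:
        raise ValueError("admin_count must be at least 1")
    quorum = admin_count // 2 + 1
    return quorum, admin_count - quorum


print(admin_fault_tolerance(3))  # (2, 1): three admins tolerate one failure
print(admin_fault_tolerance(2))  # (2, 0): two admins tolerate no failure
```

Under this assumption, the three-node example above keeps consensus with one Administrator node down, which is why a lost Administrator node should be replaced promptly.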

Step 2: Remove the Failed SQL Node from the Ocient System.

Remove the SQL role from the failed node using the ALTER NODE SQL statement, replacing sql0 with the name of your node.
SQL
ALTER NODE "sql0" REMOVE ROLE sql;
Delete the node using the DROP NODE SQL statement, replacing sql0 with the name of your node.
SQL
DROP NODE "sql0";
If the failed node still exists in the Ocient System, the system does not add the replacement node.

Step 3: Register the New Node with the Ocient System.

Follow the Bootstrapping Node Procedure to register the new node. This procedure uses the rolehostd process to start any new nodes. When bootstrapping the replacement node, you can reuse the original node name or assign a new one. The node name of the newly registered node is used in subsequent steps.

Step 4: Add the SQL Role to the New Node.

To add a role, you must be a system-level user. Execute the following SQL statement from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name.
SQL
ALTER NODE "<NODE_NAME_OF_THE_NEW_NODE>" ADD ROLE sql;

Step 5: Add the Administrator Role. (Only for SQL Nodes with the Administrator Role)

If the node had the Administrator Role in Step 1, then add the Administrator Role back to the node. Execute the following SQL statement from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name.
SQL
ALTER NODE "<NODE_NAME_OF_THE_NEW_NODE>" ADD ROLE admin;

Step 6: Restart the Node.

Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node.
Shell
sudo systemctl restart rolehostd
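The six steps of the SQL Node procedure can be collected into an ordered checklist so that the statements run in the documented order. This is a sketch only: `sql_node_replacement_plan` and the `(kind, text)` tuple format are hypothetical, and the function only builds text; it does not connect to the database or run commands.

```python
def sql_node_replacement_plan(failed_name, new_name, had_admin_role):
    """Return ordered (kind, text) steps for replacing a SQL Node.

    kind is "sql" (run at the SQL prompt), "shell" (run on the replacement
    node), or "manual" (follow the referenced procedure).
    """
    def q(name):
        # Double-quote the identifier, doubling embedded double quotes.
        return '"' + name.replace('"', '""') + '"'

    plan = [
        ("sql", f"ALTER NODE {q(failed_name)} REMOVE ROLE sql;"),
        ("sql", f"DROP NODE {q(failed_name)};"),
        ("manual", "Register the new node (Bootstrapping Node Procedure)."),
        ("sql", f"ALTER NODE {q(new_name)} ADD ROLE sql;"),
    ]
    if had_admin_role:
        # Only for nodes that performed the Administrator Role (Step 5).
        plan.append(("sql", f"ALTER NODE {q(new_name)} ADD ROLE admin;"))
    plan.append(("shell", "sudo systemctl restart rolehostd"))
    return plan


for kind, text in sql_node_replacement_plan("sql0", "sql0", True):
    print(f"[{kind}] {text}")
```

Keeping the plan as data makes it easy to review the statements before you run any of them.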

Loader Node Replacement Procedure

Step 1: Remove the Failed Loader Node from the Ocient System.

Remove the streamloader role from the failed node using the ALTER NODE SQL statement, replacing loader0 with the name of your node.
SQL
ALTER NODE "loader0" REMOVE ROLE streamloader;
Delete the node using the DROP NODE SQL statement, replacing loader0 with the name of your node.
SQL
DROP NODE "loader0";
If the failed node still exists in the Ocient System, the system does not add the replacement node.

Step 2: Register the New Node with the Ocient System.

Follow the Bootstrapping Node Procedure to register the new node. When bootstrapping the replacement node, you can reuse the original node name or assign a new one. The node name of the newly registered node will be used in subsequent steps.

Step 3: Add the streamloader Role to the New Node.

To add a role, you must be a system-level user. Execute the following SQL statement, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name.
SQL
ALTER NODE "<NODE_NAME_OF_THE_NEW_NODE>" ADD ROLE streamloader;
After you add the streamloader role, you must restart the node if it is already running for the change to take effect.

Step 4: Configure the LAT Service.

Stop the LAT service by executing the following command at the shell prompt.
Shell
sudo systemctl stop lat
Restore the LAT server configuration from the backup of the previous configuration. If a backup is unavailable, contact Ocient Support for assistance, or use the LAT configuration of another Loader Node in the cluster as a reference to modify the default files.
  • Copy the lat.conf backup file to /etc/lat/lat.conf.
  • Copy the log4j2.xml backup file to /etc/lat/log4j2.xml.
Ensure that these two files have the appropriate ownership and permissions by executing the following commands.
Shell
sudo chown lat:lat /etc/lat/lat.conf /etc/lat/log4j2.xml
sudo chmod 644 /etc/lat/lat.conf /etc/lat/log4j2.xml
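The copy, ownership, and permission steps above can also be scripted. The following is a minimal sketch that uses only the Python standard library. The function name and the `backup_dir` layout (both files stored in one directory under their original names) are assumptions. Changing ownership to lat:lat requires root privileges, so pass `owner=None` for a dry run in a scratch directory.

```python
import os
import shutil
from pathlib import Path

LAT_CONFIG_FILES = ("lat.conf", "log4j2.xml")


def restore_lat_config(backup_dir, target_dir="/etc/lat", owner="lat", group="lat"):
    """Copy the LAT config backups into target_dir with mode 644.

    Raises FileNotFoundError before copying anything if a backup is missing.
    """
    sources = [Path(backup_dir) / name for name in LAT_CONFIG_FILES]
    for source in sources:
        if not source.is_file():
            raise FileNotFoundError(f"missing backup file: {source}")

    restored = []
    for source in sources:
        destination = Path(target_dir) / source.name
        shutil.copyfile(source, destination)
        os.chmod(destination, 0o644)
        if owner is not None:
            # Equivalent to: chown lat:lat <file>; needs root privileges.
            shutil.chown(destination, user=owner, group=group)
        restored.append(destination)
    return restored
```

Checking for both backup files first avoids leaving /etc/lat with one restored file and one default file.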

Step 5: Restart the Node.

Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node.
Shell
sudo systemctl restart rolehostd
Start the LAT service.
Shell
sudo systemctl start lat

Step 6: Update the LAT Pipeline if Required.

If a continuously streaming pipeline, such as one loading from Kafka, is operating on the Ocient System, you must update the restored LAT to run the same pipeline to restore full loading throughput. The process is different if a long-running file load is in progress.
Kafka Pipeline Procedure
Pipeline Type: Kafka Pipeline
Procedure:
  1. From the LAT Client, create the pipeline on the new LAT Node. This ensures the pipeline_id matches the existing LAT processes.
  2. Start the pipeline on the new LAT Node. The new Loader Node joins the existing Loader Nodes, and the load automatically rebalances across all participating Loaders with the same pipeline_id.
File Load Pipeline Procedure
Pipeline Type: File Load Pipeline
Procedure:
File loads from sources such as S3, a local file system, or NFS require rebalancing of partitions when a node is lost or added back to an Ocient System.
If a node is lost, perform a rebalance and let the file load complete on the remaining nodes. The newly added node is not expected to participate in an already active file load.

Foundation Node Replacement Procedure

Step 1: Register the New Node with the Ocient System.

Follow the Bootstrapping Node Procedure to register the new node. When bootstrapping the new node, you can reuse the original node name or assign a new one. This tutorial uses the node name of the newly registered node in subsequent steps. These examples use foundation_new for the new node name.
If you want to reuse the original node name (the name of the node being removed), you must first rename the original node to prevent a conflict. For example, if the original node is named foundation_original, rename it with this SQL statement.
SQL
ALTER NODE "foundation_original" RENAME TO "foundation_to_drop";

Step 2: Add the New Node to a Specific Storage Cluster.

To operate as a Foundation Node, the node must be added to a storage cluster. To add a node to a storage cluster, you must be a system-level user. Execute the following SQL statement from the SQL prompt by replacing storage_cluster_1 and foundation_new with your storage cluster name and node name, respectively.
SQL
ALTER CLUSTER "storage_cluster_1" ADD PARTICIPANTS "foundation_new";

Step 3: Restart the New Node.

Restart the rolehostd process on the new node by running this command at the shell terminal on the new node.
Shell
sudo systemctl restart rolehostd
The node should now be operational.

Step 4: Remove the Failed Foundation Node from the Ocient System.

Remove the original node from all clusters to which it belongs. For example, if the existing node is foundation_original and your storage cluster name is storage_cluster_1, execute the following SQL statement.
SQL
ALTER CLUSTER "storage_cluster_1" DROP PARTICIPANTS "foundation_original";
Delete the node using the DROP NODE SQL statement, replacing foundation_original with the name of your node.
SQL
DROP NODE "foundation_original";
If you renamed the failed original node in Step 1, be sure to use the correct name when dropping from the storage cluster and dropping the node.
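Unlike the other node types, a Foundation Node replacement adds the new node before it removes the failed one. The ordering of the SQL statements can be sketched as follows. The function name is hypothetical, and the "_to_drop" suffix is an assumption modeled on the foundation_to_drop example in Step 1. The function only builds statement text.

```python
def foundation_replacement_statements(cluster, failed_name, new_name):
    """Return the SQL statements for a Foundation Node replacement, in order.

    Manual steps between statements: bootstrap the new node after any rename,
    and restart rolehostd on the new node after ADD PARTICIPANTS.
    """
    def q(name):
        # Double-quote the identifier, doubling embedded double quotes.
        return '"' + name.replace('"', '""') + '"'

    statements = []
    drop_name = failed_name
    if new_name == failed_name:
        # Reusing the name: park the failed node under a temporary name first.
        drop_name = failed_name + "_to_drop"
        statements.append(f"ALTER NODE {q(failed_name)} RENAME TO {q(drop_name)};")

    statements.append(f"ALTER CLUSTER {q(cluster)} ADD PARTICIPANTS {q(new_name)};")
    statements.append(f"ALTER CLUSTER {q(cluster)} DROP PARTICIPANTS {q(drop_name)};")
    statements.append(f"DROP NODE {q(drop_name)};")
    return statements
```

Tracking `drop_name` separately keeps the final two statements pointed at the renamed node when the original name is reused.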

Step 5: Rebuild Segments.

After you remove the original node, in most cases, follow the Guide to Rebuilding Segments to rebuild the segments that were on the replaced Foundation Node.

Metadata Node Replacement Procedure

A Metadata Node is a node that only performs the Administrator Role.

Step 1: Remove the Failed Metadata Node from the Ocient System.

Delete the node using the DROP NODE SQL statement, replacing standaloneAdmin00 with the name of your node.
SQL
DROP NODE "standaloneAdmin00";
If the failed node still exists in the Ocient System, the system does not add the replacement node.

Step 2: Register the New Node with the Ocient System.

Follow the Bootstrapping Node Procedure to register the new node. When bootstrapping the replacement node, you can reuse the original node name or assign a new one. This tutorial uses the node name of the newly registered node in subsequent steps.

Step 3: Add the Administrator Role to the New Node.

To add a role, you must be a system-level user. Execute the following SQL statement from the SQL prompt, replacing <NODE_NAME_OF_THE_NEW_NODE> with the new node name.
SQL
ALTER NODE "<NODE_NAME_OF_THE_NEW_NODE>" ADD ROLE admin;

Step 4: Restart the Node.

Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node.
Shell
sudo systemctl restart rolehostd

Bootstrapping Node Procedure

To add a node to the system, follow these steps after meeting the prerequisites described in Ocient System Bootstrapping. These steps summarize the bootstrapping process.
Validate all prerequisites before bootstrapping. Ensure that you have set the correct adminUserName and adminPassword in the bootstrap.conf file. Specify any advanced configuration parameters in the bootstrap.conf file as well.
Step 1: Create the /var/opt/ocient/bootstrap.conf file on the node. Replace <FIRST_NODE_ADDRESS> with the IP address or hostname of a node running the Administrator Role, and set the correct user name and password for your system.
YAML
adminHost: <FIRST_NODE_ADDRESS>
adminUserName: my_admin
adminPassword: example_password
Step 2: Start the new node.
Shell
sudo systemctl start rolehostd
This action accepts the node into the system and prepares it for role assignment.
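If you prepare several replacement nodes, you can render the bootstrap.conf content from variables instead of editing it by hand. The following is a minimal sketch: the function name is hypothetical, only the three keys shown in Step 1 are written, and values are emitted without quoting, so values with special characters might need manual quoting. The address 10.0.0.1 in the usage example is a placeholder.

```python
def render_bootstrap_conf(admin_host, admin_user, admin_password):
    """Return bootstrap.conf text with the three keys shown in Step 1."""
    values = {
        "adminHost": admin_host,
        "adminUserName": admin_user,
        "adminPassword": admin_password,
    }
    for key, value in values.items():
        # Reject empty or multi-line values, which would corrupt the file.
        if not value or "\n" in value:
            raise ValueError(f"invalid value for {key}")
    return "".join(f"{key}: {value}\n" for key, value in values.items())


# Placeholder values; write the result to /var/opt/ocient/bootstrap.conf.
print(render_bootstrap_conf("10.0.0.1", "my_admin", "example_password"), end="")
```

Because the file contains a password, restrict who can read it according to your own security policy.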
Last modified on May 27, 2026