# Replace Nodes
An {{ocient}} system requires a full node replacement in some scenarios. For example, the OS drive on a node might fail and necessitate a reinstallation and re-addition of the node. Other events, such as a complete node loss due to a hardware failure of the motherboard or damage to NVMe storage drives, can also require you to replace a node.

An administrator should be familiar with the different types of nodes described in Core Elements of an Ocient System docid\ adp4amtxi3djdrsq2khdb. There are many similarities between the procedures for different node types, and the differences are clearly noted in the replacement procedures.

This procedure begins with the prerequisites, then guides a system administrator through replacing any of the Ocient node types. This procedure also includes instructions for replacing an OS drive (also called a boot drive) on an Ocient node, because that procedure is identical to replacing the entire node: both require reinstallation of the operating system and restoration of the configuration backup files to reestablish the identity of the Ocient system.

Work carefully with nodes that also perform the Administrator role, because you must maintain Administrator consensus. Execution of this process requires system administrator access to the Ocient system database and `sudo` privilege on the node being replaced.

## Detection and Alerting

You must monitor all nodes so that the loss of a node on the system triggers a node availability alert in the monitoring system. These tables show the potential consequences of the loss of each type of node.

### SQL Node Loss

| Node Type | Impact of Node Loss |
| --- | --- |
| SQL | Loss of a SQL node might impact query performance in an Ocient system. If the only remaining SQL node in a system is lost, you cannot query that system. If the SQL node performs the Administrator role, the resiliency of the system to hardware failure could be reduced. You must maintain the number of nodes that perform the Administrator role at all times to maintain the resiliency of the system. |

### Foundation Node Loss

*(impact table)*

### Loader Node Loss

*(impact table)*

## Troubleshooting

You must replace a node in the event of a hardware failure of certain components. You can replace an OS drive, data drives, and RAM on most nodes. A full node replacement might be necessary in the case of a CPU, motherboard, or network interface card (NIC) failure. The troubleshooting steps to recover an existing node are specific to the hardware model of the node. Contact Ocient Support for help with troubleshooting a node recovery.

## Prerequisites

- The SQL nodes on the Ocient system are active and running Ocient (except for any SQL nodes that are being replaced as part of this procedure).
- The OS has been installed and configured on the replacement node, or on the replacement OS drive for the node being repaired.
- The replacement node has been installed with the same version of the Ocient software as the rest of the Ocient system, according to Ocient Application Installation docid\ l4to0wifbytuosh5nscob.
- The bootstrap.conf file for the node has been created in the `/var/opt/ocient` directory on the replacement node. For details, see Ocient System Bootstrapping docid 4005nflvguw4fqfqa1spu.
- For Loader Nodes only:
  - The replacement node has been installed with the same version of the loading and transformation software as the rest of the Loader Nodes, as described in LAT Packaging and Installation docid 4gawr9v 2cqsdff9an6t.
  - The following files are available from the node before the failure. If these files are not available, contact Ocient Support
for assistance:

    - `/etc/lat/lat.conf`
    - `/etc/lat/log4j2.xml`
    - `/opt/lat/lat_data/`

## Replacement Procedure Overview

The replacement procedure is similar for all node types, but there are differences due to the way that different node types coordinate with one another. The general procedure is:

1. Remove the failed node from the Ocient system.
2. Register the new node with the system and accept the node.
3. Assign roles and add the node to clusters where appropriate.
4. Start the new node.

If you want to reuse the name of a node, you must rename the failed node and set the IP address. To perform these actions, use these statements.

Rename the node from nodeName to newNodeName:

```sql
ALTER NODE "nodeName" RENAME TO "newNodeName";
```

Set the IP address for the new node:

```sql
ALTER NODE "newNodeName" SET ADDRESS 'newNodeName';
```

## SQL Node Replacement Procedure

### Step 1: Check the SQL Node for the Administrator Role

Before you replace a SQL node, check whether the node performs the Administrator role in addition to the SQL role. If the node performs the Administrator role, follow the additional step below to add this role to the new node.

Most SQL nodes are configured to also perform the Administrator role. When you remove a node that performs the Administrator role, you must replace it with another node that performs the Administrator role so that you maintain the level of redundancy in the Raft protocol of the Administrator.

Check the roles by using this query:

```sql
SELECT n.name AS node_name, sr.service_role_type
FROM sys.nodes n
LEFT JOIN sys.service_roles sr ON sr.node_id = n.id
WHERE sr.service_role_type = 'admin';
```

```
node_name | service_role_type
----------+-------------------
sql0      | admin
sql1      | admin
sql2      | admin
```

### Step 2: Remove the Failed SQL Node from the Ocient System

Remove the node role using the `ALTER NODE` SQL statement, replacing `sql0` with the name of your node:

```sql
ALTER NODE "sql0" REMOVE ROLE SQL;
```

Delete the node using the `DROP NODE` SQL statement, replacing `sql0` with the name of your node:

```sql
DROP NODE "sql0";
```

If the failed node still exists in the Ocient system, the system does not add the replacement node.
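Before removing a node that performs the Administrator role, it can be useful to sanity-check the output of the Step 1 query. The sketch below is a hypothetical helper, not part of Ocient: it counts the ADMIN rows piped into it and flags an even count, since keeping an odd number of consensus members is a general Raft guideline rather than a requirement stated in this guide.

```shell
# Hypothetical sanity check for the Step 1 query output: counts lines on
# stdin that contain an admin role and warns when the count is even. The
# odd-member guideline is a general Raft property, not an Ocient-specific
# rule from this guide.
admin_count_check() {
  n=$(grep -ci 'admin' || true)
  if [ "$((n % 2))" -eq 1 ]; then
    echo "admin nodes: $n (odd, ok)"
  else
    echo "admin nodes: $n (even, review consensus membership)"
  fi
}
```

For example, piping the three result rows shown above into the helper reports an odd, healthy membership count.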
### Step 3: Register the New Node with the Ocient System

Follow the Bootstrapping Node Procedure in this guide to register the new node. This procedure uses the rolehostd process to start any new nodes. When bootstrapping the replacement node, you can reuse the original node name or assign a new one. The node name of the newly registered node is used in subsequent steps.

### Step 4: Add the SQL Role to the New Node

To add a role, you must be a system-level user. Execute the following SQL statement from the SQL prompt, replacing `<node name of the new node>` with the new node name:

```sql
ALTER NODE "<node name of the new node>" ADD ROLE SQL;
```

### Step 5: Add the Administrator Role (Only for SQL Nodes with the Administrator Role)

If the node had the Administrator role in Step 1, add the Administrator role back to the node. Execute the following SQL statement from the SQL prompt, replacing `<node name of the new node>` with the new node name:

```sql
ALTER NODE "<node name of the new node>" ADD ROLE ADMIN;
```

### Step 6: Restart the Node

Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:

```shell
sudo systemctl restart rolehostd
```

## Loader Node Replacement Procedure

### Step 1: Remove the Failed Loader Node from the Ocient System

Remove the node role using the `ALTER NODE` SQL statement, replacing `loader0` with the name of your node:

```sql
ALTER NODE "loader0" REMOVE ROLE STREAMLOADER;
```

Delete the node using the `DROP NODE` SQL statement, replacing `loader0` with the name of your node:

```sql
DROP NODE "loader0";
```

If the failed node still exists in the Ocient system, the system does not add the replacement node.

### Step 2: Register the New Node with the Ocient System

Follow the Bootstrapping Node Procedure in this guide to register the new node. When bootstrapping the replacement node, you can reuse the original node name or assign a new one. The node name of the newly registered node is used in subsequent steps.
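When a replacement reuses a failed node's name, the Replacement Procedure Overview prescribes a rename plus an address update. As a sketch, the two statements can be printed for review before being pasted into a SQL prompt. This is a hypothetical helper; the node names are placeholders, and the statement forms copy the examples in this guide.

```shell
# Hypothetical helper: prints the rename and address statements from the
# Replacement Procedure Overview for a given failed node, so they can be
# reviewed before execution. Identifier quoting follows this guide's examples.
rename_statements() {  # usage: rename_statements <failed-node-name> <new-name-for-failed-node>
  printf 'ALTER NODE "%s" RENAME TO "%s";\n' "$1" "$2"
  printf 'ALTER NODE "%s" SET ADDRESS '\''%s'\'';\n' "$2" "$2"
}
```

For example, `rename_statements sql0 sql0_failed` prints the statements that free the name `sql0` for the replacement node.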
steps.

### Step 3: Add the Streamloader Role to the New Node

To add a role, you must be a system-level user. Execute the following SQL statement, replacing `<node name of the new node>` with the new node name:

```sql
ALTER NODE "<node name of the new node>" ADD ROLE STREAMLOADER;
```

When you add the streamloader role to a node that is already running, you must restart the node for the change to take effect.

### Step 4: Configure the LAT Service

Stop the LAT service by executing the following command at the shell prompt:

```shell
sudo systemctl stop lat
```

Restore the LAT server configuration from the previous configuration. If a backup is unavailable, contact Ocient Support for assistance, or reference the LAT configuration on another node in the cluster to modify the default files.

1. Copy the lat.conf backup file to `/etc/lat/lat.conf`.
2. Copy the log4j2.xml backup file to `/etc/lat/log4j2.xml`.
3. Ensure that these two files have the appropriate ownership and permissions by executing the following commands:

```shell
chown lat:lat /etc/lat/lat.conf /etc/lat/log4j2.xml
chmod 644 /etc/lat/lat.conf /etc/lat/log4j2.xml
```

### Step 5: Restart the Node

Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:

```shell
sudo systemctl restart rolehostd
```

Start the LAT service:

```shell
sudo systemctl start lat
```

### Step 6: Update the LAT Pipeline If Required

If a continuously streaming pipeline, such as loading from {{kafka}}, is operating on the Ocient system, you must update the restored LAT to run the same pipeline to restore full loading throughput. This process is different if a long-running file load is in progress.

**Kafka Pipeline Procedure**

*(procedure table)*

**File Load Pipeline**

*(procedure table)*
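After restoring the LAT configuration, the ownership and mode set in Step 4 can be spot-checked before restarting. A minimal sketch, assuming GNU `stat` on Linux (BSD `stat` uses `-f` instead of `-c`); it only prints, so compare the output against the expected `lat:lat` and `644`. The helper name is hypothetical.

```shell
# Hypothetical verification helper for the ownership and mode set in Step 4.
# Prints "<file> <owner>:<group> <mode>" for each argument using GNU stat.
check_lat_perms() {  # usage: check_lat_perms <file>...
  for f in "$@"; do
    printf '%s %s %s\n' "$f" "$(stat -c '%U:%G' "$f")" "$(stat -c '%a' "$f")"
  done
}
# On a node you would run:
#   check_lat_perms /etc/lat/lat.conf /etc/lat/log4j2.xml
# and expect each line to end with "lat:lat 644".
```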
## Foundation Node Replacement Procedure

### Step 1: Register the New Node with the Ocient System

Follow the Bootstrapping Node Procedure in this guide to register the new node. When bootstrapping the new node, you can reuse the original node name or assign a new one. This tutorial uses the node name of the newly registered node in subsequent steps. These examples use `foundation new` for the new node name.

If you want to reuse the original node name (the name of the node being removed), you must first rename the original node to prevent a conflict. For example, if the original node was named `foundation original`, rename it with this SQL statement:

```sql
ALTER NODE "foundation original" RENAME TO "foundation to drop";
```

### Step 2: Add the New Node to a Specific Storage Cluster

To operate as a Foundation node, the node must be added to a storage cluster. To add a node to a storage cluster, you must be a system-level user. Execute the following SQL statement from the SQL prompt, replacing `storage cluster 1` and `foundation new` with your storage cluster name and node name, respectively:

```sql
ALTER CLUSTER "storage cluster 1" ADD PARTICIPANTS "foundation new";
```

### Step 3: Restart the New Node

Restart the rolehostd process on the new node by running this command at the shell terminal on the new node:

```shell
sudo systemctl restart rolehostd
```

The node should now be operational.

### Step 4: Remove the Failed Foundation Node from the Ocient System

Remove the original node from all clusters to which it belongs. For example, if the existing node is `foundation original` and your storage cluster name is `storage cluster 1`, execute the following SQL statement:

```sql
ALTER CLUSTER "storage cluster 1" DROP PARTICIPANTS "foundation original";
```

Delete the node using the `DROP NODE` SQL statement, replacing `foundation original` with the name of your node:

```sql
DROP NODE "foundation original";
```

If you renamed the failed original node in Step 1, be sure to use the correct name when dropping it from the storage cluster and when dropping the node.
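The restart steps in these procedures only issue `systemctl restart`; they do not wait for the service to come back. A hedged sketch of a readiness loop follows: the probe command is a parameter, and on a real node it might be `systemctl is-active --quiet rolehostd` (verify the unit name on your system). The helper itself is hypothetical, not part of Ocient.

```shell
# Hypothetical readiness loop: polls a probe command once per second until it
# succeeds or the retry budget is spent. The probe is passed as arguments so
# the loop works with any status command, e.g.:
#   wait_until 30 systemctl is-active --quiet rolehostd
wait_until() {  # usage: wait_until <max-tries> <probe-command...>
  tries=$1; shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@"; then
      echo "ready after $i retries"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out after $tries tries"
  return 1
}
```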
### Step 5: Rebuild Segments

In most cases, after removing the original node, administrators follow the guide to rebuilding segments docid 12toghwotdgw2 1td9g3m to rebuild the segments that were on the replaced Foundation node.

## Metadata Node Replacement Procedure

A metadata node is a node that performs only the Administrator role.

### Step 1: Remove the Failed Administrator Node from the Ocient System

Delete the node using the `DROP NODE` SQL statement, replacing `standaloneadmin00` with the name of your node:

```sql
DROP NODE "standaloneadmin00";
```

If the failed node still exists in the Ocient system, the system does not add the replacement node.

### Step 2: Register the New Node with the Ocient System

Follow the Bootstrapping Node Procedure in this guide to register the new node. When bootstrapping the replacement node, you can reuse the original node name or assign a new one. This tutorial uses the node name of the newly registered node in subsequent steps.

### Step 3: Add the Administrator Role to the New Node

To add a role, you must be a system-level user. Execute the following SQL statement from the SQL prompt, replacing `<node name of the new node>` with the new node name:

```sql
ALTER NODE "<node name of the new node>" ADD ROLE ADMIN;
```

### Step 4: Restart the Node

Restart the rolehostd process on the replacement node by running this command at the shell terminal on the replacement node:

```shell
sudo systemctl restart rolehostd
```

## Bootstrapping Node Procedure

To add a node to the system, follow these steps after meeting the prerequisites described in Ocient System Bootstrapping docid 4005nflvguw4fqfqa1spu. These steps summarize the bootstrapping process, so validate all prerequisites before bootstrapping. Ensure that you have set the correct adminusername and adminpassword in the bootstrap.conf file. Specify any advanced configuration parameters in the bootstrap.conf file as well.

### Step 1: Create the bootstrap.conf File

Create a `/var/opt/ocient/bootstrap.conf` file on the node, replacing `<first node address>` with the IP address or hostname of a
node running the Administrator role, and supplying the correct user name and password for your system:

```
adminhost <first node address>
adminusername my admin
adminpassword example password
```

### Step 2: Start the New Node

```shell
sudo systemctl start rolehostd
```

This action accepts the node into the system and prepares it for role assignment.

## Related Links

- Cluster and Node Management docid\ xga0pas8wadtq33 a x7v
- System Catalog
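For completeness, the bootstrap.conf layout from the Bootstrapping Node Procedure can be scripted when preparing several replacement nodes. A minimal sketch: the key names copy the example above, the values are placeholders, and the helper name is hypothetical.

```shell
# Hypothetical generator for the bootstrap.conf shown in the Bootstrapping
# Node Procedure. Key names mirror this guide's example; verify them against
# your installation before use.
write_bootstrap_conf() {  # usage: write_bootstrap_conf <path> <admin-host> <user> <password>
  printf 'adminhost %s\nadminusername %s\nadminpassword %s\n' "$2" "$3" "$4" > "$1"
}
```

For example, `write_bootstrap_conf /var/opt/ocient/bootstrap.conf sql0.example.com my_admin 'example password'` writes the three-line file shown above.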