Startup and Shutdown Procedure
An Ocient System operates as a distributed system, so shutting it down properly and preventing sudden power loss to nodes is important for maintaining data integrity.
Execution of this process requires System Administrator access to the database and sudo privilege on the node being started or shut down. This procedure is not applicable to fully managed Ocient Systems.
The startup procedure can be used for starting the rolehostd process on a running node as well as powering on nodes in an Ocient System. In most installations, the rolehostd process is configured to automatically start on system boot up.
The rolehostd process is the main executable for an Ocient System. During startup of the rolehostd process, every non-Administrator Node reaches out to one of the Administrator Nodes. The Administrator Nodes require a "quorum" for the Ocient System to start up and operate. To establish a quorum, a majority of Administrator Nodes must be actively running the rolehostd process. For example, when an Ocient System has three Administrator Nodes, at least two must be active to establish quorum and service requests from other nodes.
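The majority rule above can be expressed as a small calculation. The following is a minimal sketch, not part of the Ocient tooling; the function names quorum_size and has_quorum are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Smallest number of Administrator Nodes that forms a majority (quorum).
quorum_size() {
  local admin_nodes=$1
  echo $(( admin_nodes / 2 + 1 ))
}

# Succeeds (exit status 0) when enough Administrator Nodes are active.
# $1 = total Administrator Nodes, $2 = Administrator Nodes running rolehostd.
has_quorum() {
  local admin_nodes=$1 active_nodes=$2
  [ "$active_nodes" -ge "$(quorum_size "$admin_nodes")" ]
}
```

For three Administrator Nodes, quorum_size 3 prints 2, which matches the example above: the system keeps quorum with one Administrator Node offline, but not with two.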
Queries will not complete until Foundation Nodes are online.
For the smoothest operation, start the nodes in an Ocient System in the following sequence:
- Start the rolehostd process on all dedicated Administrator Nodes and SQL Nodes using the systemctl start rolehostd command.
- Start the rolehostd process on all Foundation Nodes using the systemctl start rolehostd command.
- Start the rolehostd process on all Loader Nodes using the systemctl start rolehostd command.
- After the rolehostd process has finished starting on all Loader Nodes, start the Loading and Transformation service using the systemctl start lat command.
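The startup sequence above can be sketched as a script that prints the commands in order so that an administrator can review them before running anything. The host names and the use of ssh with sudo are assumptions made for illustration; substitute the node names and remote access method of your own installation.

```shell
#!/usr/bin/env bash
# Hypothetical host lists; replace with the nodes in your own system.
ADMIN_SQL_NODES="admin1 admin2 admin3 sql1"
FOUNDATION_NODES="foundation1 foundation2"
LOADER_NODES="loader1"

# Print one start command per host for the given systemd unit.
emit_start() {
  local unit=$1; shift
  local host
  for host in "$@"; do
    echo "ssh $host sudo systemctl start $unit"
  done
}

# Print the startup commands in the documented order without running them.
# Word splitting on the unquoted host lists is intentional.
startup_plan() {
  emit_start rolehostd $ADMIN_SQL_NODES
  emit_start rolehostd $FOUNDATION_NODES
  emit_start rolehostd $LOADER_NODES
  emit_start lat $LOADER_NODES
}
```

Calling startup_plan only prints the commands. The sketch does not check that rolehostd has finished starting on the Loader Nodes before the lat commands, so confirm that manually before running the final group.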
The shutdown procedure can be used for stopping the rolehostd process as well as powering off nodes in the Ocient system.
Properly sequencing the shutdown of nodes in an Ocient system ensures a smooth power down process and reduces the appearance of errors to users in loading and querying as the system goes offline. The rolehostd process must be running on nodes with the Administrator Role for Global Dictionary Compression (GDC) lookup to work during loading and querying of data. If these nodes are shut down first, this will cause GDC lookups to fail, resulting in failures for both loading and querying of data.
When powering on nodes, stagger them by a few seconds to reduce the power surge on the power distribution circuits.
When you execute systemctl kill -s SIGKILL rolehostd, the SIGKILL signal is sent to the rolehostd process. The node stops immediately without waiting for active queries or loading on the node to finish, so users might see errors. To avoid these errors, use the quiescing process instead.
The Ocient System should be stopped in the following sequence:
- Stop the Loading and Transformation service on all Loader Nodes using the systemctl stop lat command.
- Stop the rolehostd process on all Loader Nodes using the systemctl kill -s SIGKILL rolehostd command.
- Stop the rolehostd process on all Foundation Nodes using the systemctl kill -s SIGKILL rolehostd command. Any queries issued after this point will fail.
- Stop the rolehostd process on all SQL Nodes using the systemctl kill -s SIGKILL rolehostd command.
- Stop the rolehostd process on all Administrator Nodes using the systemctl kill -s SIGKILL rolehostd command.
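The shutdown sequence above can be sketched in the same spirit: a script that prints the commands in order for review rather than running them. The host names and the ssh with sudo access pattern are assumptions; replace them with the values for your own installation.

```shell
#!/usr/bin/env bash
# Hypothetical host lists; replace with the nodes in your own system.
LOADER_NODES="loader1"
FOUNDATION_NODES="foundation1 foundation2"
SQL_NODES="sql1"
ADMIN_NODES="admin1 admin2 admin3"

# Print the shutdown commands in the documented order without running them.
# Word splitting on the unquoted host lists is intentional.
shutdown_plan() {
  local host
  # Stop the Loading and Transformation service first.
  for host in $LOADER_NODES; do
    echo "ssh $host sudo systemctl stop lat"
  done
  # Then stop rolehostd on Loader, Foundation, SQL, and Administrator Nodes.
  # Administrator Nodes go last so that GDC lookups work as long as possible.
  for host in $LOADER_NODES $FOUNDATION_NODES $SQL_NODES $ADMIN_NODES; do
    echo "ssh $host sudo systemctl kill -s SIGKILL rolehostd"
  done
}
```

Calling shutdown_plan prints one command per line. Keeping the Administrator Nodes in the final group reflects the ordering requirement described earlier in this section.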
Quiescing a node gracefully stops the rolehostd process without interrupting loading or running queries. This process is most useful for maintenance purposes or for node upgrades. Quiescing refers to the graceful shutdown of a node, whereas quiesced refers to a node that has completed quiescing and is now fully shut down.
After shutdown, a quiesced node stops interacting with the rest of the system. Only quiesce one node at a time. When you quiesce multiple nodes at the same time, the quiesce process can freeze, or queries and loading can stop making forward progress. At startup, the node automatically rejoins the system and participates normally.
To issue a quiesce, execute systemctl stop rolehostd. This command sends the SIGTERM signal to the rolehostd process. The node begins the quiesce process, waits for all relevant queries or loading to finish, and shuts down. If a long-running query causes the quiesce to freeze, kill the query to finish the quiesce process.
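Because a quiesce can take a while, it can help to issue the stop without blocking the terminal and then poll the unit state. The following is a minimal sketch under stated assumptions: the function names and the UNIT_STATE_CMD override are hypothetical, and the sketch assumes that systemctl is-active prints inactive or failed once rolehostd has exited.

```shell
#!/usr/bin/env bash
# Print the current unit state. Set UNIT_STATE_CMD to try this without systemd.
unit_state() {
  ${UNIT_STATE_CMD:-systemctl is-active rolehostd}
}

# Poll until rolehostd is no longer running or the attempts run out.
# $1 = number of attempts (default 60), $2 = seconds between attempts (default 30).
# Returns 0 when the node has quiesced, 1 when it is still shutting down.
wait_for_quiesce() {
  local attempts=${1:-60} interval=${2:-30} state
  while [ "$attempts" -gt 0 ]; do
    # is-active exits nonzero for any state other than active, so ignore its status.
    state=$(unit_state) || true
    case $state in
      inactive|failed) return 0 ;;
    esac
    attempts=$(( attempts - 1 ))
    sleep "$interval"
  done
  return 1
}

# Example (not run here):
#   sudo systemctl stop --no-block rolehostd && wait_for_quiesce
```

The defaults of 60 attempts at 30-second intervals add up to 30 minutes. If wait_for_quiesce returns 1, check for a long-running query that is holding up the quiesce.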
During the quiesce process, a SQL Node continues to accept new connections from clients, such as JDBC or pyocient. However, when these clients run queries, explains, or DDL updates, the Ocient System redirects the clients to another online SQL node, if one exists. A user might connect or run a query with the force option, which overrides the redirect behavior and forces the command to run on the SQL Node in the quiesce process. In this case, the Ocient System does not guarantee the successful completion of the query.
A Foundation Node in the quiesce process waits for any segment rebuilds that are in progress to finish. If a rebuild is interrupted or fails, the quiesce process finishes shortly afterward.
When you initiate the quiesce process on a Loader Node that is in the middle of loading, also stop the Loading and Transformation (LAT) service on the same node.
Quiesce has a default time limit of 30 minutes, controlled by the systemd service. If a node is still quiescing after 30 minutes, the systemd service sends the SIGKILL signal to rolehostd and forcefully kills the process. To change this timeout, edit the TimeoutStopSec parameter in the rolehostd systemd service file.
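As an alternative to editing the service file directly, systemd also supports drop-in override files, which keep local changes separate from the installed unit. The sketch below takes that drop-in approach, which is a swapped-in technique rather than the procedure described above. The function name is hypothetical, and the drop-in path in the comments assumes the unit is installed as rolehostd.service.

```shell
#!/usr/bin/env bash
# Print a systemd drop-in that sets the stop timeout (the quiesce time limit).
# $1 = any systemd time span, for example 45min, or a plain number of seconds.
quiesce_timeout_dropin() {
  local timeout=$1
  printf '[Service]\nTimeoutStopSec=%s\n' "$timeout"
}

# Example (not run here; the path assumes a unit named rolehostd.service):
#   sudo mkdir -p /etc/systemd/system/rolehostd.service.d
#   quiesce_timeout_dropin 45min | sudo tee /etc/systemd/system/rolehostd.service.d/override.conf
#   sudo systemctl daemon-reload
#   systemctl show rolehostd -p TimeoutStopUSec
```

The final systemctl show command reports the effective stop timeout, which is a quick way to confirm that the new value was picked up after the reload.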