Manage Distributed Tasks
Distributed tasks are processes that run in the background of an Ocient System to keep the system running smoothly and efficiently. The task infrastructure persists task information in the global metadata storage cluster of the Ocient System, allowing tasks to be tracked and retried until completion. For details about storage and related concepts, see Core Elements of an Ocient System.
The distributed task infrastructure is completely generic. The system instantiates tasks from factories using generic arguments and establishes communication using interfaces that abstract the business logic of whatever protocol performs the actual work. Tasks must be idempotent: the infrastructure can duplicate a task or restart it on a new node at any time without changing the outcome. Tasks can also be structured recursively, with a parent task comprising several subtasks, each of which can have subtasks of its own, and so on.
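The generic factory pattern described above can be sketched as follows. This is an illustrative model, not Ocient's actual API: the registry name, the `Task` class, and the example factory names are all assumptions.

```python
# Illustrative sketch of a generic task factory registry: the infrastructure
# instantiates tasks by factory name and serialized arguments, knowing
# nothing about the business logic, and tasks can nest recursively.
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Task:
    name: str
    args: str                           # serialized arguments, opaque to the infrastructure
    subtasks: List["Task"] = field(default_factory=list)

# Factories map a registered name to a constructor taking serialized arguments.
FACTORIES: Dict[str, Callable[[str], Task]] = {}

def register_factory(name: str, ctor: Callable[[str], Task]) -> None:
    FACTORIES[name] = ctor

def instantiate(name: str, args: str) -> Task:
    # The infrastructure only looks up the factory and passes arguments through.
    return FACTORIES[name](args)

register_factory("rebuild_segments", lambda args: Task("rebuild_segments", args))
parent = instantiate("rebuild_segments", '{"segment_group": 7}')
parent.subtasks.append(Task("rebuild_one_segment", '{"segment": 12}'))
```

Because the factory receives only a name and an opaque argument string, the infrastructure can persist, duplicate, and restart any task type without understanding its contents.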
Distributed tasks provide persistence across outages. To do this, the task infrastructure detects when the node running a task, called the task owner, has become unavailable and moves the task elsewhere. Node unavailability is managed by the administrator owner, a node in the administrator cluster responsible for monitoring the status of a task and reassigning the task owner when needed. This pairing of administrator and task owners provides load balancing across the administration cluster because tasks do not all share a single administrator owner.
The Ocient System has different task types, such as:
- Data pipeline loading
- Rebuild segments
- Check for disk corruption by verifying segments
- Rebalance data across storage clusters and nodes
For data pipeline loading, see Load Data.
For details about rebuilding segments, checking the disk, and rebalancing data, see Distributed Tasks.
The characteristics of a task are a factory, serialized arguments, a location type, and a location identifier:
- Factory — A task class implementation that the Ocient System instantiates generically and that executes the actual work of the task.
- Arguments — The inputs to a task, which the Ocient System stores generically as a string. These inputs can contain anything the task needs, such as flags that modify execution modes or identifiers of the entities to operate on.
- Location type — The nature of the set of nodes where the task can run. This location can be any node in the system, any node in a specific cluster, or a specific node.
- Location identifier — The specific location entity when the task location type is a cluster or node.
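The four characteristics above could be modeled with a simple descriptor. This is a hypothetical sketch; the type and field names are illustrative, not Ocient's internal representation.

```python
# Hypothetical model of the four task characteristics: factory, serialized
# arguments, location type, and location identifier.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class LocationType(Enum):
    ANY_NODE = "any_node"   # any node in the system
    CLUSTER = "cluster"     # any node in a specific cluster
    NODE = "node"           # one specific node

@dataclass(frozen=True)
class TaskDescriptor:
    factory: str                        # registered factory name to instantiate
    arguments: str                      # serialized inputs, stored as a string
    location_type: LocationType
    location_id: Optional[str] = None   # required for CLUSTER or NODE

    def __post_init__(self):
        needs_id = self.location_type in (LocationType.CLUSTER, LocationType.NODE)
        if needs_id and self.location_id is None:
            raise ValueError("location identifier required for cluster or node scope")

task = TaskDescriptor("verify_segments", '{"storage_cluster": "sc1"}',
                      LocationType.CLUSTER, "sc1")
```

Note that the location identifier is meaningful only when the location type narrows the task to a cluster or a node; a system-wide task needs no identifier.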
Tasks have optional pre- and post-condition checks and must implement an asynchronous execute method. You can also override cancel and status poll methods.
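The interface implied above, with a required asynchronous execute method, optional condition checks, and overridable cancel and status-poll hooks, might look like this. Method names are assumptions for illustration.

```python
# Minimal sketch of a task interface: required async execute, optional
# pre-/post-condition checks, and overridable cancel and status-poll hooks.
import asyncio
from abc import ABC, abstractmethod

class DistributedTask(ABC):
    def pre_check(self) -> bool:        # optional; default passes
        return True

    def post_check(self) -> bool:       # optional; default passes
        return True

    @abstractmethod
    async def execute(self) -> None:    # required; performs the actual work
        ...

    def cancel(self) -> None:           # overridable; default does nothing
        pass

    def poll_status(self) -> str:       # overridable; default reports running
        return "RUNNING"

class NoOpTask(DistributedTask):
    async def execute(self) -> None:
        await asyncio.sleep(0)          # trivial unit of work

asyncio.run(NoOpTask().execute())
```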
Execute tasks using a pair of nodes: administrator and task owners. The administrator owner must be an active node in the metadata storage cluster. This node is responsible for starting and monitoring the task. The task owner can be any node that meets the location constraints specified by the task creator and executes the actual task. With the exception of subtasks, the Ocient System assigns both owners upon task creation, chosen randomly to balance the load.
To see the disposition of tasks in the system, query the sys.tasks system catalog table. For details, see sys.tasks.
The administrator owner manages and monitors all steps in the life cycle of its own tasks. After the Ocient System adds a task to the cluster state and assigns an administrator owner, that owner sends a message to the task owner to submit or poll the current task status. After the task has started successfully, the administrator owner periodically polls the task owner for the task status, updating it in the cluster state as needed.
When a task completes, the administrator owner sets its state to a terminal status and no longer polls or interacts with the task. The Ocient System retains historical tasks in the cluster state until a configurable limit is reached to prevent the state from growing unboundedly.
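The bounded retention of historical tasks could work like a simple FIFO trim. The limit value and data structure here are illustrative assumptions; in the Ocient System the limit is configurable.

```python
# Sketch of bounded retention: completed tasks are kept in the cluster
# state until a configurable limit is exceeded, then the oldest are dropped.
from collections import deque

HISTORY_LIMIT = 3       # illustrative; the real limit is configurable

completed: deque = deque()

def record_completed(task_id: str) -> None:
    completed.append(task_id)
    while len(completed) > HISTORY_LIMIT:
        completed.popleft()             # drop the oldest terminal task

for i in range(5):
    record_completed(f"task-{i}")
# Only the three most recent terminal tasks remain.
```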
The task owner maintains an in-memory map of running tasks, keyed by the identifier in the health protocol. The Ocient System polls tasks for execution status as needed. When a task owner receives a poll task request from an administrator owner, it checks for the existence of the task, starting it if necessary, and returns the current status.
The Ocient System starts a task by instantiating a task object using the factory type and then executing its pre-condition check, method, and post-condition check. If at any point the system encounters a failure, the system stops execution.
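The start sequence above can be sketched as a short pipeline. This is a simplified synchronous sketch under assumed method names; it shows only the control flow (stop at the first failure), not the real execution machinery.

```python
# Sketch of the task start sequence: instantiate from the factory, then
# run pre-condition check, execute method, and post-condition check,
# stopping execution at the first failure.
def run_task(factory, args):
    task = factory(args)                    # instantiate via the factory type
    if not task.pre_check():
        return "FAILED_PRECONDITION"
    try:
        task.execute()
    except Exception:
        return "FAILED_EXECUTION"
    if not task.post_check():
        return "FAILED_POSTCONDITION"
    return "COMPLETED"

class Demo:
    def __init__(self, args):
        self.ok = args == "ok"
    def pre_check(self):
        return True
    def execute(self):
        if not self.ok:
            raise RuntimeError("simulated task failure")
    def post_check(self):
        return True

status = run_task(Demo, "ok")
```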
When a task completes, the task owner stores the results in memory until it receives an indication from the administrator owner that it is safe to remove the results.
An important responsibility of the administrator owner is to detect and handle task owner outages. If a poll fails for any reason, the administrator owner reassigns the task owner in the cluster state and starts the task in the new location. Because tasks are idempotent, a restarted task resumes where it stopped without duplicating work.
If an administrator owner fails, the administrator leader detects the connectivity change and reassigns the administrator owner. The new administrator owner begins polling the task owner when it receives the update, so the ownership change should not interfere with task execution.
Generally, the task owner does not need to handle outages. However, in extreme situations, e.g., network splits, an administrator owner might not be able to communicate with its task owner and orphan a task. To prevent orphaned tasks from running in multiple locations in the system, the task owner times out and cancels a running task if it has not heard from the administrator owner in multiple poll cycles.
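The orphan-protection timeout can be sketched as below. The poll interval and the number of allowed missed polls are illustrative assumptions, not documented values.

```python
# Sketch of orphan protection: the task owner self-cancels a running task
# if too many poll intervals pass without contact from the administrator
# owner, preventing the task from running in multiple locations.
import time

POLL_INTERVAL_S = 5.0
MISSED_POLLS_ALLOWED = 3

class RunningTask:
    def __init__(self):
        self.last_poll = time.monotonic()
        self.canceled = False

    def on_poll(self):
        # Called whenever the administrator owner polls this task.
        self.last_poll = time.monotonic()

    def check_orphaned(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_poll > POLL_INTERVAL_S * MISSED_POLLS_ALLOWED:
            self.canceled = True        # self-cancel to avoid duplicate execution
        return self.canceled

t = RunningTask()
t.check_orphaned(now=t.last_poll + 16)  # past three missed polls
```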
When you execute a task using a SQL statement, the task performs the operation by executing one or more subtasks. Thus, when you query the system catalog tables, look for multiple subtasks and their individual progress rather than a single task.
You can also execute tasks as a series of subtasks that run sequentially, in parallel, or in some combination of the two. The series comprises a parent task with a number of child tasks, each with a specified ordering. The Ocient System executes child tasks with the same ordering in parallel and executes the groups of child tasks in sequence by ascending ordering. The system assigns an administrator owner when a child task should begin running and executes parent tasks directly on the administrator owner. Parent tasks are only responsible for watching the status of their running child tasks and starting the next child tasks when applicable; they complete after all their child tasks complete.
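The child-task scheduling by ordering can be illustrated with a small grouping function. The child-task names here are hypothetical.

```python
# Sketch of child-task scheduling: children sharing an ordering value form
# a group that runs in parallel, and groups run in sequence by ascending
# ordering.
from itertools import groupby

# (child name, ordering) pairs; names are illustrative.
children = [("rebuild-a", 1), ("rebuild-b", 1), ("verify", 2), ("rebalance", 3)]

def schedule(children):
    ordered = sorted(children, key=lambda c: c[1])
    # Each inner list is a parallel group; the outer list is the sequence.
    return [[name for name, _ in grp]
            for _, grp in groupby(ordered, key=lambda c: c[1])]

groups = schedule(children)
# [["rebuild-a", "rebuild-b"], ["verify"], ["rebalance"]]
```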
To see the disposition of subtasks in the system, you can query the sys.subtasks system catalog table. For details, see sys.subtasks.
A client can request to gracefully cancel a task by setting an is_canceled flag on the task object. The administrator owner evaluates this state on its next poll cycle and notifies the task owner that the task should be canceled. The exact mechanism by which the running task is canceled varies by type, as some tasks might need to communicate with other protocols to stop execution.
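The graceful-cancellation handshake can be sketched as below. The class and function names are illustrative; the real notification to the task owner happens over the node-to-node protocol rather than an in-process call.

```python
# Sketch of graceful cancellation: the client sets an is_canceled flag on
# the task object, and the administrator owner acts on it at its next poll
# cycle by moving the task toward a terminal status.
class TaskRecord:
    def __init__(self, task_id):
        self.task_id = task_id
        self.is_canceled = False
        self.status = "RUNNING"

def client_cancel(task: TaskRecord) -> None:
    task.is_canceled = True             # request only; no immediate effect

def admin_poll_cycle(task: TaskRecord) -> str:
    if task.is_canceled and task.status == "RUNNING":
        # In the real system this notifies the task owner to stop execution.
        task.status = "CANCELLED"
    return task.status

t = TaskRecord("my_task")
client_cancel(t)
status = admin_poll_cycle(t)
```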
Canceling a parent task causes all of its subtasks to be canceled. A subtask cannot be canceled individually.
If the task is successfully canceled, execution ends with a CANCELLED status, which propagates back to the administrator owner and the Raft state through the normal success or failure paths.
For example, you can cancel the task named my_task by using the CANCEL TASK statement.
For the full syntax, see CANCEL TASK.