Resume a Pipeline with File Loading
In many cases, a file-based pipeline stops executing before completion. You cannot resume a pipeline in a COMPLETED status.
To resume a pipeline, use the START PIPELINE SQL statement. Before you resume a pipeline, the status of the pipeline must be CREATED, FAILED, or STOPPED. When a pipeline resumes, individual files remain in their most recent status as defined in the sys.pipeline_files system catalog table. For BATCH pipelines, the Ocient System does not add new files to the eligible file list when the pipeline resumes.
The START operation groups files using their extractor_task_id and stream_source_id identifiers. The stream_source_id uniquely identifies partitions (i.e., an ordered list of files), and extractor_task_id identifies the batch that loads a group of partitions.
File Statuses
The Ocient System considers files with the statuses LOADED, LOADED_WITH_ERRORS, or SKIPPED to be in a terminal status, whereas files with any other status are still in process.
- Completed Batches — If all the files in a particular batch have terminal status, then the pipeline does not attempt to reload the batch. These files have been completely processed, so the Ocient System ignores modifications to these files.
- In-Process Batches — If at least one file in a particular batch does not have terminal status, then the pipeline reloads the entire batch. The pipeline reprocesses the in-process batches and relies on row deduplication to prevent duplication of rows in the target tables.
  - Modifications to files in an in-process batch can, but are not guaranteed to, be picked up by a restart.
  - Modifications to any files in this batch with the LOADED, LOADED_WITH_ERRORS, or SKIPPED statuses might cause issues with deduplication, leading to duplicate or missed data.
- Pending Files — The Ocient System does not assign all PENDING files to a partition. The pipeline attempts to load these files after reloading any in-process batches.
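The restart rules above can be modeled in a few lines of Python. This is only an illustrative sketch, not the Ocient implementation: the record layout (dictionaries with `filename`, `status`, and `extractor_task_id` keys), the `plan_restart` function, and the `LOADING` status used as a stand-in for any non-terminal status are all assumptions made for the example.

```python
from collections import defaultdict

# Terminal statuses as listed in the text above.
TERMINAL_STATUSES = {"LOADED", "LOADED_WITH_ERRORS", "SKIPPED"}


def plan_restart(files):
    """Decide what a restart would reprocess (hypothetical model).

    `files` is an iterable of dicts with the assumed keys
    "filename", "status", and "extractor_task_id".
    Returns (batch_ids_to_reload, pending_filenames).
    """
    batches = defaultdict(list)
    pending = []
    for f in files:
        if f["status"] == "PENDING":
            # Pending files are attempted after in-process batches reload.
            pending.append(f["filename"])
        else:
            batches[f["extractor_task_id"]].append(f["status"])

    # A batch is reloaded in full if at least one file is not terminal.
    to_reload = sorted(
        batch_id
        for batch_id, statuses in batches.items()
        if any(s not in TERMINAL_STATUSES for s in statuses)
    )
    return to_reload, pending


files = [
    {"filename": "a.csv", "status": "LOADED", "extractor_task_id": 1},
    {"filename": "b.csv", "status": "SKIPPED", "extractor_task_id": 1},
    {"filename": "c.csv", "status": "LOADED", "extractor_task_id": 2},
    # "LOADING" is a placeholder for any non-terminal status.
    {"filename": "d.csv", "status": "LOADING", "extractor_task_id": 2},
    {"filename": "e.csv", "status": "PENDING", "extractor_task_id": None},
]
reload_ids, pending_files = plan_restart(files)
```

In this example, batch 1 is left alone because both of its files are terminal, batch 2 is reloaded in full even though `c.csv` already has the LOADED status, and `e.csv` is attempted afterward.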
Load Duplicate Data from Files
Sometimes you might want to load the same data multiple times. If you want to load a second copy of the source data, you can either:
- Drop and recreate the pipeline to reset the sys.pipeline_files system catalog table.
- Create a second pipeline with a new name and the same configuration.
When you truncate the target tables and restart the pipeline, the Ocient System does not reload the data.
Restart with Kafka Loading
Ocient relies on the offset management and consumer group behavior in Kafka to deliver exactly-once loading semantics and to control the Ocient pipeline behavior.
Kafka Offsets and Consumer Group Identifiers
If you set the WRITE_OFFSETS option to true (default value is true), the Kafka consumers commit offsets back to Kafka after data is considered durable in the database. The Kafka broker stores these offsets as the last committed offset for the group identifier group.id.
For each pipeline, the group identifier defaults to <systemId>__<databaseName>__<pipelineName>, where <systemId> is the identifier of your system, <databaseName> is the name of your database, and <pipelineName> is the name of the data pipeline. In most use cases, you should not manually change the group.id field for a pipeline.
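As a small illustration of the default naming pattern, the following Python sketch builds the identifier from its three parts. The function name and the sample values are hypothetical. The only detail taken from the text is the double-underscore separator between the system identifier, database name, and pipeline name.

```python
def default_group_id(system_id: str, database_name: str, pipeline_name: str) -> str:
    """Build a group identifier following the documented default pattern.

    Hypothetical helper for illustration; the pipeline derives this value itself.
    """
    return "__".join([system_id, database_name, pipeline_name])


# Sample values are made up for the example.
group_id = default_group_id("sys1", "sales", "orders_pipeline")
```

Because the pipeline name is part of the default identifier, a pipeline created under a different name gets a different consumer group and therefore has no committed offsets of its own.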
Any Kafka pipeline that has the same group.id starts consuming from its last committed offset, or if you do not set the value, the pipeline uses the Kafka auto.offset.reset policy to determine where to start. For details, see Kafka offset management.
If you want to start loading from the beginning of a topic, configure an unused group.id field (or use a group.id field that did not commit any of its offsets back) and ensure the auto.offset.reset Kafka configuration is appropriately set in the CONFIG option.
Kafka Pipeline Deduplication
The committed offset of a Kafka partition lags slightly behind the rows that have been loaded into Ocient. These lags do not cause an issue with data duplication. If you stop a pipeline before it can commit its most recent durable offset to Kafka, restarting the same pipeline starts loading from the last committed offset. However, the database deduplicates records sent twice for the same pipeline. Ocient deduplicates Kafka data for the specified combination of pipeline identifier, Kafka topic, and the Kafka partition number. While the consumer group offsets manage where the pipeline resumes loading, the Ocient System enforces the exactly-once loading of a Kafka partition only if you stop or restart a pipeline with the same pipeline identifier.
Load Duplicate Data on Kafka
Sometimes you might want to load the same data multiple times. If you want to load a second copy of the source data from Kafka, you can either:
- Drop the pipeline, recreate it with the same name, and reset the consumer group offsets manually.
- Create a new pipeline with a different name and load from the beginning of the topic.
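To see why a pipeline with a different name loads a second copy, it can help to model the deduplication key described under Kafka Pipeline Deduplication. The Python sketch below is a toy model under stated assumptions: the documentation says only that deduplication is scoped to the combination of pipeline identifier, topic, and partition number. The `OffsetDeduplicator` class and the use of a per-key highest-offset marker are assumptions for illustration, not a description of how Ocient implements it.

```python
class OffsetDeduplicator:
    """Toy model: accept each record at most once per dedup key.

    The key (pipeline_id, topic, partition) follows the text; tracking the
    highest accepted offset per key is an assumed mechanism.
    """

    def __init__(self):
        # (pipeline_id, topic, partition) -> highest offset accepted so far
        self._highest = {}

    def accept(self, pipeline_id, topic, partition, offset):
        key = (pipeline_id, topic, partition)
        if offset <= self._highest.get(key, -1):
            return False  # already loaded for this key, so drop the replay
        self._highest[key] = offset
        return True


dedup = OffsetDeduplicator()
first = dedup.accept("pipeline_a", "events", 0, 41)
# Restart replays offset 41 because the commit to Kafka lagged behind.
replay = dedup.accept("pipeline_a", "events", 0, 41)
# A different pipeline identifier has its own key, so the row loads again.
other = dedup.accept("pipeline_b", "events", 0, 41)
```

In this model the replayed record is rejected for `pipeline_a`, while the same record is accepted for `pipeline_b` because the pipeline identifier is part of the key.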

