Documentation Index
Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
Use this file to discover all available pages before exploring further.
Exactly Once Loading in Pipelines
The job of a pipeline is to load the data assigned to it exactly once. A pipeline maintains its own position during a load and enforces deduplication logic. The lifecycle of a pipeline object in the database defines both the deduplication logic and load position.Because the pipeline, not the target tables, defines the load position and deduplication scope, if the system truncates target tables or if the system drops and recreates a target table, the pipeline continues from its last position. It does not reset to the beginning position of the load if you attempt to load data into a new table.
- Pipeline Events — During the life of a pipeline, you can stop and resume the pipeline without duplicating source data or changing the position in a load.
- Pipeline Updates — You can modify pipelines using the
CREATE OR REPLACESQL statement to update the transforms in a pipeline but maintain the current position in the load. - Exactly Once Loading — In the event of transient system failures, the pipeline ensures exactly-once loading into the target table through replay and deduplication guarantees.
- Load Position
- Deduplication Scope
Load Position
You can control the pipeline by starting, stopping, modifying, and resuming the pipeline. While running, each pipeline maintains its position in the overall load. When it has reached the end of all data, the pipeline completes and does not attempt to reload the data. To load the data again, you must create a new pipeline or drop and recreate the pipeline.- For an -based load, the position consists of using consumer group offsets appropriately to create checkpoints.
- For File-based loads, file details are stored in the
sys.pipeline_filessystem catalog table and the system updates the status of each file as the load progresses.
Deduplication Scope
The deduplication scope defines the conditions under which a pipeline does not cause the same row to appear twice in a target table. This deduplication is how a pipeline ensures that it is safe to replay data during loading. This situation is common in failover situations or when you stop and restart a pipeline. The deduplication scope is the unique combination of:- A pipeline object
- A target table
Restarts and Deduplication
You can restart a pipeline without creating duplicate data in the target tables. However, there are some limitations and key assumptions for each data source. For example, in a file load, if you modify the contents of a file after the pipeline has started, then you can experience duplication of data or missed rows. For more details and key considerations, see Restarting a File Load and Restarting a Kafka Load.Deduplication and DROP TABLE
You can drop a table using theDROP TABLE SQL statement and then recreate it with the same name. In this case, an existing pipeline loads into the new table. When you drop a table and recreate it, these actions do not update the load position of the pipeline. When you restart the pipeline, it resumes from the load position of the pipeline, and all records from that position forward are loaded with a new deduplication scope.
Deduplication and TRUNCATE TABLE
You can truncate a table (using theTRUNCATE TABLE SQL statement) that is the target of a pipeline whether the pipeline is running or stopped. These actions do not update the load position of the pipeline or change the deduplication scope. The system deduplicates any data that was loaded before the truncate and replays after the truncate for any reason (e.g., restart or transient failure). Also, the system loads all new data.
When you truncate the target tables and restart the pipeline, the Ocient System does not reload the data.
Load Duplicate Data
Sometimes, you might want to load the same data multiple times, but the pipeline load position and deduplication scope prevent this. If you want to load a second copy of the source data, you can follow one of these approaches: File-Based Pipelines- Drop and recreate the pipeline to reset the
sys.pipeline_filessystem catalog table. - Create a second pipeline with a new name and the same configuration.
- Drop the pipeline and recreate it with the same name, which defaults to the same consumer
group.idbut has a new deduplication scope. Then, reset the consumer group offsets using Kafka tools to the chosen starting point. - Create a new pipeline with a different name and load from the beginning of the topic using
auto.offset.reset.

