Deduplication in Data Pipelines
The job of a pipeline is to load the data assigned to it exactly once. A pipeline maintains its own position during a load and enforces deduplication logic. The lifecycle of a pipeline object in the database defines both the deduplication logic and load position.
Because the pipeline, not the target tables, defines the load position and deduplication scope, the pipeline continues from its last position even if the system truncates a target table or drops and recreates it. The pipeline does not reset to the beginning of the load when you attempt to load data into a new table.
Key benefits are:
- Pipeline Events — During the life of a pipeline, you can stop and resume the pipeline without duplicating source data or changing the position in a load.
- Pipeline Updates — You can modify pipelines with the CREATE OR REPLACE SQL statement to update the transforms in a pipeline while maintaining the current position in the load (see the sketch after this list).
- Exactly Once Loading — In the event of transient system failures, the pipeline ensures exactly-once loading into the target table through replay and deduplication guarantees.
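For example, updating a transform while keeping the load position might look like the following. This is a minimal sketch, assuming a Kafka source and a simple field-mapping transform; the pipeline name, source options, and field selectors are illustrative rather than exact syntax.

```sql
-- Illustrative sketch: clause names and field selectors are assumptions.
-- Replacing the pipeline updates its transforms but keeps its load position,
-- so loading resumes where the previous definition left off.
CREATE OR REPLACE PIPELINE orders_pipeline
  SOURCE KAFKA TOPIC 'orders'          -- same source as the original definition
  INTO TABLE public.orders
  SELECT
    $order_id      AS order_id,
    UPPER($region) AS region,          -- updated transform
    $amount        AS amount;
```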
Two concepts are critical to understanding how a pipeline controls which data is loaded and how Ocient pipelines ensure exactly-once loading:
- Load Position
- Deduplication Scope
You control a pipeline by starting, stopping, modifying, and resuming it. While running, each pipeline maintains its position in the overall load. When it reaches the end of all data, the pipeline completes and does not attempt to reload the data. To load the data again, you must create a new pipeline or drop and recreate the existing one.
- For a Kafka-based load, the position consists of consumer group offsets, which the pipeline uses to create checkpoints.
- For file-based loads, the system stores file details in the sys.pipeline_files system catalog table and updates the status of each file as the load progresses.
In both cases, the load position defines where the pipeline resumes loading if stopped and resumed.
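As a hedged illustration, the following shows a stop, a check of the load position for a file-based pipeline, and a resume. The START PIPELINE and STOP PIPELINE statement forms and the sys.pipeline_files column names are assumptions; confirm them against your version's SQL reference and system catalog.

```sql
-- Stop the pipeline; its load position is preserved.
STOP PIPELINE files_pipeline;

-- Inspect per-file progress (column names are assumptions).
SELECT filename, status
FROM sys.pipeline_files
WHERE pipeline_name = 'files_pipeline'
ORDER BY filename;

-- Resume; loading continues from the saved position without duplicating rows.
START PIPELINE files_pipeline;
```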
The deduplication scope defines the conditions under which a pipeline does not cause the same row to appear twice in a target table. This deduplication is how a pipeline ensures that it is safe to replay data during loading. Replays are common in failover scenarios or when you stop and restart a pipeline.
The deduplication scope is the unique combination of:
- A pipeline object
- A target table
A pipeline guarantees that if it sends the same row to the same target table twice, the system only loads it into the table once. If you drop and recreate either the pipeline or the table, the result is a new deduplication scope.
If you begin sending data to a new target table, that is also a new deduplication scope. A row for the same record can appear in both the old and new tables.
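The practical consequence is the difference between replacing a pipeline and recreating it. A hedged sketch follows, with pipeline bodies elided and statement forms shown as assumptions.

```sql
-- Same deduplication scope: the replaced pipeline keeps its load position,
-- so rows that were already loaded are not loaded again.
CREATE OR REPLACE PIPELINE orders_pipeline ... ;

-- New deduplication scope: dropping and recreating the pipeline resets the
-- position, so rows that were already loaded can be loaded a second time.
DROP PIPELINE orders_pipeline;
CREATE PIPELINE orders_pipeline ... ;
```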
You can restart a pipeline without creating duplicate data in the target tables. However, there are some limitations and key assumptions for each data source.
For example, in a file load, if you modify the contents of a file after the pipeline has started, the load can duplicate data or miss rows.
For more details and key considerations, see Restarting a File Load and Restarting a Kafka Load.
You can drop a table using the DROP TABLE SQL statement and then recreate it with the same name. In this case, an existing pipeline loads into the new table. Dropping and recreating the table does not update the load position of the pipeline. When you restart the pipeline, it resumes from its load position, and the system loads all records from that position forward under a new deduplication scope.
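A hedged sequence for this case follows; the statement forms and names are illustrative, and the table definition is elided.

```sql
-- Stop the pipeline if it is running (statement form is an assumption).
STOP PIPELINE orders_pipeline;

-- Drop and recreate the target table with the same name.
DROP TABLE public.orders;
CREATE TABLE public.orders ( ... );   -- recreate with the original definition

-- The pipeline's load position is unchanged. Restarting resumes from that
-- position and loads records into the new table under a new deduplication scope.
START PIPELINE orders_pipeline;
```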
You can truncate a table (using the TRUNCATE TABLE SQL statement) that is the target of a pipeline whether the pipeline is running or stopped. Truncation does not update the load position of the pipeline or change the deduplication scope. Any data that was loaded before the truncation and replayed after it for any reason (for example, a restart or transient failure) is deduplicated, and the system loads all new data.
When you truncate the target tables and restart the pipeline, the Ocient System does not reload the data.
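A hedged sketch of the truncate case (names and statement forms are illustrative):

```sql
-- Truncate the target; this works whether the pipeline is running or stopped.
TRUNCATE TABLE public.orders;

-- Restarting neither moves the load position nor changes the deduplication scope:
-- data loaded before the truncation is deduplicated rather than reloaded,
-- and only new data is loaded.
START PIPELINE orders_pipeline;
```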
Sometimes, you might want to load the same data multiple times, but the pipeline load position and deduplication scope prevent this.
If you want to load a second copy of the source data, you can follow one of these approaches:
File-Based Pipelines
- Drop and recreate the pipeline to reset its entries in the sys.pipeline_files system catalog table (see the sketch after this list).
- Create a second pipeline with a new name and the same configuration.
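Both file-based options might look like the following hedged sketch; pipeline names are illustrative and pipeline bodies are elided.

```sql
-- Option 1: drop and recreate the pipeline under the same name.
-- This resets its entries in sys.pipeline_files, so all files are loaded again.
DROP PIPELINE files_pipeline;
CREATE PIPELINE files_pipeline ... ;      -- same configuration as before
START PIPELINE files_pipeline;

-- Option 2: create a second pipeline with a new name and the same configuration.
-- The new pipeline is a new deduplication scope, so the files load a second time.
CREATE PIPELINE files_pipeline_copy ... ; -- same configuration, different name
START PIPELINE files_pipeline_copy;
```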
Kafka-Based Pipelines
- Drop the pipeline and recreate it with the same name, which defaults to the same consumer group.id but has a new deduplication scope. Then, reset the consumer group offsets using Kafka tools to the chosen starting point.
- Create a new pipeline with a different name and load from the beginning of the topic using auto.offset.reset (see the sketch after this list).
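The second Kafka option might look like the following hedged sketch. The mechanism for passing auto.offset.reset through to the consumer and the other clause names are assumptions; the first option (resetting consumer group offsets) uses standard Kafka tooling such as kafka-consumer-groups and is not shown here.

```sql
-- Illustrative sketch: the CONFIG pass-through for auto.offset.reset is an assumption.
-- A different pipeline name means a different consumer group and a new
-- deduplication scope, so the topic is consumed again from the earliest offset.
CREATE PIPELINE kafka_pipeline_copy
  SOURCE KAFKA
    TOPIC 'orders'
    CONFIG 'auto.offset.reset' = 'earliest'
  INTO TABLE public.orders
  SELECT $order_id AS order_id, $amount AS amount;
START PIPELINE kafka_pipeline_copy;
```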