Deduplication in Data Pipelines
The job of a pipeline is to load the data assigned to it exactly once. A pipeline maintains its own position during a load and enforces deduplication logic. The lifecycle of a pipeline object in the database defines both the deduplication logic and load position.
Because the pipeline, not the target tables, defines the load position and deduplication scope, the pipeline continues from its last position even if the system truncates a target table or drops and recreates it. The pipeline does not reset to the beginning of the load when you attempt to load data into a new table.
Key benefits are:
- Pipeline Events — During the life of a pipeline, you can stop and resume the pipeline without duplicating source data or changing the position in a load.
- Pipeline Updates — You can modify pipelines with the CREATE OR REPLACE SQL statement to update the transforms in a pipeline while maintaining the current position in the load (see the sketch after this list).
- Exactly Once Loading — In the event of transient system failures, the pipeline ensures exactly-once loading into the target table through replay and deduplication guarantees.
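For example, updating a transform while keeping the load position might look like the following. This is a minimal sketch, assuming a Kafka source and a simple field-mapping transform; the pipeline name, source options, and field selectors are illustrative rather than exact syntax.

```sql
-- Illustrative sketch: clause names and field selectors are assumptions.
-- Replacing the pipeline updates its transforms but keeps its load position,
-- so loading resumes where the previous definition left off.
CREATE OR REPLACE PIPELINE orders_pipeline
  SOURCE KAFKA TOPIC 'orders'          -- same source as the original definition
  INTO TABLE public.orders
  SELECT
    $order_id      AS order_id,
    UPPER($region) AS region,          -- updated transform
    $amount        AS amount;
```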
Two concepts are critical to understanding how a pipeline controls which data is loaded and how Ocient pipelines ensure exactly-once loading:
- Load Position
- Deduplication Scope
You control a pipeline by starting, stopping, modifying, and resuming it. While running, each pipeline maintains its position in the overall load. When it reaches the end of all data, the pipeline completes and does not attempt to reload the data. To load the data again, you must create a new pipeline or drop and recreate the existing one.
- For a Kafka-based load, the position consists of consumer group offsets, which the pipeline uses to create checkpoints.
- For file-based loads, the system stores file details in the sys.pipeline_files system catalog table and updates the status of each file as the load progresses.
In both cases, the load position defines where the pipeline resumes loading if stopped and resumed.
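As a hedged illustration, the following shows a stop, a check of the load position for a file-based pipeline, and a resume. The START PIPELINE and STOP PIPELINE statement forms and the sys.pipeline_files column names are assumptions; confirm them against your version's SQL reference and system catalog.

```sql
-- Stop the pipeline; its load position is preserved.
STOP PIPELINE files_pipeline;

-- Inspect per-file progress (column names are assumptions).
SELECT filename, status
FROM sys.pipeline_files
WHERE pipeline_name = 'files_pipeline'
ORDER BY filename;

-- Resume; loading continues from the saved position without duplicating rows.
START PIPELINE files_pipeline;
```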
The deduplication scope defines the conditions under which a pipeline does not cause the same row to appear twice in a target table. This deduplication is how a pipeline ensures that it is safe to replay data during loading. Replays are common in failover scenarios or when you stop and restart a pipeline.
The deduplication scope is the unique combination of:
- A pipeline object
- A target table
A pipeline guarantees that if it sends the same row to the same target table twice, the system only loads it into the table once. If you drop and recreate either the pipeline or the table, the result is a new deduplication scope.
If you begin sending data to a new target table, that is also a new deduplication scope. A row for the same record can appear in both the old and new tables.
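The practical consequence is the difference between replacing a pipeline and recreating it. A hedged sketch follows, with pipeline bodies elided and statement forms shown as assumptions.

```sql
-- Same deduplication scope: the replaced pipeline keeps its load position,
-- so rows that were already loaded are not loaded again.
CREATE OR REPLACE PIPELINE orders_pipeline ... ;

-- New deduplication scope: dropping and recreating the pipeline resets the
-- position, so rows that were already loaded can be loaded a second time.
DROP PIPELINE orders_pipeline;
CREATE PIPELINE orders_pipeline ... ;
```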
You can restart a pipeline without creating duplicate data in the target tables. However, there are some limitations and key assumptions for each data source.
For example, in a file load, if you modify the contents of a file after the pipeline has started, the load can duplicate data or miss rows.
For more details and key considerations, see Restarting a File Load and Restarting a Kafka Load.
You can drop a table using the DROP TABLE SQL statement and then recreate it with the same name. In this case, an existing pipeline loads into the new table. Dropping and recreating the table does not update the load position of the pipeline. When you restart the pipeline, it resumes from its load position, and the system loads all records from that position forward under a new deduplication scope.
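A hedged sequence for this case follows; the statement forms and names are illustrative, and the table definition is elided.

```sql
-- Stop the pipeline if it is running (statement form is an assumption).
STOP PIPELINE orders_pipeline;

-- Drop and recreate the target table with the same name.
DROP TABLE public.orders;
CREATE TABLE public.orders ( ... );   -- recreate with the original definition

-- The pipeline's load position is unchanged. Restarting resumes from that
-- position and loads records into the new table under a new deduplication scope.
START PIPELINE orders_pipeline;
```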
You can truncate a table (using the TRUNCATE TABLE SQL statement) that is the target of a pipeline whether the pipeline is running or stopped. Truncation does not update the load position of the pipeline or change the deduplication scope. Any data that was loaded before the truncation and replayed after it for any reason (for example, a restart or transient failure) is deduplicated, and the system loads all new data.
When you truncate the target tables and restart the pipeline, the Ocient System does not reload the data.
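A hedged sketch of the truncate case (names and statement forms are illustrative):

```sql
-- Truncate the target; this works whether the pipeline is running or stopped.
TRUNCATE TABLE public.orders;

-- Restarting neither moves the load position nor changes the deduplication scope:
-- data loaded before the truncation is deduplicated rather than reloaded,
-- and only new data is loaded.
START PIPELINE orders_pipeline;
```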
Sometimes, you might want to load the same data multiple times, but the pipeline load position and deduplication scope prevent this.
If you want to load a second copy of the source data, you can follow one of these approaches:
File-Based Pipelines
- Drop and recreate the pipeline to reset its entries in the sys.pipeline_files system catalog table (see the sketch after this list).
- Create a second pipeline with a new name and the same configuration.
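Both file-based options might look like the following hedged sketch; pipeline names are illustrative and pipeline bodies are elided.

```sql
-- Option 1: drop and recreate the pipeline under the same name.
-- This resets its entries in sys.pipeline_files, so all files are loaded again.
DROP PIPELINE files_pipeline;
CREATE PIPELINE files_pipeline ... ;      -- same configuration as before
START PIPELINE files_pipeline;

-- Option 2: create a second pipeline with a new name and the same configuration.
-- The new pipeline is a new deduplication scope, so the files load a second time.
CREATE PIPELINE files_pipeline_copy ... ; -- same configuration, different name
START PIPELINE files_pipeline_copy;
```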
Kafka-Based Pipelines
- Drop the pipeline and recreate it with the same name, which defaults to the same consumer group.id but has a new deduplication scope. Then, reset the consumer group offsets using Kafka tools to the chosen starting point.
- Create a new pipeline with a different name and load from the beginning of the topic using auto.offset.reset (see the sketch after this list).
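The second Kafka option might look like the following hedged sketch. The mechanism for passing auto.offset.reset through to the consumer and the other clause names are assumptions; the first option (resetting consumer group offsets) uses standard Kafka tooling such as kafka-consumer-groups and is not shown here.

```sql
-- Illustrative sketch: the CONFIG pass-through for auto.offset.reset is an assumption.
-- A different pipeline name means a different consumer group and a new
-- deduplication scope, so the topic is consumed again from the earliest offset.
CREATE PIPELINE kafka_pipeline_copy
  SOURCE KAFKA
    TOPIC 'orders'
    CONFIG 'auto.offset.reset' = 'earliest'
  INTO TABLE public.orders
  SELECT $order_id AS order_id, $amount AS amount;
START PIPELINE kafka_pipeline_copy;
```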