Load Data
Data Pipeline Behavior Considerations
Data pipelines enable you to load data from your chosen source into the {{ocienthyperscaledatawarehouse}}. For details about the syntax, see docid\ pbyszqvu5wonpgoso qto.

When you use data pipelines to load data, consider how to resume a pipeline with file loading, how to restart a pipeline with {{kafka}} loading, and the pipeline dependencies.

Resume a Pipeline with File Loading

In many cases, a file-based pipeline stops executing before completion. You cannot resume a pipeline in a completed status. To resume a pipeline, use the START PIPELINE SQL statement. Before you resume a pipeline, the status of the pipeline must be CREATED, FAILED, or STOPPED.

When a pipeline resumes, individual files remain in their most recent status as defined in the sys.pipeline_files system catalog table. For batch pipelines, the {{ocient}} system does not add new files to the eligible file list when the pipeline resumes.

If you modify the contents of files during the loading process, the Ocient system might experience issues with deduplication that cause duplicated rows or missing data. Avoid modifying files after you start a pipeline for the first time. Creating new files on your data source does not impact the deduplication logic.

The start operation groups files using their extractor_task_id and stream_source_id identifiers. The stream_source_id uniquely identifies a partition (i.e., an ordered list of files), and the extractor_task_id identifies the batch that loads a group of partitions.

File Statuses

The Ocient system considers files with the LOADED, LOADED_WITH_ERRORS, or SKIPPED statuses to be in a terminal status; files with other statuses are still in process.

- Completed batches: If all the files in a particular batch have a terminal status, the pipeline does not attempt to reload the batch. These files have been completely processed, so the Ocient system ignores modifications to these files.
- In-process batches: If at least one file in a particular batch does not have a terminal status, then
the pipeline reloads the entire batch. The pipeline reprocesses the in-process batches and relies on row deduplication to prevent duplication of rows in the target tables. Modifications to files in an in-process batch can, but are not guaranteed to, be picked up by a restart. Modifications to any files in this batch with the LOADED, LOADED_WITH_ERRORS, or SKIPPED statuses might cause issues with deduplication, leading to duplicate or missed data.
- Pending files: The Ocient system does not assign all pending files to a partition. The pipeline attempts to load these files after reloading any in-process batches.

Load Duplicate Data from Files

Sometimes you might want to load the same data multiple times. If you want to load a second copy of the source data, you can either:
- Drop and recreate the pipeline to reset the sys.pipeline_files system catalog table.
- Create a second pipeline with a new name and the same configuration.

When you truncate the target tables and restart the pipeline, the Ocient system does not reload the data.

Restart with Kafka Loading

Ocient relies on the offset management and consumer group behavior in Kafka to deliver exactly-once loading semantics and to control the Ocient pipeline behavior.

Kafka Offsets and Consumer Group Identifiers

If you set the write_offsets option to true (the default value is true), the Kafka consumers commit offsets back to Kafka after data is considered durable in the database. The Kafka broker stores these offsets as the last committed offset for the group identifier (group_id).

For each pipeline, the group identifier defaults to a combination of \<systemid>, \<databasename>, and \<pipelinename>, where \<systemid> is the identifier of your system, \<databasename> is the name of your database, and \<pipelinename> is the name of the data pipeline. In most use cases, you should not manually change the group_id field for a pipeline.

Any Kafka pipeline that has the same group_id starts consuming from its last committed offset, or, if you do not set the value,
the pipeline uses the Kafka auto.offset.reset policy to determine where to start. For details, see https://docs.confluent.io/platform/current/clients/consumer.html#offset-management.

If you want to start loading from the beginning of a topic, configure an unused group_id field (or use a group_id field that did not commit any of its offsets back), and ensure that the auto.offset.reset Kafka configuration is appropriately set in the config option.

Kafka Pipeline Deduplication

The committed offset of a Kafka partition lags slightly behind the rows that have been loaded into Ocient. These lags do not cause an issue with data duplication. If you stop a pipeline before it can commit its most recent durable offset to Kafka, restarting the same pipeline starts loading from the last committed offset. However, the database deduplicates records sent twice for the same pipeline.

Ocient deduplicates Kafka data for the specified combination of pipeline identifier, Kafka topic, and Kafka partition number. While the consumer group offsets manage where the pipeline resumes loading, the Ocient system enforces the exactly-once loading of a Kafka partition only if you stop or restart a pipeline with the same pipeline identifier.

If you drop a pipeline and create a new one with the same name, the Ocient system creates a new pipeline identifier. This action does not deduplicate data against data loaded in the original pipeline. To preserve deduplication, instead of dropping the pipeline, use the CREATE OR REPLACE PIPELINE SQL statement with the original pipeline name, and the pipeline correctly deduplicates against the original data.

Do not run multiple pipelines concurrently with the same consumer group identifier. This action leads to unpredictable data duplication. If you want to increase the number of consumers that read from a Kafka topic, increase the value of the cores parameter instead.

Load Duplicate Data on Kafka

Sometimes you might want to load the same data multiple times. If you want to load a second
copy of the source data from Kafka, you can either:
- Drop the pipeline, recreate it with the same name, and reset the consumer group offsets manually.
- Create a new pipeline with a different name and load from the beginning of the topic.

Pipeline Database Dependency

Each pipeline belongs to a database. You cannot drop a database that has a running pipeline. To drop a database, ensure that all pipelines in the database are in a non-running status.

Pipeline Table Dependency

Each pipeline has a target table. You cannot drop a table that has a running pipeline. To drop a table, ensure that all pipelines that are loading data into the table are in a non-running status.

Related Links

docid\ zncvnrhsf6fg1yvqk6mxt
docid\ stial7oztpmpndfwcsm29
docid ymealxr8i3ef2yhn9r8a
docid\ ti3mdibvgmuudmlqu9xpl
docid\ vqvrmdyk8josxmkfsyprc
docid\ f55ngxtki0f7kkmyatvug
docid\ g6voewkufcxz2yscdfxx
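As a supplement to the file-loading sections above, the following SQL sketch checks per-file progress before resuming a stopped pipeline. It is a minimal example under stated assumptions: the pipeline name orders_pipeline is hypothetical, and the sys.pipeline_files column names shown (pipeline_name, status, extractor_task_id, stream_source_id) are illustrative; verify the exact columns in the system catalog reference for your release.

```sql
-- Inspect the per-file progress of a stopped pipeline before resuming it.
-- Files grouped by (extractor_task_id, stream_source_id) form batches;
-- a batch reloads only if at least one of its files is not yet in a
-- terminal status. Column names here are assumed for illustration.
SELECT status,
       extractor_task_id,
       stream_source_id,
       COUNT(*) AS file_count
FROM sys.pipeline_files
WHERE pipeline_name = 'orders_pipeline'
GROUP BY status, extractor_task_id, stream_source_id;

-- Resume the pipeline. The pipeline must be in the CREATED, FAILED, or
-- STOPPED status; files already in a terminal status (LOADED,
-- LOADED_WITH_ERRORS, SKIPPED) are not reloaded.
START PIPELINE orders_pipeline;
```

Running the SELECT first lets you estimate how much work a resume performs: batches whose files are all terminal are skipped, while mixed batches reload in full and rely on row deduplication.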
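To illustrate the Kafka guidance above, this sketch replaces a pipeline in place so that the original pipeline identifier, and the deduplication state tied to it, is preserved. The pipeline name page_views_pipeline is hypothetical and the pipeline body is deliberately elided; see the data pipeline syntax reference for the exact grammar and supported options.

```sql
-- Replace an existing Kafka pipeline WITHOUT dropping it. Dropping the
-- pipeline and creating a new one with the same name generates a new
-- pipeline identifier, so the new pipeline would no longer deduplicate
-- against previously loaded data.
CREATE OR REPLACE PIPELINE page_views_pipeline
  -- ... original source, Kafka configuration, and target clauses ...
;

-- Resume loading. With write_offsets enabled (the default), consumers
-- continue from the last offset committed to Kafka for the pipeline's
-- group_id; otherwise the auto.offset.reset policy decides the start.
START PIPELINE page_views_pipeline;
```

To scale consumption of a topic, raise the cores parameter on the pipeline rather than running a second pipeline with the same consumer group identifier, which leads to unpredictable duplication.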