Load Data
Data Pipeline Behavior Considerations
Data pipelines enable you to load data from your chosen source into the {{ocienthyperscaledatawarehouse}}. For details about the syntax, see docid\ pbyszqvu5wonpgoso qto.

When you use data pipelines to load data, consider how to resume a pipeline with file loading, how to restart a pipeline with {{kafka}} loading, and the pipeline dependencies.

Resume a Pipeline with File Loading

In many cases, a file-based pipeline stops executing before completion. You cannot resume a pipeline in a completed status. To resume a pipeline, use the START PIPELINE SQL statement. Before you resume a pipeline, the status of the pipeline must be CREATED, FAILED, or STOPPED.

When a pipeline resumes, individual files remain in their most recent status as defined in the sys.pipeline_files system catalog table. For batch pipelines, the {{ocient}} system does not add new files to the eligible file list when the pipeline resumes.

If you modify the contents of files during the loading process, the Ocient system might experience issues with deduplication that cause duplicated rows or missing data. Avoid modifying files after you start a pipeline for the first time. Creating new files on your data source does not impact the deduplication logic.

The start operation groups files using their extractor_task_id and stream_source_id identifiers. The stream_source_id uniquely identifies a partition (i.e., an ordered list of files), and the extractor_task_id identifies the batch that loads a group of partitions.

File Statuses

The Ocient system considers files with the LOADED, LOADED_WITH_ERRORS, or SKIPPED statuses to be in a terminal status; files with other statuses are still in process.

- Completed batches: If all the files in a particular batch have a terminal status, the pipeline does not attempt to reload the batch. These files have been completely processed, so the Ocient system ignores modifications to these files.
- In-process batches: If at least one file in a particular batch does not have a terminal status, then
the pipeline reloads the entire batch. The pipeline reprocesses the in-process batches and relies on row deduplication to prevent duplication of rows in the target tables. Modifications to files in an in-process batch can, but are not guaranteed to, be picked up by a restart. Modifications to any files in this batch with the LOADED, LOADED_WITH_ERRORS, or SKIPPED statuses might cause issues with deduplication, leading to duplicate or missed data.
- Pending files: The Ocient system does not assign all pending files to a partition. The pipeline attempts to load these files after reloading any in-process batches.

Load Duplicate Data from Files

Sometimes you might want to load the same data multiple times. If you want to load a second copy of the source data, you can either:
- Drop and recreate the pipeline to reset the sys.pipeline_files system catalog table.
- Create a second pipeline with a new name and the same configuration.

When you truncate the target tables and restart the pipeline, the Ocient system does not reload the data.

Restart with Kafka Loading

Ocient relies on the offset management and consumer group behavior in Kafka to deliver exactly-once loading semantics and to control the Ocient pipeline behavior.

Kafka Offsets and Consumer Group Identifiers

If you set the write_offsets option to true (the default value is true), the Kafka consumers commit offsets back to Kafka after data is considered durable in the database. The Kafka broker stores these offsets as the last committed offset for the group identifier (group_id).

For each pipeline, the group identifier defaults to a combination of \<systemid>, \<databasename>, and \<pipelinename>, where \<systemid> is the identifier of your system, \<databasename> is the name of your database, and \<pipelinename> is the name of the data pipeline. In most use cases, you should not manually change the group_id field for a pipeline.

Any Kafka pipeline that has the same group_id starts consuming from its last committed offset, or, if you do not set the value,
the pipeline uses the Kafka auto.offset.reset policy to determine where to start. For details, see https://docs.confluent.io/platform/current/clients/consumer.html#offset-management.

If you want to start loading from the beginning of a topic, configure an unused group_id field (or use a group_id field that did not commit any of its offsets back), and ensure that the auto.offset.reset Kafka configuration is appropriately set in the config option.

Kafka Pipeline Deduplication

The committed offset of a Kafka partition lags slightly behind the rows that have been loaded into Ocient. These lags do not cause an issue with data duplication. If you stop a pipeline before it can commit its most recent durable offset to Kafka, restarting the same pipeline starts loading from the last committed offset. However, the database deduplicates records sent twice for the same pipeline.

Ocient deduplicates Kafka data for the specified combination of pipeline identifier, Kafka topic, and Kafka partition number. While the consumer group offsets manage where the pipeline resumes loading, the Ocient system enforces the exactly-once loading of a Kafka partition only if you stop or restart a pipeline with the same pipeline identifier.

If you drop a pipeline and create a new one with the same name, the Ocient system creates a new pipeline identifier. This action does not deduplicate data against data loaded in the original pipeline. To preserve deduplication, instead of dropping the pipeline, use the CREATE OR REPLACE PIPELINE SQL statement with the original pipeline name, and the pipeline correctly deduplicates against the original data.

Do not run multiple pipelines concurrently with the same consumer group identifier. This action leads to unpredictable data duplication. If you want to increase the number of consumers that read from a Kafka topic, increase the value of the cores parameter instead.

Load Duplicate Data on Kafka

Sometimes you might want to load the same data multiple times. If you want to load a second
copy of the source data from Kafka, you can either:
- Drop the pipeline, recreate it with the same name, and reset the consumer group offsets manually.
- Create a new pipeline with a different name and load from the beginning of the topic.

Pipeline Database Dependency

Each pipeline belongs to a database. You cannot drop a database that has a running pipeline. To drop a database, ensure that all pipelines in the database are in a non-running status.

Pipeline Table Dependency

Each pipeline has a target table. You cannot drop a table that has a running pipeline. To drop a table, ensure that all pipelines that are loading data into the table are in a non-running status.

Related Links

docid\ zncvnrhsf6fg1yvqk6mxt
docid\ stial7oztpmpndfwcsm29
docid ymealxr8i3ef2yhn9r8a
docid\ ti3mdibvgmuudmlqu9xpl
docid\ vqvrmdyk8josxmkfsyprc
docid\ f55ngxtki0f7kkmyatvug
docid\ g6voewkufcxz2yscdfxx
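As a supplement to the file-loading sections above, the following SQL sketch checks per-file progress before resuming a stopped pipeline. It is a minimal example under stated assumptions: the pipeline name orders_pipeline is hypothetical, and the sys.pipeline_files column names shown (pipeline_name, status, extractor_task_id, stream_source_id) are illustrative; verify the exact columns in the system catalog reference for your release.

```sql
-- Inspect the per-file progress of a stopped pipeline before resuming it.
-- Files grouped by (extractor_task_id, stream_source_id) form batches;
-- a batch reloads only if at least one of its files is not yet in a
-- terminal status. Column names here are assumed for illustration.
SELECT status,
       extractor_task_id,
       stream_source_id,
       COUNT(*) AS file_count
FROM sys.pipeline_files
WHERE pipeline_name = 'orders_pipeline'
GROUP BY status, extractor_task_id, stream_source_id;

-- Resume the pipeline. The pipeline must be in the CREATED, FAILED, or
-- STOPPED status; files already in a terminal status (LOADED,
-- LOADED_WITH_ERRORS, SKIPPED) are not reloaded.
START PIPELINE orders_pipeline;
```

Running the SELECT first lets you estimate how much work a resume performs: batches whose files are all terminal are skipped, while mixed batches reload in full and rely on row deduplication.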
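To illustrate the Kafka guidance above, this sketch replaces a pipeline in place so that the original pipeline identifier, and the deduplication state tied to it, is preserved. The pipeline name page_views_pipeline is hypothetical and the pipeline body is deliberately elided; see the data pipeline syntax reference for the exact grammar and supported options.

```sql
-- Replace an existing Kafka pipeline WITHOUT dropping it. Dropping the
-- pipeline and creating a new one with the same name generates a new
-- pipeline identifier, so the new pipeline would no longer deduplicate
-- against previously loaded data.
CREATE OR REPLACE PIPELINE page_views_pipeline
  -- ... original source, Kafka configuration, and target clauses ...
;

-- Resume loading. With write_offsets enabled (the default), consumers
-- continue from the last offset committed to Kafka for the pipeline's
-- group_id; otherwise the auto.offset.reset policy decides the start.
START PIPELINE page_views_pipeline;
```

To scale consumption of a topic, raise the cores parameter on the pipeline rather than running a second pipeline with the same consumer group identifier, which leads to unpredictable duplication.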