Load Data
Loading data into the Ocient System using the data pipeline functionality is as simple as writing a SQL statement that queries data from supported sources. The loading operation applies transformations in real time and maximizes performance by using all available processing resources.
With data pipelines, you manage data loading activities using Data Definition Language (DDL) statements similar to the ones used for managing database tables. You can access all information about pipelines by querying the system catalog tables. You can execute operations on pipelines with SQL statements through standard interfaces such as JDBC and ODBC.
A data pipeline is the primary way that the Ocient System loads data. Each data pipeline is a database object that defines the end-to-end processing of data for the extraction, transformation, and load into Ocient tables. Data pipelines execute across your Ocient System, coordinating parallel loading tasks across many Loader Nodes, and many pipelines can run concurrently.
Creating a pipeline is complex because there are many options to consider. Use the PREVIEW PIPELINE SQL statement to preview the pipeline creation and the data load to ensure that the pipeline definition returns the results you expect. After you are satisfied with the results, execute the CREATE PIPELINE statement to create the pipeline at scale.
When you create a pipeline, you assign it a name, and the pipeline exists as an object in the Ocient System associated with that name. Then, you can control the object by this name with SQL statements such as START PIPELINE, STOP PIPELINE, CREATE OR REPLACE PIPELINE, and DROP PIPELINE. You can also define your own functions by using the CREATE PIPELINE FUNCTION SQL statement. Execute these statements using a SQL connection.
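For illustration, a typical pipeline lifecycle might look like the following sketch. The pipeline name orders_pipeline is a placeholder, and the ellipses stand for the full pipeline definition described later in this topic.

```sql
-- Preview the definition and a sample of the load before creating at scale.
PREVIEW PIPELINE ...;

-- Create the named pipeline object (full definition elided).
CREATE PIPELINE orders_pipeline ...;

-- Start and stop loading; the pipeline retains its position in the load.
START PIPELINE orders_pipeline;
STOP PIPELINE orders_pipeline;

-- Replace the definition while keeping the current load position.
CREATE OR REPLACE PIPELINE orders_pipeline ...;

-- Remove the pipeline object when it is no longer needed.
DROP PIPELINE orders_pipeline;
```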
A pipeline maintains its own position during a load and enforces deduplication logic to ensure that the Ocient System only loads data once. The lifecycle of a pipeline defines both the deduplication logic and load position.
- Pipeline Events: During the life of a pipeline, you can start, stop, modify, and resume the pipeline without duplicating source data or losing the position in a load.
- Pipeline Updates: You can modify pipelines using the CREATE OR REPLACE SQL statement to update the transforms in a pipeline while maintaining the current position in the load.
- Load Position: If target tables are truncated or if a target table is dropped and then recreated, the pipeline maintains its own position in the load and will continue from its last position.
Pipelines also gracefully handle many error cases and system failure modes. The system stores any errors that occur in the system catalog in association with the pipeline.
For more details on Deduplication and Error Tolerance settings, see:
Pipelines can operate in either BATCH or CONTINUOUS mode, depending on how you plan to load your data.
The DDL for Data Pipelines has three sections.
| Section | Syntax |
|---|---|
| Data Source | SOURCE ... |
| Data Format | EXTRACT FORMAT ... |
| Transformations | INTO table SELECT ... |
For example, the full SQL syntax for the orders_pipeline data pipeline has this format. The pipeline uses an S3 data source and loads data into the orders table in the public schema. The SELECT SQL statement in the CREATE PIPELINE statement selects the id, user_id, product_id, and other columns from the source data.
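A sketch of such a statement appears below. The SOURCE and EXTRACT options shown (ENDPOINT, BUCKET, FILTER, and FIELD_DELIMITER) are illustrative placeholders, so consult the CREATE PIPELINE reference for the exact option names.

```sql
-- Illustrative sketch only; source and format option names are placeholders.
CREATE PIPELINE orders_pipeline
SOURCE S3
    ENDPOINT 'https://s3.us-east-1.amazonaws.com'  -- assumed S3 connection details
    BUCKET 'example-bucket'
    FILTER 'orders/*.csv'
EXTRACT FORMAT DELIMITED
    FIELD_DELIMITER ','                            -- assumed record format details
INTO public.orders SELECT
    $1 AS id,
    $2 AS user_id,
    $3 AS product_id,
    TO_TIMESTAMP($4, 'YYYY-MM-DD') AS created_at;  -- convert a date field during loading
```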
The SOURCE part in a pipeline defines the source of the data to load. This DDL part identifies the type of the source (e.g., S3, Kafka) as well as the details for the data that are relevant to the source (e.g., S3 bucket and filters, Kafka topic and consumer configuration).
The EXTRACT part in a pipeline defines the format of the data that the Ocient System extracts from the data source. This part includes the data format (e.g., delimited, binary, JSON) and the details about the records (e.g., record delimiter, how to treat empty data). In addition, the Ocient System extracts metadata that you can load into tables.
The INTO table SELECT ... SQL statement defines the target tables where the data loads and uses a SELECT SQL statement to extract fields from the source data (e.g., CSV field index $5, JSON selector $order.user_name) and map them to target columns (e.g., ... as my_column).
The SELECT statement in a pipeline can utilize a subset of SQL functions to convert the extracted data into the required format during loading (e.g., TO_TIMESTAMP($my_date, 'YYYY-MM-DD')).
The selectors you use to extract data from the source records can differ based on the data format (e.g., JSON or DELIMITED).
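For example, a JSON-format pipeline might address fields by name with JSON selectors, while a delimited pipeline addresses them by index (as in the earlier sketch). The selectors and column names below are illustrative.

```sql
-- Transformation section only (fragment); JSON selectors reference fields by name.
INTO public.orders SELECT
    $order.id AS id,
    $order.user_name AS user_name,
    TO_TIMESTAMP($order.created_at, 'YYYY-MM-DD') AS created_at
```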
For loading workflows, see these examples:
This state machine shows the state transitions for a data pipeline. You can start and stop data pipelines using DDL statements. The status of the pipeline transitions automatically as the pipeline completes all assigned work or reaches a failed state.
While executing, the pipeline lists files or connects to streaming sources and creates tasks to execute work. These states reflect the overall progress of a pipeline and capture details of the underlying tasks that execute the work of the pipeline.
| State | Description |
|---|---|
| Created | The pipeline is created, but no user has started it. |
| Running | The pipeline is running after a user starts the pipeline. |
| Stopped | The pipeline is not processing data after a user stops the pipeline. |
| Completed | The pipeline finished all assigned work according to the error limits defined for the pipeline. |
| Failed | The pipeline failed to complete and no longer runs due to error limits defined for the pipeline. |
In this state diagram, the transition arrows display the user actions that can trigger a state change. State transitions without a label are system-initiated transitions.

During pipeline operation, all of the information you need to observe progress, pipeline status, key events, success or failure, and errors is available in system catalog tables in the Ocient System. In addition, performance counters are available for many key loading metrics that you can add to observability systems.
Data pipelines support robust error-handling capabilities crucial for developing new pipelines, identifying issues with ongoing operations, and correcting bad data. Pipelines can include a bad data target where the Ocient System saves bad data for troubleshooting. In addition, the system captures errors that you encounter during loading in system catalog tables such as sys.pipeline_events and sys.pipeline_errors. Exploring the data in these tables enables you to detect issues and identify the root cause of errors during loading.
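For example, you might review captured errors with a query along these lines; the pipeline_name filter column is an assumption, so check the catalog table definitions for the exact schema.

```sql
-- Review loading errors captured for a pipeline (filter column is assumed).
SELECT *
FROM sys.pipeline_errors
WHERE pipeline_name = 'orders_pipeline';
```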
Pipelines require Loader Nodes to operate. Loader Nodes are Nodes that have the streamloader role assignment. The loading system can use all active Nodes with this role to process pipelines.
When you execute a pipeline, Ocient spreads the work among the available Loader Nodes. The system executes these parallel processes as tasks on each of the Loader Nodes. Then, the system partitions the files or Kafka partitions to load across the tasks, and loading proceeds in parallel. When all tasks complete, the pipeline completes. Many pipelines can run in parallel on the Loader Nodes, and these Nodes share resources.
The scale-out architecture of the Ocient System allows you to add more Loader Nodes as needed to expand the volume and complexity of pipelines that you can operate at the throughput required by your application.

As pipelines process records from a data source, the Ocient System builds the resulting rows into a row-based storage structure called a page. The system optimizes pages for rapid ingestion of data. The system stores pages on Foundation Nodes to ensure data integrity.
As pages accumulate, Loader Nodes convert pages into a columnar storage format named a Segment. Segments are highly compressed data structures that include data, indexes, and other statistical metadata to optimize query performance. After the Ocient System creates Segments, the system forms them into groups and transfers them to Foundation Nodes for storage.
You do not typically need to be concerned with or control these internal operations. However, it is important to understand that pipeline throughput can be affected by the parallel processing of many pipelines on shared resources, the number of indexes, and the type of compression used on your tables.