Load Data
loading data into the {{ocienthyperscaledatawarehouse}} using the data pipeline functionality is as simple as writing a sql statement to query data from different supported sources the loading operation provides real time transformations and maximizes performance using all available processing resources with data pipelines, you manage data loading activities using data definition language (ddl) statements similar to the ones used for managing database tables you can access all information about pipelines by querying the system catalog tables you can execute operations on pipelines with a command line statement using standard interfaces like jdbc and odbc data pipelines overview a data pipeline is the primary way that the {{ocient}} system loads data each data pipeline is a database object that defines the end to end processing of data for the extraction, transformation, and load into ocient tables data pipelines execute across your ocient system to coordinate parallel loading tasks across many loader nodes and pipelines the pipeline object creating a pipeline is complex as there are many options to consider use the preview pipeline sql statement to preview the pipeline creation and load of data to ensure that the pipeline definition returns the results you expect after you are satisfied with the results, you can execute the create pipeline statement to create the pipeline at scale when you create a pipeline, you assign it a name and the pipeline exists as an object in the ocient system connected to this name then, you can control the object using your chosen name with sql statements like start pipeline , stop pipeline , create or replace pipeline , and drop pipeline you can also define your own function by using the create pipeline function sql statement execute these statements using a sql connection a pipeline maintains its own position during a load and enforces deduplication logic to ensure that the ocient system only loads data once the lifecycle of a pipeline defines both the deduplication logic and load position pipeline events during the life of a pipeline, you can start, stop, modify, and resume the pipeline without duplicating source data or losing the position in a load pipeline updates you can modify pipelines using the create or replace sql statement to update the transforms in a pipeline, but maintain the current position in the load load position if target tables are truncated or if a target table is dropped and then recreated, the pipeline maintains its own position in the load and will continue from its last position pipelines also gracefully handles many error cases and system failure modes the pipeline stores any errors that occur in the system catalog in association with the pipeline for more details on deduplication and error tolerance settings, see deduplication in data pipelines docid\ hoh2fxe xikbvznelzw6z error tolerance in data pipelines docid\ xniw zlvowxm 16x95wzr data pipelines docid\ xtmbp ie gg8t5kl5hx4i parts of a pipeline pipelines can operate in either a batch or continuous mode based on how you plan to load your data the ddl for data pipelines has three sections section syntax data source source data format extract format transformations into table select thus, for example, the full sql syntax has this format for the orders pipeline data pipeline this syntax has an s3 data source and loads data into the orders table in the public schema the select sql statement in the create pipeline statement selects the id , user id , product id , and other columns from a table create pipeline orders pipeline source s3 endpoint ' ' bucket ' ' filter ' ' extract format delimited into "public" "orders" select $1 as id, $2 as user id, $3 as product id, data source the source part in a pipeline defines the source of the data to load this ddl part identifies the type of the source (e g , {{aws}} s3, {{kafka}} ) as well as the details for the data that is relevant to the source (e g , s3 bucket and filters, kafka topic and consumer configuration) data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i data format the extract part in a pipeline defines the format of the data that the ocient system extracts from the data source this part includes the data format (e g , delimited, binary, json) and the details about the records (e g , record delimiter, how to treat empty data) in addition, the ocient system extracts metadata that you can load into tables data formats for data pipelines docid\ vsircgculjvf1h6bcq f data formats for data pipelines docid\ vsircgculjvf1h6bcq f data formats for data pipelines docid\ vsircgculjvf1h6bcq f data formats for data pipelines docid\ vsircgculjvf1h6bcq f data formats for data pipelines docid\ vsircgculjvf1h6bcq f data pipelines docid\ xtmbp ie gg8t5kl5hx4i transformations the into table select sql statement defines the target tables where the data loads and uses a select sql statement to extract fields from the source data (e g , csv field index $5 , json selector $order user name ) and map them to target columns (e g , as my column ) the select statement in a pipeline can utilize a subset of sql functions to convert the extracted data into the required format during loading (e g , to timestamp($my date, 'yyyy mm dd') ) the selectors you use to extract data from the source records can differ based on the data format (e g , json or delimited ) transform data in data pipelines docid\ idj spvbjhjhoyxwlptk3 transform data in data pipelines docid\ idj spvbjhjhoyxwlptk3 data formats for data pipelines docid\ vsircgculjvf1h6bcq f for loading workflows, see these examples data pipeline load of csv data from s3 docid\ ogq1j ldi6zxdmc5w02ub data pipeline load of json data from kafka docid 7w4zkd39v661cjmbydn2c data pipeline load of parquet data from s3 docid\ uoun0eje0shyuxq pzwws pipeline lifecycle this state machine shows the state transitions for a data pipeline you can start and stop data pipelines using ddl statements the status of the pipeline transitions automatically as the pipeline completes all assigned work or reaches a failed state while executing, the pipeline lists files or connects to streaming sources and creates tasks to execute work these states reflect the overall progress of a pipeline and capture details of the underlying tasks that execute the work of the pipeline state description created the pipeline is created, but no user has started it zero tasks in the created state zero files listed or no kafka consumer group created running the pipeline is running after a user starts the pipeline at least one task is in the queued , running , or cancelling states at least one file listed or kafka consumer group created stopped the pipeline is not processing data after a user stops the pipeline all tasks are in the queued , complete , failed , or cancelled states completed the pipeline finished all assigned work according to the error limits defined for the pipeline all tasks in are in the complete state failed the pipeline failed to complete and no longer runs due to error limits defined for the pipeline at least one task in the failed state all tasks are in the queued , complete , failed , or cancelled states in this state diagram, the transition arrows display the user actions that can trigger a state change state transitions without a label are system initiated transitions observe pipelines during pipeline operation, all of the information you need to observe progress, pipeline status, key events, success or failure, and errors is available in system catalog tables in the ocient system in addition, performance counters are available for many key loading metrics that you can add to observability systems monitor data pipelines docid\ ogysmxakcnncrhaxeurbc system catalog troubleshooting pipelines data pipelines support robust error handling capabilities crucial for developing new pipelines, identifying issues with ongoing operations, and correcting bad data pipelines can include a bad data target where the ocient system saves bad data for troubleshooting in addition, the system captures errors that you encounter during loading in system catalog tables such as sys pipeline events and sys pipeline errors exploring the data in these tables enables you to detect issues and identify the root cause of errors during loading manage errors in data pipelines docid\ rgsf79ofxqyeedrg9ckew frequently asked questions for data pipelines docid\ gaxtp8ghcygvo2ncchokd data pipeline loading errors docid\ h7bwnbkqw2r0cmmymeila data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i loading architecture pipelines require loader nodes to operate loader nodes are {{ocient}} nodes that have the streamloader role assignment the loading system can use all active nodes with this role to process pipelines when you execute a pipeline, ocient spreads the work among the available loader nodes the system executes these parallel processes as tasks on each of the loader nodes then, the system partitions the files or kafka partitions to load across the tasks, and loading proceeds in parallel when all tasks are completed, the pipeline is completed many pipelines can run in parallel on the loader nodes, and these nodes share resources the scale out architecture of the ocient system allows you to add more loader nodes as needed to expand the volume and complexity of pipelines that you can operate at the throughput required by your application ocient loading architecture that consists of sql client connections that interface with sql nodes, data sources that interface with loader nodes, and sql and loader nodes interface with foundation nodes for storage and processing internal operations of a pipeline as pipelines process records from a data source, the ocient system builds these rows into a row based storage structure called a page the system optimizes pages for rapid ingestion of data the system stores pages on foundation nodes to ensure data integrity as pages accumulate, loader nodes convert pages into a columnar storage format named a segment segments are highly compressed data structures that include data, indexes, and other statistical metadata to optimize query performance after the ocient system creates segments, the system forms them into groups and transfers them to foundation nodes for storage you do not typically need to be concerned with or control these internal operations however, it is important to understand that pipeline throughput can be affected by the parallel processing of many pipelines on shared resources, the number of indexes, and the type of compression used on your tables related links data control language (dcl) statement reference docid\ cq7pzjpzxsakbbsqsumai data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i data pipelines docid\ xtmbp ie gg8t5kl5hx4i frequently asked questions for data pipelines docid\ gaxtp8ghcygvo2ncchokd