# Data Pipelines
Data pipelines enable you to load data from your chosen source into the {{ocienthyperscaledatawarehouse}}.

You can preview a data pipeline using the `PREVIEW PIPELINE` SQL statement to test the creation statement and any required data transformation. Then, you can create a data pipeline using the `CREATE PIPELINE` statement. To manage the load, use the `START PIPELINE` statement to start the load and the `STOP PIPELINE` statement to stop it. You can rename a data pipeline using the `ALTER PIPELINE RENAME` statement. To see the full definition of a created pipeline, use the `EXPORT PIPELINE` statement. When you finish with the load, you can use the `DROP PIPELINE` statement to remove the data pipeline.

You can create user-defined data pipeline functions using the `CREATE PIPELINE FUNCTION` statement and remove a function using the `DROP PIPELINE FUNCTION` statement.

Also, you can administer privileges for data pipelines and data pipeline functions. For details, see docid: asr8r6xqiyofgaz5qnbiw.

## CREATE PIPELINE

`CREATE PIPELINE` defines a data pipeline that you can execute with the `START PIPELINE` SQL statement. Specify the type of load, data source, and data format.

You must have the `ALTER` privilege on the pipeline. For details, see docid: asr8r6xqiyofgaz5qnbiw.

### Syntax

```sql
CREATE [ OR REPLACE ] [ CONTINUOUS | BATCH ] PIPELINE [ IF NOT EXISTS ] pipeline_name
[ <advanced_pipeline_options> ]
[ BAD_DATA_TARGET (<kafka_bad_data_target>) ]
SOURCE (<s3_source> | <filesystem_source> | <kafka_source>)
EXTRACT
    (<delimited_extract_options> | <json_extract_options> | <parquet_extract_options> | <binary_extract_options>)
    [ <general_extract_options> ]
INTO destination_table_name SELECT
    expression AS col_alias_target,
    expression2 AS col_alias_target2,
    ...
[ WHERE filter_expression ]

--- Source Options ---

<s3_source>:
S3
    BUCKET bucket_name
    <s3_source_paths>
    <file_based_source_options>
    <file_monitor_options>
    [ ENDPOINT s3_endpoint ]
    [ ACCESS_KEY_ID access_key_credentials ]
    [ SECRET_ACCESS_KEY secret_key_credentials ]
    [ MAX_CONCURRENCY parallel_connections ]
    [ READ_TIMEOUT num_seconds ]
    [ REGION region ]
    [ REQUEST_DEPTH num_requests ]
    [ REQUEST_RETRIES num_retries ]
    [ HEADERS headers ]

<filesystem_source>:
FILESYSTEM
    <file_based_source_paths>
    <file_based_source_options>
    <file_monitor_options>

<s3_source_paths>:
    <file_based_source_paths> | <object_key_source_paths>

<file_based_source_paths>:
    (FILTER | FILTER_GLOB | FILTER_REGEX) file_filter
    [ PREFIX file_prefix ]

<object_key_source_paths>:
    OBJECT_KEY object_key

<kafka_source>:
KAFKA
    TOPIC topic_name
    BOOTSTRAP_SERVERS bootstrap_servers
    [ WRITE_OFFSETS write_offsets ]
    [ CONFIG config_option ]
    [ AUTO_OFFSET_RESET ]

<file_based_source_options>:
    [ COMPRESSION_METHOD 'gzip' ]
    [ SORT_BY ('filename' | 'created' | 'modified') [ SORT_DIRECTION (ASC | DESC) ] ]
    [ START_TIMESTAMP start_timestamp ]
    [ END_TIMESTAMP end_timestamp ]
    [ START_FILENAME start_filename ]
    [ END_FILENAME end_filename ]

<file_monitor_options>:
FILE_MONITOR
    <general_monitor_options>
    <sqs_monitor_options> | <kafka_monitor_options>

<general_monitor_options>:
    MONITOR_TYPE ['sqs' | 'kafka']
    [ POLLING_INTERVAL_SECOND polling_interval_second ]
    [ BATCH_MIN_FILE_COUNT batch_min_file_count ]
    [ BATCH_TIMEOUT_SECOND batch_timeout_second ]
    [ DEDUPLICATION_BOUNDARY_TARGET_HOURS deduplication_boundary_target_hours ]

<sqs_monitor_options>:
    SQS_QUEUE_URL sqs_queue_url
    SQS_ENDPOINT sqs_endpoint
    [ ACCESS_KEY_ID access_key_id ]
    [ SECRET_ACCESS_KEY secret_access_key ]
    [ REGION region ]

<kafka_monitor_options>:
    BOOTSTRAP_SERVERS bootstrap_servers
    TOPIC topic
    AUTO_OFFSET_RESET ['latest' | 'earliest']
    [ GROUP_ID group_id ]
    [ CONSUME_BATCH_SIZE consume_batch_size ]
    [ CONSUME_TIMEOUT_MS consume_timeout_ms ]
    [ FILE_INFO_SELECTORS file_info_selectors ]

--- Extract Options ---

<general_extract_options>:
    [ CHARSET_NAME charset_name ]
    [ COLUMN_DEFAULT_IF_NULL column_default_if_null ]

<delimited_extract_options>:
    FORMAT ('delimited' | 'csv')
    [ COMMENT_CHAR comment_char ]
    [ EMPTY_FIELD_AS_NULL empty_field_as_null ]
    [ ESCAPE_CHAR escape_char ]
    [ FIELD_DELIMITER field_delimiter ]
    [ FIELD_OPTIONALLY_ENCLOSED_BY enclosure_char ]
    [ NUM_HEADER_LINES num_header_lines ]
    [ RECORD_DELIMITER record_delimiter ]
    [ SKIP_EMPTY_LINES skip_empty_lines ]
    [ OPEN_ARRAY open_array_char ]
    [ CLOSE_ARRAY close_array_char ]
    [ ARRAY_ELEMENT_DELIMITER array_delimiter ]
    [ OPEN_OBJECT open_object_char ]
    [ CLOSE_OBJECT close_object_char ]

<json_extract_options>:
    FORMAT 'json'

<parquet_extract_options>:
    FORMAT 'parquet'

<binary_extract_options>:
    FORMAT 'binary'
    RECORD_LENGTH length_in_bytes
    [ ENDIANNESS ('big' | 'little') ]
    [ AUTO_TRIM_PADDING auto_trim_padding ]
    [ PADDING_CHARACTER padding_character ]

--- Bad Data Target Options ---

<kafka_bad_data_target>:
KAFKA
    TOPIC topic_name
    BOOTSTRAP_SERVERS bootstrap_servers
    [ CONFIG config_option ]

--- Advanced Options ---

<advanced_pipeline_options>:
    [ CORES processing_cores ]
    [ PARTITIONS file_partitions ]
    [ BATCH_SIZE number_of_rows ]
    [ RECORD_NUMBER_FORMAT record_number_format ]
```

The `FILE_MONITOR` option works only with an {{aws}} S3 or file system source. This option is required for continuous data pipelines that you create using the `CONTINUOUS` keyword. For {{kafka}}, you cannot use this option.

### Pipeline Identity and Naming

The name of a data pipeline is unique in an {{ocient}} system. Reference the pipeline in other SQL statements, such as `START PIPELINE`, `STOP PIPELINE`, and `DROP PIPELINE`, using the name of the pipeline. The SQL statement throws an error if a pipeline with the same name already exists unless you specify the `IF NOT EXISTS` option. You can rename a pipeline with the `ALTER PIPELINE` SQL statement.

### Update a Pipeline

You can update a pipeline with the `OR REPLACE` clause in the `CREATE PIPELINE` SQL statement. Use this clause when you want to continue loading from the current place in a continuous load, but you need to modify transformations or other settings. If you specify the `OR REPLACE` clause and the pipeline already exists, the database replaces the original pipeline object with the options specified in the new `CREATE OR REPLACE PIPELINE` statement. When you replace an existing pipeline, the pipeline retains its current position in the source data so that data is not duplicated when the pipeline is resumed with a `START PIPELINE` SQL statement. First, you must stop a pipeline before executing the `CREATE OR REPLACE` SQL statement.

### Batch and Continuous Data Pipeline Modes

You can define pipelines in either batch or continuous mode. File sources (e.g., `S3`, `FILESYSTEM`) support batch and continuous modes; file-based loads default to batch mode if you do not specify this keyword. Kafka supports only continuous mode, and loads with a Kafka source default to continuous mode if you do not specify this keyword.

When you execute the `START PIPELINE` SQL statement:
- For batch mode, the system creates a static list of files with the `PENDING` status in the `sys.pipeline_files` system catalog table.
- For continuous mode, the monitor appends new incoming files to the list of files in the `sys.pipeline_files` system catalog table.
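To illustrate the update workflow described above, here is a minimal sketch, assuming a Kafka-backed pipeline named `orders_pipeline` and a `public.orders` target table (hypothetical names): stop the pipeline, replace its definition, and then resume it from the retained position.

```sql
-- Hypothetical names throughout; adapt to your own pipeline and table.
STOP PIPELINE orders_pipeline;

-- Replace the definition. The pipeline keeps its position in the source data,
-- so no records are re-read when it resumes.
CREATE OR REPLACE PIPELINE orders_pipeline
SOURCE KAFKA
    BOOTSTRAP_SERVERS '192.168.0.1:9092'
    TOPIC 'orders'
EXTRACT
    FORMAT JSON
INTO public.orders SELECT
    $id AS id,
    decimal($tax, 8, 2) AS tax;  -- modified transformation

-- Resume loading from the retained position.
START PIPELINE orders_pipeline;
```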
### Required Options

You must define certain options in every pipeline. Optional options need to be present only if you use specific functions. A pipeline must contain the `SOURCE`, `EXTRACT FORMAT`, and `INTO table_name SELECT` statements. For details, see [Source Options](#source-options), [Extract Options](#extract-options), docid: 1 jhzrhblgnqucl6skiaq, and docid: aimcafoydn2xf fgqssys.

Some options depend on each other: the `IF NOT EXISTS` and `OR REPLACE` clauses are mutually exclusive.

### SELECT Statement and Data Transformation

Use the `INTO table_name SELECT` SQL statement in the `CREATE PIPELINE` SQL statement to specify how to transform the data. The `SELECT` statement includes a set of one or more expressions and a target column name in the form `expression AS column_name`. The expression part of the statement contains a source field reference (e.g., `$1` or `$my_field.subfield`) and, optionally, the transformation function you want to apply.

If your data is nested in arrays, you can use special transformation functions, such as the docid: 7h6mczxhldiameojdksvu function, to expand the data into individual rows.

For details about data transformation and supported transformation functions, see docid: aimcafoydn2xf fgqssys. For details about data types and casting, see docid: 7s5nztl8lolpwt2pcnjyz and docid: 1 jhzrhblgnqucl6skiaq.

You can also specify metadata values, such as the filename, to load in the `SELECT` SQL statement. For details, see docid: a3n4wkcawrpo1gtefetmm.

This is an example of a transformation statement snippet:

```sql
FORMAT JSON
INTO public.orders SELECT
    timestamp(bigint($created_timestamp)) AS ordertime,
    metadata('filename') AS source_filename,
    $order_number AS ordernumber,
    $customer_first_name AS fname,
    left($customer_middle_initial, 1) AS minitial,
    $customer_last_name AS lname,
    $postal_code AS postal_code,
    $promo_code AS promo_code,
    $order_total AS ordertotal,
    decimal($tax, 8, 2) AS tax,
    char[]($line_items[].product_name) AS product_names,
    char[]($line_items[].sku) AS skus
```

Optionally, you can filter the load by using the `WHERE` clause in the form `WHERE filter_expression`. This expression should evaluate to a Boolean value or NULL and can include more than one filter expression. The system loads rows that contain data matching the filter criteria when the expression evaluates to true. The system does not load the rows when the expression evaluates to false or NULL. You can include any transform in the `WHERE` clause, as in the `SELECT` clause.

This is an example of a filter snippet that loads order data for customers whose last name starts with the letter "a". Use the `coalesce` function to return only non-NULL values.

```sql
FORMAT JSON
INTO public.orders SELECT
    timestamp(bigint($created_timestamp)) AS ordertime,
    metadata('filename') AS source_filename,
    $order_number AS ordernumber,
    $customer_first_name AS fname,
    left($customer_middle_initial, 1) AS minitial,
    $customer_last_name AS lname,
    $postal_code AS postal_code,
    $promo_code AS promo_code,
    $order_total AS ordertotal,
    decimal($tax, 8, 2) AS tax,
    char[]($line_items[].product_name) AS product_names,
    char[]($line_items[].sku) AS skus
WHERE startswith(coalesce($customer_last_name, ''), 'a')
```

### NULL and Default Value Handling

The Ocient system has specific ways to handle NULL and default values in the data pipeline. Consider this information when you prepare data to load and write the `CREATE TABLE` and `CREATE PIPELINE` SQL statements.

To load NULL values, insert the NULL value in the data. In this case, if you specify `NOT NULL` for the target column in the `CREATE TABLE` statement, the data pipeline fails to load.
If you omit a column in the `SELECT` SQL statement, the data pipeline loads the default value for the column. If you do not specify the default value, the pipeline loads a NULL value. If you specify `NOT NULL` for the target column, the data pipeline also fails to load.

You can use the `DEFAULT` keyword to load a default value. In this case, if the column does not have a default value, the pipeline fails to load.

If the pipeline loads a NULL into a column with a specified default value, you can use `coalesce(<value>, DEFAULT)` to insert the default value instead of the NULL value, where `<value>` is the NULL column. You can modify the load of one column at a time in this way.

This table describes the data pipeline behavior for NULL or omitted column values.

| Value in the Data Pipeline | Nullable Target Column | Default Value in Target Column | Resulting Data Pipeline Behavior |
|---|---|---|---|
| NULL | No | Value might or might not be set | The pipeline uses the default value if the default exists and you specify the `COLUMN_DEFAULT_IF_NULL` option. Otherwise, the pipeline fails. |
| NULL | Yes | Value might or might not be set | The pipeline uses the default value if the default exists and you specify the `COLUMN_DEFAULT_IF_NULL` option. Otherwise, the pipeline uses the NULL value. |
| Omitted column in the `SELECT` SQL statement | NULL value might or might not be set | Yes | The pipeline uses the default value. |
| Omitted column in the `SELECT` SQL statement | No | No | The pipeline fails. |
| Omitted column in the `SELECT` SQL statement | Yes | No | The pipeline uses the NULL value. |
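For instance, a minimal sketch of the `coalesce(<value>, DEFAULT)` technique described above, assuming a hypothetical `public.orders` target table whose `promo_code` column defines a default value:

```sql
-- If $promo_code is NULL in the source record, load the column default instead.
INTO public.orders SELECT
    $order_number AS ordernumber,
    coalesce($promo_code, DEFAULT) AS promo_code
```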
### Required Privileges

You must have the `CREATE PIPELINE` privilege on the underlying database and the `VIEW` privilege on each table in the pipeline definition to execute the `CREATE PIPELINE` SQL statement. The table must already exist. See the `START PIPELINE` SQL statement for the privileges required to execute a pipeline. For details, see docid: asr8r6xqiyofgaz5qnbiw.

### Limitations

A pipeline can contain, at most, 4,194,304 files. When loading more than 4,194,304 files, you must partition files into smaller batches and serially execute multiple pipelines. Use `PREFIX` and `FILTER` settings to control these partitions. It is not recommended to create multiple pipelines with a total of more than 4,194,304 files across the entire system simultaneously.

### Examples

#### Load JSON Data from Kafka

This example loads JSON data from Kafka using the `CREATE PIPELINE` SQL statement.
- Use the bootstrap server `192.168.0.1:9092` with the Kafka topic `orders`.
- Load data into the `public.orders` table.
- Specify the data to load as these JSON selectors: identifier `$id`, user identifier `$user_id`, product identifier `$product_id`, subtotal amount `$subtotal`, tax `$tax`, total amount `$total`, discount amount `$discount`, created time `$created_at`, and quantity `$quantity`.

```sql
CREATE PIPELINE orders_pipeline
SOURCE KAFKA
    BOOTSTRAP_SERVERS '192.168.0.1:9092'
    TOPIC 'orders'
EXTRACT
    FORMAT JSON
INTO public.orders SELECT
    $id AS id,
    $user_id AS user_id,
    $product_id AS product_id,
    $subtotal AS subtotal,
    $tax AS tax,
    $total AS total,
    $discount AS discount,
    $created_at AS created_at,
    $quantity AS quantity;
```

For a complete tutorial, see docid: 2ua4kqafcqplu gi6oorz.

#### Load Delimited Data from S3

This example loads delimited data in CSV format from S3.
- Use the `https://s3.us-east-1.amazonaws.com` endpoint with the `ocient-examples` bucket and the path `metabase_samples/csv/orders.csv`.
- Specify one header line with the `NUM_HEADER_LINES` option.
- Load data into the `public.orders` table.
- Specify the data to load using the column numbers: identifier, user identifier, product identifier, subtotal amount, tax, total amount, discount amount, created time, and quantity.

```sql
CREATE PIPELINE orders_pipeline
SOURCE S3
    ENDPOINT 'https://s3.us-east-1.amazonaws.com'
    BUCKET 'ocient-examples'
    FILTER 'metabase_samples/csv/orders.csv'
EXTRACT
    FORMAT CSV
    NUM_HEADER_LINES 1
INTO public.orders SELECT
    $1 AS id,
    $2 AS user_id,
    $3 AS product_id,
    $4 AS subtotal,
    $5 AS tax,
    $6 AS total,
    $7 AS discount,
    $8 AS created_at,
    $9 AS quantity;
```

For a complete tutorial, see docid: 5xxtimfhjnlyxs48wxsxs.

#### Continuous File Load of Delimited Data in CSV Files

Create a data pipeline that uses continuous file loading with delimited data in CSV files.
- Specify 32 partitions and 16 cores with the `PARTITIONS` and `CORES` options, respectively.
- The source is S3 with the `http://endpoint.ocient.com` endpoint and the `cs-data` bucket.
- Specify the filter `'*.csv'` to find all files with a filename that matches a glob pattern without subdirectories, for example, `data.csv`. The system filters out filenames with subdirectories, such as `data/data_sample.csv`.
- For continuous file loading, specify the `FILE_MONITOR` option. Use the Kafka monitor by specifying the `MONITOR_TYPE` option. Use 15 seconds as the timeout for waiting before creating another loading job with pending files with the `BATCH_TIMEOUT_SECOND` option. Using the `BATCH_MIN_FILE_COUNT` option, specify a minimum of 10 pending files to start a new load. Then, specify 10 hours for looking back for file deduplication with the `DEDUPLICATION_BOUNDARY_TARGET_HOURS` option.
- Specify the `test-broker:9092` bootstrap server and the `cfl_kafka_ten_adtech_flat_small` Kafka topic, and reset the offset to the smallest offset using the `earliest` value of the `AUTO_OFFSET_RESET` option. Set the client group identifier to `84079bf1-bdc4-4b10-ba12-41ba6b17dffe`. Allow a maximum of 10 messages to poll each time with a timeout of 1,000 milliseconds using the `CONSUME_BATCH_SIZE` and `CONSUME_TIMEOUT_MS` options, respectively.
- The format is CSV with the record delimiter as the newline character `\n`. There are 39 fields in the CSV data.
- Load data into the `public.ad_sessions` table. The `SELECT` statement identifies the columns to load by number and transforms each column using cast functions. For details, see docid: bvivszia9qovupkl4waxs for each function and docid: zcon ufstf4uhc5airgpg for the `to_timestamp` function.

```sql
CREATE CONTINUOUS PIPELINE test_continuous_pipeline
PARTITIONS 32
CORES 16
SOURCE S3
    ENDPOINT 'http://endpoint.ocient.com/'
    BUCKET 'cs-data'
    FILTER '*.csv'
    FILE_MONITOR(
        MONITOR_TYPE KAFKA
        BATCH_TIMEOUT_SECOND 15
        BATCH_MIN_FILE_COUNT 10
        DEDUPLICATION_BOUNDARY_TARGET_HOURS 10
        BOOTSTRAP_SERVERS 'test-broker:9092'
        TOPIC 'cfl_kafka_ten_adtech_flat_small'
        AUTO_OFFSET_RESET 'earliest'
        GROUP_ID '84079bf1-bdc4-4b10-ba12-41ba6b17dffe'
        CONSUME_BATCH_SIZE 10
        CONSUME_TIMEOUT_MS 1000
    )
EXTRACT
    FORMAT CSV
    RECORD_DELIMITER e'\n'
    NUM_FIELDS 39 AS source_field
INTO public.ad_sessions SELECT
    to_timestamp(char($1), 'yyyy-MM-dd HH:mm:ss.SSSSSS', 'java') AS event_date_time,
    char(null_if($2, '')) AS device_model,
    tinyint($8) AS device_user_age,
    boolean($10) AS device_ad_tracking_disabled,
    binary(null_if($11, '')) AS device_mac,
    int($14) AS ip_zip,
    float($19) AS ip_zip_latitude,
    double($20) AS ip_zip_longitude,
    bigint($21) AS session_id,
    smallint($32) AS session_response_latency,
    decimal($34, 10, 1) AS session_transaction_revenue,
    char($39) AS session_app_name
FROM source_field;
```
### Source Options

#### File-Based Source Options

These options apply to both the `S3` and `FILESYSTEM` sources.

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `FILTER` \| `FILTER_GLOB` \| `FILTER_REGEX` | None | Required | string | The expression for filtering files in the directory or child directories to load. This pattern applies to the full path to the file, except the bucket for S3-compatible file sources. ℹ️ Paths include a leading forward slash. Specify one of these options:<br>`FILTER_GLOB` or `FILTER` — regular {{unix}} filename patterns. Supported options include: `*` indicates unlimited wildcard characters except for the path character `/`; `**` indicates wildcard characters including path separators; `?` indicates a single wildcard character.<br>`FILTER_REGEX` — regular expression patterns to filter files. Some common patterns include: `.` matches any character; `*` matches any 0 or more of the preceding character; `+` matches one or more of the preceding character; `[135]` matches any one character in the set; `[1-5]` matches any one character in the range; `(a\|b)` matches a or b.<br>ℹ️ When you use the `FILTER` or `FILTER_GLOB` options with a continuous data pipeline, the system supports only basic globbing; extended, range globbing is not supported.<br>**`FILTER_GLOB` examples:** List all CSV files in subdirectories named `2024` that are in any subdirectory of the `trades` directory, which is in the root of the bucket: `FILTER_GLOB = '/trades/*/2024/*.csv'`. List all CSV files in the bucket in all subdirectories: `FILTER_GLOB = '**/*.csv'`.<br>**`FILTER_REGEX` examples:** List all `.json.gz` files in the bucket that contain the name `orders` anywhere in the path: `FILTER_REGEX = '.*orders.*\.json\.gz'`. List all files in the bucket in the root path `metrics` or `values` where the date string in the filename is 2000 to 2004: `FILTER_REGEX = '/(metrics\|values)/.*200[0-4].*'`. |
| `PREFIX` | None | Optional | string or array of strings | Specify a prefix in addition to a filter. The prefix must end with a forward slash. The Ocient system uses the more specific option between the specified prefix and filter. Paths include a leading forward slash. The system applies the prefix to S3 sources to restrict the scope of listing operations and improve performance. Use this option along with `FILTER_REGEX` to enforce the listing prefix of your choice. When you apply a list of prefixes, the system loads the union of the set of files found in each of the paths specified by the prefix. ℹ️ This option is not available for continuous data pipelines.<br>**Examples:** List all CSV files in the `2024` subdirectory of `files`: `PREFIX '/files/2024/' FILTER '**/*.csv'`. List all JSON files of subdirectory `2024/09/` in the `orders` directory: `PREFIX '/data/orders/2024/09/' FILTER_REGEX '.*/orders/.*json'`. List all JSON files in multiple subdirectories: `PREFIX ['/data/orders/2024/09/', '/data/orders/2024/10/']`. |
| `OBJECT_KEY` | None | Optional | string or array of strings | Specify an S3 object key to load. You cannot use this option if the `PREFIX` option or any filter option is set. ℹ️ Object keys do not include a leading forward slash. The pipeline loads the object key or keys listed in the pipeline. Loading this way can be faster by avoiding listing of files in large directories. Object keys must not include any of the special characters `?\{}[]`. ℹ️ This option is not available for continuous data pipelines.<br>**Examples:** Load a single object from the designated bucket at the specified object key: `OBJECT_KEY 'order_data/jsonl/orders_20251101.jsonl'`. Load a list of objects from the designated bucket at the specified object keys: `OBJECT_KEY ['order_data/jsonl/orders_20251101.jsonl', 'order_data/jsonl/orders_20251201.jsonl']`. |
| `COMPRESSION_METHOD` | None | Optional | string | The method the Ocient system uses to decompress file data. The supported option is `gzip`. To load uncompressed data, omit the `COMPRESSION_METHOD` option. ℹ️ The `COMPRESSION_METHOD` option is only applicable to file-based sources. Kafka pipelines automatically decompress data based on the Kafka topic configuration. |
| `SORT_BY` | `filename` | Optional | string | The sort criteria for sorting the file list before the load. Supported options are: `filename` — sort files lexicographically by the filename; `created` — sort files based on the timestamp that indicates the creation of the file; `modified` — sort files based on the timestamp that indicates the modification of the file. ℹ️ This option is not available for continuous data pipelines. |
| `SORT_DIRECTION` | `ASC` | Optional | string | The sort direction, either ascending (`ASC`) or descending (`DESC`), determines the sort order. If you specify this option, you must also specify the `SORT_BY` option. ℹ️ This option is not available for continuous data pipelines. |
| `START_TIMESTAMP` | None | Optional | timestamp | The ISO 8601-compliant date or date-time that is used as the lower bound to filter files for batch pipelines. The time zone should match the file metadata (inclusive). When you use this option with the `SORT_BY` option set to `filename` or `modified`, the system uses the modification timestamp of the file for the comparison. When you use this option with the `SORT_BY` option and the `created` value, the system uses the creation timestamp of the file for the comparison. You can use this option without the `END_TIMESTAMP` option. If you specify the `END_TIMESTAMP` option, you must specify the `START_TIMESTAMP` option before the `END_TIMESTAMP` option. ℹ️ This option is not available for continuous data pipelines. |
| `END_TIMESTAMP` | None | Optional | timestamp | The ISO 8601-compliant date or date-time that is used as the upper bound to filter files for batch pipelines. The time zone should match the file metadata (inclusive). When you use this option with the `SORT_BY` option set to `filename` or `modified`, the system uses the modification timestamp of the file for the comparison. When you use this option with the `SORT_BY` option and the `created` value, the system uses the creation timestamp of the file for the comparison. You can use this option without the `START_TIMESTAMP` option. If you specify the `END_TIMESTAMP` option, you must specify the `START_TIMESTAMP` option before the `END_TIMESTAMP` option. ℹ️ This option is not available for continuous data pipelines. |
| `START_FILENAME` | None | Optional | string | The filename string that is the lower bound to filter files lexicographically for batch pipelines (inclusive). Use the full path of the file, such as `'/dir/load_file.json'`. If the file is located in the top-most directory, start the path with a slash `/`, such as `'/load_file.json'`. You can use this option without the `END_FILENAME` option. If you specify the `END_FILENAME` option, the value of the `START_FILENAME` option must be lexicographically smaller than the value of the `END_FILENAME` option. |
| `END_FILENAME` | None | Optional | string | The filename string that is the upper bound to filter files lexicographically for batch pipelines (inclusive). Use the full path of the file, such as `'/dir/load_file.json'`. If the file is located in the top-most directory, start the path with a slash `/`, such as `'/load_file.json'`. You can use this option without the `START_FILENAME` option. If you specify the `END_FILENAME` option, the value of the `START_FILENAME` option must be lexicographically smaller than the value of the `END_FILENAME` option. |
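As an illustration of how the file-based source options combine, here is a minimal sketch of a batch pipeline restricted to a one-month window of files; the bucket, paths, timestamps, and table are hypothetical:

```sql
-- Hypothetical bucket, paths, and table. Because SORT_BY is 'modified',
-- the timestamps bound files by their modification time (inclusive).
CREATE PIPELINE orders_window_pipeline
SOURCE S3
    BUCKET 'ocient-examples'
    PREFIX '/data/orders/'
    FILTER '**/*.csv'
    SORT_BY 'modified'
    SORT_DIRECTION ASC
    START_TIMESTAMP '2024-01-01T00:00:00Z'
    END_TIMESTAMP '2024-01-31T23:59:59Z'
EXTRACT
    FORMAT CSV
    NUM_HEADER_LINES 1
INTO public.orders SELECT
    $1 AS id,
    $2 AS user_id;
```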
#### S3 Source Options

You can apply these options to data sources of the `SOURCE S3` type, which include S3 and S3-compatible services.

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `BUCKET` | None | Required | string | The name of the bucket in AWS S3. |
| `ENDPOINT` | None | Optional | string | The endpoint URI for the S3-compatible service API (e.g., `https://s3.us-east-2.amazonaws.com`). If you provide the `ENDPOINT` option, the Ocient system ignores settings for the `REGION` option. |
| `REGION` | `'us-east-1'` | Optional | string | The region that the Ocient system uses for AWS access. If you specify the `ENDPOINT` option, the system ignores this option. |
| `ACCESS_KEY_ID` | | Optional | string | The access key identification for AWS credentials. If you specify this option, you must also specify the `SECRET_ACCESS_KEY` option. The Ocient system uses anonymous credentials when you specify an empty value for this option. |
| `SECRET_ACCESS_KEY` | | Optional | string | The secret key for AWS credentials. If you specify this option, you must also specify the `ACCESS_KEY_ID` option. The Ocient system uses anonymous credentials when you specify an empty value for this option. |
| `MAX_CONCURRENCY` | 50 | Optional | integer | Determines the number of parallel connections the Ocient system uses to communicate with the AWS S3 service. ⚠️ This option does not require modification in most cases. Contact Ocient Support to modify these values. |
| `READ_TIMEOUT` | 0 (unlimited timeout) | Optional | integer | The number of seconds until a read operation times out. ⚠️ This option does not require modification in most cases. Contact Ocient Support to modify these values. |
| `REQUEST_DEPTH` | 500 | Optional | integer | The upper boundary of requests that the Ocient system handles concurrently. ⚠️ This option does not require modification in most cases. Contact Ocient Support to modify these values. |
| `REQUEST_RETRIES` | 10 | Optional | integer | The number of times the AWS SDK retries failing requests before the Ocient system throws an error. ⚠️ This option does not require modification in most cases. Contact Ocient Support to modify these values. |
| `HEADERS` | None | Optional | string | The headers to send with every request. This option is a JSON-formatted string. Represent the chosen header names as keys with corresponding values as scalars or lists of scalars. The system converts scalars that are not strings to strings. During the load, the system maps each element in a list to the header name represented by the corresponding key.<br>**Examples:** `HEADERS '{"x-amz-request-payer": "requester"}'` returns this header for each request: `x-amz-request-payer: requester`. `HEADERS '{"header-name": ["list", "of", "values"]}'` returns this header for each request: `header-name: list, of, values`. |
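A short sketch combining several of the S3 source options in the table above; the endpoint, bucket, credentials, and table are placeholders:

```sql
-- Placeholder endpoint, bucket, credentials, and table.
CREATE PIPELINE s3_credentialed_pipeline
SOURCE S3
    ENDPOINT 'https://s3.us-east-2.amazonaws.com'
    BUCKET 'example-bucket'
    FILTER '**/*.csv'
    ACCESS_KEY_ID 'AKIAEXAMPLEKEYID'
    SECRET_ACCESS_KEY 'exampleSecretKeyValue'
    HEADERS '{"x-amz-request-payer": "requester"}'
EXTRACT
    FORMAT CSV
    NUM_HEADER_LINES 1
INTO public.orders SELECT
    $1 AS id,
    $2 AS user_id;
```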
#### Filesystem Source Options

No options exist specific to the file system source (`SOURCE FILESYSTEM`) except for the general [File-Based Source Options](#file-based-source-options).

When you load data using `SOURCE FILESYSTEM`, the files must be addressable from all of your Loader Nodes. The Ocient system uses the specified path in the pipeline `PREFIX` and `FILTER` options to select the files to load. A shared view of the files you want to load must be available to all Loader Nodes involved in a pipeline. For example, you can use a network file system (NFS) mount available to all Loader Nodes at a common path on each node.

**Example**

This example `CREATE PIPELINE` SQL statement snippet contains a `FILESYSTEM` source and filters to all CSV files in the `/tmp/sample_data/` directory on each of the Loader Nodes.

```sql
CREATE PIPELINE ...
SOURCE FILESYSTEM
    PREFIX '/tmp/sample_data/'
    FILTER '**/*.csv'
EXTRACT
    FORMAT DELIMITED
INTO public.orders SELECT
    $1 AS username,
    $2 AS subtotal,
    ...
```

#### Kafka Source Options

You can apply these options to Kafka data sources (`SOURCE KAFKA`).

For compression, you do not need to specify a compression option in Kafka-based pipelines, because the Ocient system handles the compression type automatically. Records produced to the Kafka broker with a `compression.type` setting, or with the compression type set on the topic, automatically decompress when the loading process consumes the records. The loading process uses built-in headers in Kafka to determine the required decompression during extraction.

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `TOPIC` | None | Required | string | The name of the Kafka topic that indicates where to consume records. |
| `BOOTSTRAP_SERVERS` | None | Required | string | A comma-delimited list of IP:port pairs that contain the IP addresses and the associated port numbers of the Kafka brokers. You can also use a hostname instead of the IP address. Example: `BOOTSTRAP_SERVERS = '198.51.100.1:9092,198.51.100.2:9092'` |
| `WRITE_OFFSETS` | true | Optional | boolean | Indicates whether the Kafka consumer should write its durably made record offsets to the Kafka broker. |
| `CONFIG` | `'{ "enable.auto.commit": false, "key.deserializer": "org.apache.kafka.common.serialization.ByteArrayDeserializer", "value.deserializer": "org.apache.kafka.common.serialization.ByteArrayDeserializer", "group.id": <database_name>_<pipeline_name> }'` | Optional | string | The [consumer configuration](https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html) that the Kafka consumers should use. This option is a JSON-formatted string. Certain values within this configuration are fixed, whereas the Ocient system provides other values with a default value that you can modify. |
| `AUTO_OFFSET_RESET` | None | Optional | string | Determines which action to take for the Kafka configuration when there is no initial offset in the offset store or the specified offset is out of range. Supported values are: `'smallest'` or `'earliest'` — automatically reset the offset value to the smallest value; `'largest'` or `'latest'` — automatically reset the offset value to the largest value; `'error'` — throw an error. |

For the consumer configuration, to create a secure connection from the Kafka consumer to a Kafka broker, set the `"security.protocol"` key along with any SSL or SASL keys. If a certificate file is required, you must add it to the truststore used by the {{jvm}} on all Loader Nodes. The truststore path must be identical on all Loader Nodes. The Kafka configuration can reference this truststore path.

##### Kafka CONFIG Option Defaults

When you configure the `CONFIG` option for a Kafka source, the Ocient system merges the values you specify with default values that Ocient uses when the system creates a Kafka consumer. You cannot override some values that Ocient provides using the `CONFIG` option.

| Key | Default Value | Override Allowed |
|---|---|---|
| `group.id` | `<database_name>_<pipeline_name>` | Yes |
| `enable.auto.commit` | false | No |
| `key.deserializer` | `"org.apache.kafka.common.serialization.ByteArrayDeserializer"` | No |
| `value.deserializer` | `"org.apache.kafka.common.serialization.ByteArrayDeserializer"` | No |
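For a secure connection, a hedged sketch of a consumer `CONFIG` that sets the `security.protocol` key described above and overrides the `group.id` default (which the table above marks as overridable); the broker, topic, group, and table names are placeholders, and any required certificate must already be in the JVM truststore on every Loader Node:

```sql
-- Placeholder broker, topic, group, and table names.
CREATE PIPELINE secure_kafka_pipeline
SOURCE KAFKA
    BOOTSTRAP_SERVERS 'broker1.example.com:9093'
    TOPIC 'orders'
    CONFIG '{"security.protocol": "SSL", "group.id": "orders_tls_group"}'
EXTRACT
    FORMAT JSON
INTO public.orders SELECT
    $id AS id;
```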
#### Continuous File Loading Source Options

Specify these options for data pipelines that use S3 and file system sources with the continuous mode. For a Kafka source, do not use these options.

##### General File Monitor Options

The `FILE_MONITOR` option is required for data pipelines with the continuous mode and S3 and file system sources.

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `MONITOR_TYPE` | None | Required | string | The type of monitor. If the specified string is not one of the defined monitors, compilation fails. Valid values are `sqs` or `kafka`. |
| `POLLING_INTERVAL_SECOND` | 10 | Optional | int | The number of seconds between polls that consume the event topic. Valid value range: 10 to 120. |
| `BATCH_TIMEOUT_SECOND` | 60 | Optional | int | When the number of pending files is smaller than `BATCH_MIN_FILE_COUNT`, the number of seconds to wait before creating another loading job with pending files. Valid value range: 10 to 1800. |
| `BATCH_MIN_FILE_COUNT` | 100 | Optional | int | The minimum number of pending files to start a new load. Valid value range: 10 to 2000. |
| `DEDUPLICATION_BOUNDARY_TARGET_HOURS` | 24 | Optional | int | The number of hours to look back for file deduplication. Valid value range: 0 to 48. |

##### SQS Monitor Options

Use these options when you set the `MONITOR_TYPE` option to `sqs` for {{sqs}}.

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `SQS_ENDPOINT` | None | Required | string | The endpoint URL for the client, for example `http://localhost:32769`. |
| `SQS_QUEUE_URL` | None | Required | string | The URL of the target queue, for example `http://localhost:32769/000000000000/queue1`. |
| `ACCESS_KEY_ID` | `<empty>` | Optional | string | The access key identifier for SQS authentication. `ACCESS_KEY_ID` and `SECRET_ACCESS_KEY` must be provided as a pair. If you do not provide a secret, continuous file loading reverts to anonymous access. |
| `SECRET_ACCESS_KEY` | `<empty>` | Optional | string | The secret access key for SQS authentication. |
| `REGION` | `us-east-1` | Optional | string | The region for SQS operation. |

##### Kafka Monitor Options

Use these options when you set the `MONITOR_TYPE` option to `kafka` for Kafka.

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `BOOTSTRAP_SERVERS` | None | Required | string | The bootstrap servers for the Kafka consumer configuration. The string contains a list of brokers as a comma-separated list of broker hostnames or broker hostname and port number combinations, in the format `hostname:port`. |
| `TOPIC` | None | Required | string | The name of the Kafka topic that indicates where to consume records. |
| `AUTO_OFFSET_RESET` | None | Required | string | The action to take for the Kafka configuration when there is no initial offset in the offset store or the chosen offset is out of range: `'smallest'`, `'earliest'` — automatically reset the offset to the smallest offset; `'largest'`, `'latest'` — automatically reset the offset to the largest offset; `'error'` — trigger an error (`ERR__AUTO_OFFSET_RESET`), retrieved by consuming messages and checking `message->err`. |
| `GROUP_ID` | `cfl` | Optional | string | The client group identifier for the Kafka configuration. All clients that share the same group identifier belong to the same group and therefore share the same committed offsets. |
| `CONSUME_BATCH_SIZE` | 100 | Optional | int | The maximum number of messages to poll each time. |
| `CONSUME_TIMEOUT_MS` | 1000 | Optional | int | The operation timeout, in milliseconds, that controls how long the consume request waits for the response. |
| `FILE_INFO_SELECTORS` | `[$"Records"[1].s3.object.key, $"Records"[1]."eventTime", $"Records"[1].s3.object.size]`, which corresponds to an `s3:ObjectCreated:Put` event | Optional | array(string) | Defines three JSON selectors for the filename, the last modification timestamp of the file, and the file size of an incoming message in a monitor. The selectors must follow this order. The default value works for `s3:ObjectCreated` events. If you override this default configuration, you can parse custom JSON messages. The format of the individual fields must still adhere to S3 standards. For details, see https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-content-structure.html. The message must be in the JSON format. |
| `CONFIG` | `'{ "enable.auto.commit": false, "group.id": "cfl", "max.poll.interval.ms": "600000", "heartbeat.interval.ms": "6000", "session.timeout.ms": "600000" }'` | Optional | string | The [consumer configuration](https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html) that the Kafka consumers should use. This option is a JSON-formatted string. Certain values within this configuration are fixed, whereas the Ocient system provides other values with a default value that you can modify. The same Kafka `CONFIG` option override considerations apply. For details, see [Kafka Source Options](#kafka-source-options). |
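To show how the monitor options in this section fit into a pipeline definition, here is a minimal sketch of a continuous load driven by an SQS monitor; the endpoint, queue URL, bucket, and table are placeholders:

```sql
-- Placeholder endpoint, queue, bucket, and table names.
CREATE CONTINUOUS PIPELINE sqs_monitored_pipeline
SOURCE S3
    ENDPOINT 'https://s3.us-east-1.amazonaws.com'
    BUCKET 'example-bucket'
    FILTER '*.csv'
    FILE_MONITOR(
        MONITOR_TYPE SQS
        SQS_ENDPOINT 'https://sqs.us-east-1.amazonaws.com'
        SQS_QUEUE_URL 'https://sqs.us-east-1.amazonaws.com/000000000000/queue1'
        BATCH_MIN_FILE_COUNT 10
        BATCH_TIMEOUT_SECOND 30
    )
EXTRACT
    FORMAT CSV
INTO public.orders SELECT
    $1 AS id,
    $2 AS user_id;
```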
### Extract Options

#### General Extract Options

You can specify these options for any of the allowed format types.

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `FORMAT` | None | Required | string | Specifies the format of the files to load. Supported values are: `delimited`, `csv`, `json`, `binary`. |
| `CHARSET_NAME` | Binary format: `IBM1047`; all other formats: `UTF-8` | Optional | string | Specifies the character set for decoding data from the source records into character data. Use this character set when you load data into `VARCHAR` columns or when you apply `char` transformation functions. Defaults to `UTF-8` for all formats except binary, which defaults to `IBM1047`. You can configure the default value for binary-formatted data using a SQL statement such as: `ALTER SYSTEM ALTER CONFIG SET 'sql.pipelineparameters.extract.binary.defaultcharset' = 'ibm1047'` |
| `COLUMN_DEFAULT_IF_NULL` | false | Optional | boolean | Specifies whether pipelines should load the column default value when the result of a series of transforms is NULL. If you set this option to false (the default), the pipeline loads NULL values into columns. If you set this option to true, the pipeline loads the defined default value of the column when the result of the execution of transformations on the column is NULL. |
| `NULL_STRINGS` | `delimited` or `csv` format: `['NULL', 'null']`; `json` format: `[]` | Optional | array of strings | Specifies string values that should represent a NULL value when extracted from the source records. Use this in `csv`, `delimited`, and `json` formats to convert specific values to NULL instead of requiring individual transform function calls to `null_if` with those values. This option applies to source data immediately after extraction. If the result of transformations is one of these strings, you must use `null_if` to transform to NULL with the specified string values. |
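A brief sketch of how these general extract options can appear together in an `EXTRACT` clause; the `'N/A'` sentinel is an assumed value in the source data, not an Ocient default:

```sql
EXTRACT
    FORMAT CSV
    -- Treat these sentinel strings as NULL immediately after extraction.
    NULL_STRINGS ['NULL', 'null', 'N/A']
    -- Load the column default instead of NULL when a transform result is NULL.
    COLUMN_DEFAULT_IF_NULL true
```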
#### Delimited and CSV Extract Options

You can specify these options for the `FORMAT DELIMITED` or `FORMAT CSV` data formats, which are aliases. For details on working with delimited data, see docid: 1 jhzrhblgnqucl6skiaq.

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `COMMENT_CHAR` | NULL | Optional | string | Specifies the character used to comment out a record in the source file. The load skips records where the first character of a record is equal to this character. Set this option to NULL or `''` to turn off the detection of these control characters. Example: `COMMENT_CHAR '#'` |
| `EMPTY_FIELD_AS_NULL` | true | Optional | boolean | Specifies whether the Ocient system should extract an empty source field as NULL or a missing value. When this option is set to true, the Ocient system treats empty fields as NULL. Otherwise, the system treats fields as a missing value. For string-type fields, a missing value is equivalent to an empty string. An empty field is defined as two consecutive delimiters (e.g., the second field is empty in `abc,,xyz`). If a field is explicitly an empty string as indicated by quote characters, the Ocient system treats the field as an empty string, not an empty field (e.g., the second field is an empty string in `abc,"",xyz`). The `char($1)` transformation function handles both NULLs and empty strings and passes them through. ⚠️ Beware that the `null_if($1, '')` transformation function directly loads a NULL for both NULLs and missing values. This transformation can override the behavior of `EMPTY_FIELD_AS_NULL`. |
| `FIELD_OPTIONALLY_ENCLOSED_BY` | `"` | Optional | string | Also known as the "quote character," this option specifies the character for optionally enclosing fields. Fields enclosed by this character can include delimiters, the enclosure character, or the escape character. Set this option to NULL or `''` to turn off the detection of these control characters. Examples: use a double quote character: `FIELD_OPTIONALLY_ENCLOSED_BY = '"'`; use a single quote character: `FIELD_OPTIONALLY_ENCLOSED_BY = ''''` |
| `ESCAPE_CHAR` | `"` | Optional | string | Specifies the escape character within fields enclosed by the `FIELD_OPTIONALLY_ENCLOSED_BY` option. Use this option to escape the enclosure character or escape character. Set this option to NULL or `''` to turn off the detection of these control characters. Examples: use a double quote as the escape character: `ESCAPE_CHAR = '"'`; use a single quote as the escape character: `ESCAPE_CHAR = ''''` |
| `FIELD_DELIMITER` | `','` | Optional | string or array of strings | Specifies a character or list of possible characters for delimiting fields. The default value sets the field delimiter to only a comma. The value must be one byte. The Ocient system automatically interprets the values you specify as C-style escaped strings, so an escape string (`e'some value'`) is not required to specify control characters. This differs from the default string behavior in Ocient. For details, see docid: qcf0x9ao4a56x id39pkr. `FIELD_DELIMITER` and `FIELD_DELIMITERS` are aliases. Examples: use a tab character: `FIELD_DELIMITER = '\t'`; use a pipe character: `FIELD_DELIMITER = '\|'`; use either a pipe or a comma character: `FIELD_DELIMITER = [',', '\|']` |
| `NUM_HEADER_LINES` | 0 | Optional | integer | Specifies the number of header lines, typically 0 or 1. The Ocient system skips this number of lines and does not load them as data when files are processed. Use this option when your data includes a row of header values. |
| `RECORD_DELIMITER` | `['\r\n', '\n']` | Optional | string or array of strings | Specifies the string or an array of strings for delimiting records. The file is split into individual records using this character during processing. Common values include `'\r\n'` and `'\n'`. The value must be one or two bytes. The Ocient system automatically interprets the values you specify as C-style escaped strings; an escape string (`e'some value'`) is not required to specify control characters. This differs from the default string behavior in Ocient. The system chooses the first specified delimiter and uses that delimiter for the rest of the data; data with mixed delimiters is not supported. For details, see docid: qcf0x9ao4a56x id39pkr. `RECORD_DELIMITER` and `RECORD_DELIMITERS` are aliases. Examples: use a linefeed character: `RECORD_DELIMITER = '\n'`; use a carriage return and linefeed character sequence: `RECORD_DELIMITER = '\r\n'` |
| `SKIP_EMPTY_LINES` | false | Optional | boolean | Specifies whether or not to skip empty lines. |
| `CLOSE_ARRAY` | `']'` | Optional | string | Specifies the character that indicates the end of an array in a CSV or delimited field. Use this option to parse array data types. Specify the `OPEN_ARRAY` option also when using this option. Set this option to NULL or `''` to turn off the detection of these control characters; if you set this option to either of these values, the system also turns off the detection of these characters for the `OPEN_ARRAY` option. Example: convert source data such as `val1,"{1,2,3}",val2` to an array when referenced as `$2[]`: `CLOSE_ARRAY '}' OPEN_ARRAY '{' ARRAY_ELEMENT_DELIMITER ','` |
| `OPEN_ARRAY` | `'['` | Optional | string | Specifies the character that indicates the start of an array in a CSV or delimited field. Use this option to parse array data types. Specify the `CLOSE_ARRAY` option also when using this option. Set this option to NULL or `''` to turn off the detection of these control characters; if you set this option to either of these values, the system also turns off the detection of these characters for the `CLOSE_ARRAY` option. Example: convert source data such as `val1,"[1,2,3]",val2` to an array when referenced as `$2[]`: `OPEN_ARRAY '[' CLOSE_ARRAY ']' ARRAY_ELEMENT_DELIMITER ','` |
| `CLOSE_OBJECT` | `'}'` | Optional | string | Specifies the character that indicates the end of a tuple in a field. Use this option to parse tuple data types. Specify the `OPEN_OBJECT` option also when using this option. Set this option to NULL or `''` to turn off the detection of these control characters; if you set this option to either of these values, the system also turns off the detection of these characters for the `OPEN_OBJECT` option. |
| `OPEN_OBJECT` | `'{'` | Optional | string | Specifies the character that indicates the start of a tuple in a field. Use this option to parse tuple data types. Specify the `CLOSE_OBJECT` option also when using this option. Set this option to NULL or `''` to turn off the detection of these control characters; if you set this option to either of these values, the system also turns off the detection of these characters for the `CLOSE_OBJECT` option. |
| `ARRAY_ELEMENT_DELIMITER` | `','` | Optional | string | Specifies the character that separates values in an array within a CSV or delimited field. Use this option to parse array data types. Set this option to NULL or `''` to turn off the detection of these control characters. Example: convert source data such as `val1,"[1;2;3]",val2` to an array when referenced as `$2[]`: `ARRAY_ELEMENT_DELIMITER = ';'` |

When you specify the escape character, you often have to use an escape sequence. This action follows standard SQL rules.

#### JSON Extract Options

No options exist for JSON data record extraction (`FORMAT JSON`). For details about JSON-formatted data, see docid: 1 jhzrhblgnqucl6skiaq.

#### Parquet Extract Options

No options exist for {{parquet}} data record extraction (`FORMAT PARQUET`). For details about Parquet-formatted data, see docid: 1 jhzrhblgnqucl6skiaq.

When you use the `FORMAT PARQUET` option with an AWS S3 source, the `ENDPOINT` option is required.

#### Binary Extract Options

You can apply these options to binary data record extraction (`FORMAT BINARY`). For details about binary-formatted data, see docid: 1 jhzrhblgnqucl6skiaq.

The general option `CHARSET_NAME` has a different default value for `FORMAT BINARY`.

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `RECORD_LENGTH` | None | Required | integer | Specifies the fixed size in bytes of each record in the source data. The Ocient system splits the binary data into binary chunks according to this length value and processes them individually. |
| `ENDIANNESS` | `'big'` | Optional | string | Specifies the endianness used to interpret multi-byte sequences in various transforms of binary data. Accepted values are `'big'` and `'little'`. |
| `AUTO_TRIM_PADDING` | true | Optional | bool | Determines whether padding characters should be trimmed after decoding the binary data into string data. If you set this option to true, the Ocient system trims all instances of the padding character value from the end of a string after the system decodes the string from the binary type. |
| `PADDING_CHARACTER` | (space) | Optional | string | The padding character that the system trims from the end of the string after it decodes the string from the binary type. You can change the default padding character for binary-formatted data using a SQL statement such as: `ALTER SYSTEM ALTER CONFIG SET 'sql.pipelineparameters.extract.binaryformat.defaultpaddingcharacter' = ' '`. The Ocient system trims the default padding character of a space from the end of the text data in binary data. |
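A minimal sketch of the `EXTRACT` clause for a fixed-width binary source, assuming hypothetical 64-byte, EBCDIC-encoded, space-padded records:

```sql
-- Hypothetical fixed-width source: each record is exactly 64 bytes,
-- big-endian, IBM1047 (EBCDIC) encoded, and padded with spaces.
EXTRACT
    FORMAT BINARY
    RECORD_LENGTH 64
    ENDIANNESS 'big'
    AUTO_TRIM_PADDING true
    PADDING_CHARACTER ' '
    CHARSET_NAME 'IBM1047'
```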
### Bad Data Targets

Bad data represents records that the Ocient system could not load due to errors in the transformations or invalid data in the source records. You can provide options for a bad data target that the Ocient system uses during pipeline execution to capture the records that are not loaded. The original bytes that the pipeline tried to load are captured in the bad data target along with metadata about the error, such as the error message or source.

Kafka is the only supported bad data target.

#### Kafka Bad Data Target

When you use Kafka as a bad data target, the Ocient system produces the original bytes of the source record into the Kafka topic of your choice. The Ocient system includes the metadata about the record in the header of the record as it is sent to Kafka. You can configure the Kafka topic on your Kafka brokers using the retention and partition settings of your choice.

In the event that the Kafka broker is unreachable when the Ocient system attempts to produce a bad data record to the bad data target, the system logs an error on the Loader Node and the pipeline continues.

**Example**

This example `CREATE PIPELINE` SQL statement snippet contains a bad data target definition using the `BAD_DATA_TARGET` option.

```sql
CREATE PIPELINE ...
BAD_DATA_TARGET KAFKA
    TOPIC 'orders_errors'
    BOOTSTRAP_SERVERS '111.11.111.1:9092,111.11.111.2:9092'
    CONFIG '{"compression.type": "gzip"}'
SOURCE ...
EXTRACT ...
INTO public.orders SELECT
    $order_billing_name AS username,
    $order_subtotal AS subtotal,
    ...
```

#### Kafka Bad Data Target Options

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `TOPIC` | None | Required | string | The name of the Kafka topic where the Ocient system should produce bad data records. |
| `BOOTSTRAP_SERVERS` | None | Required | string | A comma-delimited list of IP:port pairs that contain the IP addresses of the Kafka brokers and the associated port numbers. Example: `BOOTSTRAP_SERVERS = '111.11.111.1:9092,111.11.111.2:9092'` |
| `CONFIG` | `'{ "compression.type": "none" }'` | Optional | string | The [producer configuration](https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html) that the Kafka producer should use. This option is a JSON-formatted string. |

### Advanced Pipeline Tuning Options

You can use pipeline tuning options to control the parallelism or batching dynamics of your pipelines. This tuning can throttle the resources used on a pipeline or increase parallel processing across Loader Nodes. These options are advanced settings that might require a detailed understanding of the underlying mechanics of the loading infrastructure in the Ocient system to employ. Due to the inherent nature of each source type, the behavior of these parameters can differ between file-based and Kafka-based loads.

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `CORES` | The maximum number of CPU cores available on each Loader Node | Optional | integer | The maximum number of processing threads that the Ocient system uses during execution on each Loader Node. The system creates this number of threads on each Loader Node. The Ocient system automatically determines the default value by finding the number of cores of a Loader Node. You can use this option for performance tuning. The calculation for maximum parallelism of a pipeline is `number_of_loaders * cores`.<br>**About Kafka partitions and parallelism:** For Kafka loads, this option determines the number of Kafka consumers created on each Loader Node. For Kafka pipelines, the recommendation is that `number_of_loaders * cores` equals the number of Kafka topic partitions. If this number exceeds the number of Kafka topic partitions, the work might spread unevenly across Loader Nodes. If this number is less than the number of Kafka topic partitions, some Kafka consumers might receive uneven amounts of work. In this case, use a value for `number_of_loaders * cores` that is an even divisor of the number of Kafka topic partitions to avoid a skew in the rates of processing across partitions. |
| `PARTITIONS` | Equal to the value of `CORES` | Optional | integer | Specifies the number of partitions over which to split the file list. Not applicable to Kafka loads. The Ocient system automatically sets a default value based on the configured value for the `CORES` option. You can use this option for performance tuning. The number of partitions determines how many buckets of work the Ocient system generates for each batch of files processed on a Loader Node. The pipeline processes this number of partitions in parallel using the specified number of cores. If you specify fewer partitions than cores, some cores are not fully utilized, and resources are wasted. If you specify more partitions than cores, the Ocient system divides partitions in a round-robin fashion over the available cores. |
| `BATCH_SIZE` | A dynamic value, determined by the Ocient system for each pipeline to maximize performance | Optional | integer | The number of rows in the batch to load at one time. The Ocient system automatically calculates a dynamic value depending on the table columns and the utilization of internal buffers to transfer records to the database backend. You can use this option to turn off the dynamic adjustments for performance tuning. ⚠️ Only change this setting in rare cases where loading performance is slower than expected and you have a large record size. If this setting is improperly set, pipelines might fail with out-of-memory exceptions. You can configure the default value (for the batch payload target) using a SQL statement such as: `ALTER SYSTEM ALTER CONFIG SET 'streamloader.extractorengineparameters.configurationoption.osc.batch.payload.target' = '65536'` |
| `RECORD_NUMBER_FORMAT` | For file loads that do not use the docid: 7h6mczxhldiameojdksvu function, the default is `[19, 45, 0]`. For Kafka loads that do not use the `explode_outer` function, the default is `[0, 64, 0]`. For loads that use the `explode_outer` function, the default is the load-type-specific default value with 13 subtracted from the record index bits; the system adds these bits to the bits for rows within a record. For example, the default for file loads that use this function is `[19, 32, 13]`. | Optional | array | The 64-bit record number for each record of the load. This number uniquely identifies a row within its partition. The format is an array with three values in the format `[<file_index_bits>, <record_index_bits>, <rows_per_record_index_bits>]`. The `<file_index_bits>` value is the number of bits used to represent the file index within a partition. The `<record_index_bits>` value is the number of bits used to represent the record index within a file. The `<rows_per_record_index_bits>` value is the number of bits used to represent the row within a record; the system uses this value with the `explode_outer` function. These three values must sum to 64.<br>**Example:** `RECORD_NUMBER_FORMAT = [10, 54, 0]` sets the number of file index bits to 10 and the number of record index bits to 54, allowing up to 2^10 files and 2^54 records per file. The system does not support the `explode_outer` function in this configuration because the rows-per-record index bits are 0. |
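To make the parallelism arithmetic concrete, a hedged sizing sketch: with, say, 4 Loader Nodes and a 32-partition Kafka topic (hypothetical figures), setting `CORES 8` yields 4 x 8 = 32 consumers, one per topic partition.

```sql
-- Hypothetical sizing: 4 Loader Nodes x CORES 8 = 32 Kafka consumers,
-- matching a 32-partition topic so work spreads evenly.
CREATE PIPELINE tuned_orders_pipeline
CORES 8
SOURCE KAFKA
    BOOTSTRAP_SERVERS '192.168.0.1:9092'
    TOPIC 'orders'
EXTRACT
    FORMAT JSON
INTO public.orders SELECT
    $id AS id;
```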
## DROP PIPELINE

`DROP PIPELINE` removes an existing pipeline in the current database. You cannot remove a pipeline that is running.

You must have the `DROP` privilege on the pipeline to execute this SQL statement. For details, see docid: asr8r6xqiyofgaz5qnbiw.

When you drop a pipeline, the Ocient system also removes the associated system catalog information, such as pipeline errors, events, files, partitions, and metrics.

### Syntax

```sql
DROP PIPELINE [ IF EXISTS ] pipeline_name [, ...]
```

| Parameter | Data Type | Description |
|---|---|---|
| `pipeline_name` | string | The name of the specified pipeline to remove. You can drop multiple pipelines by specifying additional pipeline names and separating each with commas. |

### Examples

Remove an existing pipeline named `ad_data_pipeline`.

```sql
DROP PIPELINE ad_data_pipeline;
```

Remove an existing pipeline named `ad_data_pipeline`, or return a warning if the Ocient system does not find the pipeline in the database.

```sql
DROP PIPELINE IF EXISTS ad_data_pipeline;
```

## PREVIEW PIPELINE

`PREVIEW PIPELINE` enables you to view the results of loading data for a specific `CREATE PIPELINE` SQL statement without creating a whole data pipeline and without storing those results in the target table. Using this SQL statement, you can iterate quickly and modify the syntax as needed to achieve your expected results. After you confirm your expected results, you can use the same syntax in the body of the `CREATE PIPELINE` statement with the appropriate source.

A table must exist in the database to serve as the target of your `PREVIEW PIPELINE` statement. This table ensures the pipeline matches the column types of the target table. However, the execution of this statement does not load data into the target table.

### Preview Sources

The `SOURCE INLINE` source type is available only for the `PREVIEW PIPELINE` SQL statement. You cannot create a data pipeline with inline source data. The other source types defined in the `CREATE PIPELINE` statement (`S3`, `KAFKA`, and `FILESYSTEM`) are compatible with the `PREVIEW PIPELINE` statement. The extract options vary by the source type to mirror the `CREATE PIPELINE` statement. The database returns 10 records by default.

### Preview Error Handling

Pipeline-level errors cause the `PREVIEW PIPELINE` SQL statement to fail. The Ocient system returns an error and no result set. However, the Ocient system accumulates record-level errors that occur during the execution of this statement in a single warning that the system returns along with the result set. Each line of the warning describes a record-level error in human-readable form or as a JSON blob, depending on the value of the `SHOW_ERRORS_AS_JSON` option. Rows or columns that encounter record-level errors have NULL values in the result set.

### Preview Limitations

Limitations of this SQL statement are:
- Before executing a `PREVIEW PIPELINE` SQL statement, you must create a table for the Ocient system to have context for the preview.
- The `COLUMN_DEFAULT_IF_NULL` option from the `CREATE PIPELINE` SQL statement has no effect on the `PREVIEW PIPELINE` SQL statement.
- The `PREVIEW PIPELINE` SQL statement does not honor the assignment of a service class based on text matching.
- These source options are not supported: `START_TIMESTAMP`, `END_TIMESTAMP`, `START_FILENAME`, `END_FILENAME`.
- When you execute two duplicate `PREVIEW PIPELINE` statements for a specific Kafka topic, the two statements share a consumer group. If the topic is small, one or both of the result sets might only be a partial result.
### Syntax

```sql
PREVIEW PIPELINE pipeline_name
[ MODE mode ]
[ SHOW_ERRORS_AS_JSON show_errors_as_json ]
SOURCE [ INLINE ] (inline_string | <s3_source> | <filesystem_source> | <kafka_source>)
[ LIMIT limit ]
EXTRACT
    FORMAT CSV
    RECORD_DELIMITER record_delimiter
    FIELD_DELIMITERS ['delim1', 'delim2', ...]
[ INTERMEDIATE_VALUES intermediate_values ]
INTO created_tablename SELECT
    preview_column_formula AS preview_column_name,
    ...
```

Though this syntax shows the CSV format, you can use the `PREVIEW PIPELINE` statement with the other formats also.

| Parameter | Data Type | Description |
|---|---|---|
| `pipeline_name` | string | The name of the specified data pipeline for the preview. |
| `created_tablename` | identifier | The identifier for the name of the table that you create before executing the `PREVIEW PIPELINE` SQL statement. |
| `preview_column_formula` | identifier | The identifier for the formula of the data to load. For example, for the data in the first field of the inline source, use `$1`. If you need to add a transformation, you can use functions to transform data, such as `concat($1, $2)`, to load the concatenation of the first two fields in the inline source data. |
| `preview_column_name` | identifier | The name of the column in the target table. |

#### SQL Statement Options

| Option Key | Default | Required or Optional | Data Type | Description |
|---|---|---|---|---|
| `MODE` | `'transform'` | Optional | string | Indicates whether to perform a validation of the `PREVIEW PIPELINE` SQL statement. Valid values are `'validate'` and `'transform'`. Set this option to `'validate'` to check that the creation of the data pipeline succeeds; if the pipeline is valid, the statement produces no output, otherwise it returns an error. Set this option to `'transform'` to retrieve a preview of the results of the pipeline. |
| `SHOW_ERRORS_AS_JSON` | false | Optional | boolean | Indicates whether to show errors. Values are true or false. If the value is true, the Ocient system returns record-level errors as JSON blobs rather than human-readable messages. |
| `SOURCE INLINE` | None | Optional | string | The string that contains data for the preview of the data pipeline load. For example, source data can be `'oci,ent,ocient\|ware,house,warehouse'`, where `\|` is the record delimiter and `,` is the field delimiter. For special characters, such as `\t`, use an escape sequence such as `e'oci,ent,oci\tent\|ware,house,ware\thouse'`. |
| `LIMIT` | 10 | Optional | int | The number of rows, specified as an integer, to return in the preview results for sources with many rows. The default value is 10 rows. |
| `INTERMEDIATE_VALUES` | false | Optional | boolean | Indicates whether to capture intermediate values during a transformation sequence. Values are true or false. If the value is true, the Ocient system appends an extra column to the result set. Each value in the column contains a JSON blob that describes the intermediate values processed for each column after each transformation. |

You must specify at least one column name in the `SELECT` part of the syntax. The name of the specified column must match the name of the column in the created table. The number of columns in the `SELECT` part can be less than those in the created table.

For definitions of other extract options, see the `CREATE PIPELINE` SQL statement options in [CREATE PIPELINE](#create-pipeline).
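For instance, a minimal validation-only preview might look like the following sketch; it assumes a hypothetical three-column `VARCHAR` table named `preview_target` already exists, and it produces no output when the definition is valid:

```sql
-- Assumes: CREATE TABLE preview_target (col1 VARCHAR, col2 VARCHAR, col3 VARCHAR);
PREVIEW PIPELINE validation_check
MODE 'validate'
SOURCE INLINE 'oci,ent,ocient|ware,house,warehouse'
EXTRACT
    FORMAT CSV
    RECORD_DELIMITER '|'
    FIELD_DELIMITERS [',']
INTO preview_target SELECT
    $1 AS col1,
    $2 AS col2,
    $3 AS col3;
```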
Examples

Preview Pipeline Using CSV Format

Preview the load of two rows of data. First, create a table to serve as the context for the load. The previewload table contains three columns with these data types: string, integer, and boolean.

CREATE TABLE previewload (col1 VARCHAR, col2 INT, col3 BOOLEAN);

Create the preview pipeline testpipeline with this data: 'hello,2,true|bye,3,false'. Specify the CSV extract format, the | record delimiter, and the , field delimiter. Load the data without transformation.

PREVIEW PIPELINE testpipeline
SOURCE INLINE 'hello,2,true|bye,3,false'
EXTRACT FORMAT CSV RECORD DELIMITER '|' FIELD DELIMITERS [',']
INTO previewload
SELECT $1 AS col1, $2 AS col2, $3 AS col3;

Output:

col1   col2  col3
-----  ----  -----
hello  2     true
bye    3     false
Fetched 2 rows

Delete the previewload table.

DROP TABLE previewload;

Preview Pipeline Using CSV Format with Escape Characters

Preview the load of two rows of data. First, create a table to serve as the context for the load. The previewload table contains three columns with these data types: string, integer, and boolean.

CREATE TABLE previewload (col1 VARCHAR, col2 INT, col3 BOOLEAN);

Create the preview pipeline testpipeline with this data: 'hello\tworld,2,true|bye\tworld,3,false'. Specify the CSV extract format, the | record delimiter, and the , field delimiter. Load the data without transformation. In this case, the data contains the special character \t, so you must escape the string by using the e escape sequence.

PREVIEW PIPELINE testpipeline
SOURCE INLINE e'hello\tworld,2,true|bye\tworld,3,false'
EXTRACT FORMAT CSV RECORD DELIMITER '|' FIELD DELIMITERS [',']
INTO previewload
SELECT $1 AS col1, $2 AS col2, $3 AS col3;

Output:

col1         col2  col3
-----------  ----  -----
hello world  2     true
bye world    3     false
Fetched 2 rows

Delete the previewload table.

DROP TABLE previewload;

Preview Pipeline Using CSV Format with Transformation

Create a table to serve as the context for the load. The previewload table contains three string columns.

CREATE TABLE previewload (col1 VARCHAR, col2 VARCHAR, col3 VARCHAR);

Create the preview pipeline testpipeline with this data: 'hello,world|bye,world'. Specify the CSV extract format, the | record delimiter, and the , field delimiter. Load the data with a transformation to concatenate the two strings and return the result in the third column.

PREVIEW PIPELINE testpipeline
SOURCE INLINE 'hello,world|bye,world'
EXTRACT FORMAT CSV RECORD DELIMITER '|' FIELD DELIMITERS [',']
INTO previewload
SELECT $1 AS col1, $2 AS col2, CONCAT($1, $2) AS col3;

Output:

col1   col2   col3
-----  -----  ----------
hello  world  helloworld
bye    world  byeworld
Fetched 2 rows

The third column contains the concatenated result of the first two columns. Delete the previewload table.

DROP TABLE previewload;
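To see how each transformation step arrived at these values, you can run a variant of the same preview with the INTERMEDIATE VALUES option. The following is a sketch only; the exact shape of the appended JSON column depends on the pipeline.

CREATE TABLE previewload (col1 VARCHAR, col2 VARCHAR, col3 VARCHAR);

PREVIEW PIPELINE testpipeline
SOURCE INLINE 'hello,world|bye,world'
EXTRACT FORMAT CSV RECORD DELIMITER '|' FIELD DELIMITERS [','] INTERMEDIATE VALUES true
INTO previewload
SELECT $1 AS col1, $2 AS col2, CONCAT($1, $2) AS col3;

DROP TABLE previewload;

The result set contains the same three columns plus an extra column of JSON blobs that describe the intermediate values produced for each column after each transformation.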
Preview Pipeline Using the Kafka Source

Create the previewload table with these columns:
- id: non-null integer
- salut: non-null string
- name: non-null string
- surname: non-null string
- zipcode: non-null integer
- age: non-null integer
- rank: non-null integer

CREATE TABLE previewload (
  id INT NOT NULL,
  salut VARCHAR(3) NOT NULL,
  name VARCHAR(10) NOT NULL,
  surname VARCHAR(10) NOT NULL,
  zipcode INT NOT NULL,
  age INT NOT NULL,
  rank INT NOT NULL);

Create the preview pipeline test_small_kafka_simple_csv. Specify the ddl_csv topic. Indicate that the Kafka consumer should not write its durably made record offsets to the Kafka broker by setting the WRITE OFFSETS option to false. Specify the bootstrap server as servername:0000 and the configuration option "auto.offset.reset": "earliest" by using the BOOTSTRAP SERVERS and CONFIG options, respectively. Limit the returned results to three rows by using the LIMIT option. Specify the CSV extract format and the \n record delimiter by using the FORMAT and RECORD DELIMITER extract options, respectively.

PREVIEW PIPELINE test_small_kafka_simple_csv
SOURCE KAFKA
  TOPIC 'ddl_csv'
  WRITE OFFSETS false
  BOOTSTRAP SERVERS 'servername:0000'
  CONFIG '{"auto.offset.reset": "earliest"}'
LIMIT 3
EXTRACT FORMAT CSV RECORD DELIMITER '\n'
INTO previewload
SELECT
  INT($1) AS id,
  CHAR($2) AS salut,
  CHAR($3) AS name,
  CHAR($4) AS surname,
  INT($5) AS zipcode,
  INT($6) AS age,
  INT($7) AS rank;

Output:

id   salut  name         surname      zipcode  age  rank
---  -----  -----------  -----------  -------  ---  ----
105  mr     jmhsuxofspx  uaofgayjugb  85573    29   2
101  mr     ijmmtbddkyh  yqbxqnkgidp  52393    43   1
109  mr     bigohpwfwmr  qcxgakpkoeu  74420    1    3
Fetched 3 rows

START PIPELINE

START PIPELINE begins the execution of the specified data pipeline, which extracts data and loads it into the target tables specified by the CREATE PIPELINE SQL statement.

When you execute the START PIPELINE SQL statement, the Ocient system creates a static list of files in the sys.pipeline_files system catalog table and marks them with the PENDING status. After the system assigns a file to an underlying task, the system marks the file as QUEUED. After the system verifies that the file exists, the system marks the file as LOADING to signify that a loader node has started reading the source data. Finally, upon successfully loading the file, the system transitions the status of the file to the terminal status LOADED.

A Kafka pipeline never enters the COMPLETED state in the information_schema.pipeline_status view. Instead, the pipeline remains RUNNING after you start it until you stop it or the pipeline reaches the error limit specified with the ERROR LIMIT option.

You must have the EXECUTE privilege on the pipeline and the INSERT privilege on any table that is a target in the pipeline. For details, see docid\ asr8r6xqiyofgaz5qnbiw

Syntax

START PIPELINE pipeline_name
[ ERROR
    [ LIMIT <integer_value> ]
    [ SCOPE PIPELINE ]
    [ FILE ERROR ( FAIL | SKIP MISSING FILE | TOLERATE ) ] ]
[ USING LOADERS <loader_names> ]
[ ON COMPLETION ( NO FLUSH | FLUSH AND WAIT | FLUSH AND RETURN ) ]

Parameters

pipeline_name (string): The name of the specified data pipeline.

SQL Statement Options

ERROR LIMIT <integer_value> (integer; default 0; optional): Error log option that determines the number of record-level errors that the load tolerates during the execution of a pipeline before the whole pipeline execution fails. <integer_value> is a number greater than or equal to -1. When you set <integer_value> to -1, the load tolerates an unlimited number of record-level errors. By default, continuous pipelines tolerate an unlimited number of record-level errors, whereas batch pipelines tolerate zero errors.

ERROR SCOPE <scope_value> (string; default PIPELINE; optional): Error log option that defines the scope at which the load applies the specified error limit. <scope_value> supports the PIPELINE keyword, which sets the scope of the error limit to the whole pipeline. When you set this option and the pipeline reaches the error limit, the database rolls back all data loaded by the pipeline.

ERROR FILE ERROR <error_action> (string; default FAIL; optional): For pipelines that load data from S3 or local file sources, this error configuration option determines how to treat unrecoverable file-level errors. Examples of unrecoverable file-level errors are:
- The file is listed when the pipeline starts but is missing later during the load.
- The gzip file is corrupted and cannot be decompressed.
- The file cannot be downloaded from the source.
- A record-level error that is not tolerable occurs when tokenizing or transforming data in the file.
<error_action> can be one of these keywords:
- FAIL: Fail the whole pipeline because of a file-level error.
- SKIP MISSING FILE: Only tolerate errors that occur due to missing files. If a file exists in the list when the pipeline starts but is missing later during the load, skip the file and continue with the next file.
- TOLERATE: Tolerate all unrecoverable file-level errors. In this mode, the load also tolerates an unlimited number of record-level errors.

The FAILED, SKIPPED, and LOADED WITH ERRORS file statuses appear in the sys.pipeline_files system catalog table for these modes, respectively, and indicate how the pipeline handled the file error.

USING LOADERS loader_names (list of strings; default none; optional): Specify one or more names of loader nodes as a comma-separated list for executing the START PIPELINE SQL statement. If you do not use this option, the Ocient system uses all active loader nodes to execute the pipeline. You can find node names in the sys.nodes system catalog table.

ON COMPLETION <completion_mode> (string; default NO FLUSH; optional): Completion type option that specifies the behavior when the pipeline finishes loading. This option determines when the remaining pages are converted into segments.
- NO FLUSH: Do not force a flush of pages. Rely on watermarks and timeouts to trigger the final conversion to segments.
- FLUSH AND WAIT: Trigger a flush of pages, initiating the final conversion to segments. The pipeline blocks and waits for the conversion to segments to complete before marking the pipeline as COMPLETED.
- FLUSH AND RETURN: Trigger a flush of pages, initiating the final conversion to segments. The Ocient system marks the pipeline as COMPLETED immediately following the flush, without waiting for the conversion to segments to complete.

For the query to execute successfully, the node names specified with USING LOADERS must identify nodes that have an active operational status and the streamloader role.

When you execute the START PIPELINE SQL statement, the Ocient system creates a static list of files in the sys.pipeline_files system catalog table only for batch pipelines, and a dynamic list for continuous pipelines.

Examples

Start an existing pipeline named ad_data_pipeline with default settings.

START PIPELINE ad_data_pipeline;

Start an existing pipeline named ad_data_pipeline with error tolerance (tolerate 10 errors before aborting the pipeline). For details about error tolerance, see docid 833vhisy1bfzcfhusxcyw

START PIPELINE ad_data_pipeline ERROR LIMIT 10;

Data pipelines log a message for each pipeline error to the sys.pipeline_errors system catalog table, even if you do not specify the ERROR option. Use the BAD DATA TARGET settings to capture the original source data.

Start an existing pipeline named ad_data_pipeline using the loader node named stream_loader1.

START PIPELINE ad_data_pipeline USING LOADERS "stream_loader1";
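You can also combine the error, loader, and completion options in one statement. The following is a sketch only, not a tested configuration; it reuses the loader node name stream_loader1 from the previous example, tolerates up to 100 record-level errors, and skips files that go missing during the load.

START PIPELINE ad_data_pipeline
ERROR LIMIT 100 SCOPE PIPELINE FILE ERROR SKIP MISSING FILE
USING LOADERS "stream_loader1"
ON COMPLETION FLUSH AND WAIT;

With ON COMPLETION FLUSH AND WAIT, the statement does not mark the pipeline as COMPLETED until the flushed pages have been converted to segments.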
Resume a Pipeline with File Loading

In many cases, a file-based pipeline stops executing before completion. You cannot resume a pipeline in the COMPLETED status. To resume a pipeline, use the START PIPELINE SQL statement. Before you resume a pipeline, the status of the pipeline must be CREATED, FAILED, or STOPPED.

When a pipeline resumes, individual files remain in their most recent status as defined in the sys.pipeline_files system catalog table. For batch pipelines, the Ocient system does not add new files to the eligible file list when the pipeline resumes.

If you modify the contents of files during the loading process, the Ocient system might experience issues with deduplication that cause duplicated rows or missing data. Avoid modifying files after you start a pipeline for the first time. Creating new files on your data source does not impact the deduplication logic.

The start operation groups files using their extractor_task_id and stream_source_id identifiers. The stream_source_id uniquely identifies partitions (i.e., an ordered list of files), and the extractor_task_id identifies the batch that loads a group of partitions.

File Statuses

The Ocient system considers files with the LOADED, LOADED WITH ERRORS, or SKIPPED statuses to be in a terminal status, whereas files with other statuses are still in process.
- Completed batches: If all the files in a particular batch have a terminal status, the pipeline does not attempt to reload the batch. These files have been completely processed, so the Ocient system ignores modifications to these files.
- In-process batches: If at least one file in a particular batch does not have a terminal status, the pipeline reloads the entire batch. The pipeline reprocesses the in-process batches and relies on row deduplication to prevent duplication of rows in the target tables. Modifications to files in an in-process batch can, but are not guaranteed to, be picked up by a restart. Modifications to any files in this batch with the LOADED, LOADED WITH ERRORS, or SKIPPED statuses might cause issues with deduplication, leading to duplicate or missing data.
- Pending files: The Ocient system does not assign all pending files to a partition. The pipeline attempts to load these files after reloading any in-process batches.

Load Duplicate Data from Files

Sometimes you might want to load the same data multiple times. If you want to load a second copy of the source data, you can either:
- Drop and recreate the pipeline to reset the sys.pipeline_files system catalog table.
- Create a second pipeline with a new name and the same configuration.

When you truncate the target tables and restart the pipeline, the Ocient system does not reload the data.

Restart with Kafka Loading

Ocient relies on the offset management and consumer group behavior in Kafka to deliver exactly-once loading semantics and to control the Ocient pipeline behavior.

Kafka Offsets and Consumer Group Identifiers

If you set the WRITE OFFSETS option to true (the default value is true), the Kafka consumers commit offsets back to Kafka after data is considered durable in the database. The Kafka broker stores these offsets as the last committed offset for the group identifier (group.id). For each pipeline, the group identifier defaults to a value built from <database_name> and <pipeline_name>, where <database_name> is the name of your database and <pipeline_name> is the name of the data pipeline. In most use cases, you should not manually change the group.id field for a pipeline.

Any Kafka pipeline that has the same group.id starts consuming from its last committed offset or, if you do not set the value, the pipeline uses the Kafka auto.offset.reset policy to determine where to start. For details, see https://docs.confluent.io/platform/current/clients/consumer.html#offset-management

If you want to start loading from the beginning of a topic, configure an unused group.id (or use a group.id that did not commit any of its offsets back) and ensure that the auto.offset.reset Kafka configuration is appropriately set in the CONFIG option.
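For example, a Kafka pipeline that should read a topic from the beginning can set this policy through the CONFIG option, in the same way as the earlier preview example. The following is a minimal sketch only; the pipeline name, topic, bootstrap server, target table, and column are placeholders, and other clauses you might need are omitted. Because the pipeline name is new, it receives a new default group.id, so the earliest policy takes effect.

CREATE PIPELINE replay_pipeline
SOURCE KAFKA
  TOPIC 'example_topic'
  BOOTSTRAP SERVERS 'broker:9092'
  CONFIG '{"auto.offset.reset": "earliest"}'
EXTRACT FORMAT CSV RECORD DELIMITER '\n'
INTO target_table
SELECT $1 AS col1;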
Kafka Pipeline Deduplication

The committed offset of a Kafka partition lags slightly behind the rows that have been loaded into Ocient. These lags do not cause an issue with data duplication. If you stop a pipeline before it can commit its most recent durable offset to Kafka, restarting the same pipeline starts loading from the last committed offset; however, the database deduplicates records sent twice for the same pipeline.

Ocient deduplicates Kafka data for the specific combination of pipeline identifier, Kafka topic, and Kafka partition number. While the consumer group offsets manage where the pipeline resumes loading, the Ocient system enforces the exactly-once loading of a Kafka partition only if you stop or restart a pipeline with the same pipeline identifier.

If you drop a pipeline and create a new one with the same name, the Ocient system creates a new pipeline identifier. This action does not deduplicate data against data loaded in the original pipeline. To preserve deduplication, instead of dropping the pipeline, use the CREATE OR REPLACE PIPELINE SQL statement with the original pipeline name, and the pipeline correctly deduplicates against the original data.

Do not run multiple pipelines concurrently with the same consumer group identifier. This action leads to unpredictable data duplication. If you want to increase the number of consumers that read from a Kafka topic, increase the value of the CORES parameter.

Load Duplicate Data on Kafka

Sometimes you might want to load the same data multiple times. If you want to load a second copy of the source data from Kafka, you can either:
- Drop the pipeline, recreate it with the same name, and reset the consumer group offsets manually.
- Create a new pipeline with a different name and load from the beginning of the topic.

Pipeline Database Dependency

Each pipeline belongs to a database. You cannot drop a database that has a running pipeline. To drop a database, ensure that all pipelines in the database are in a non-running status.

Pipeline Table Dependency

Each pipeline has a target table. You cannot drop a table that has a running pipeline. To drop a table, ensure that all pipelines that are loading data into the table are in a non-running status.

STOP PIPELINE

STOP PIPELINE stops the execution of the pipeline and its associated tasks. After you stop a pipeline, you can execute the START PIPELINE SQL statement on the pipeline to run it again. Regardless, the load deduplicates any records previously loaded in the same pipeline.

You must have the EXECUTE privilege on the pipeline to execute this SQL statement. For details, see docid\ asr8r6xqiyofgaz5qnbiw

Syntax

STOP PIPELINE pipeline_name

Parameters

pipeline_name (string): The name of the specified data pipeline.

Example

Stop an existing pipeline named ad_data_pipeline.

STOP PIPELINE ad_data_pipeline;

You can see the status of the parent tasks in the sys.tasks system catalog table and the status of the child tasks in the sys.subtasks system catalog table.
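For example, after stopping a pipeline you might inspect the task state directly in the system catalog. This is a sketch only; the catalog table names come from the paragraph above, and no specific column names are assumed.

SELECT * FROM sys.tasks;
SELECT * FROM sys.subtasks;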
ALTER PIPELINE RENAME

The ALTER PIPELINE RENAME TO SQL statement changes the name of the pipeline object while retaining its identifier, options, and other metadata. The Ocient system reflects this change in the sys.pipelines system catalog table. Afterward, you must use the new name when you refer to the pipeline in SQL statements.

You must have the ALTER privilege on the pipeline to execute this SQL statement. For details, see docid\ asr8r6xqiyofgaz5qnbiw

Syntax

ALTER PIPELINE [ IF EXISTS ] pipeline_original_name RENAME TO pipeline_new_name

Parameters

pipeline_original_name (string): The name of the existing data pipeline.
pipeline_new_name (string): The new name of the data pipeline.

Example

Rename an existing pipeline named ad_data_pipeline to renamed_pipeline.

ALTER PIPELINE ad_data_pipeline RENAME TO renamed_pipeline;

EXPORT PIPELINE

EXPORT PIPELINE returns the CREATE PIPELINE SQL statement used to create the pipeline object. You can use the output of this statement to recreate an identical pipeline if you remove the original pipeline. The execution of this statement censors sensitive S3 values, such as ACCESS KEY ID and SECRET ACCESS KEY, and Kafka password-type fields (https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html); the database replaces these values with a redacted placeholder in the output.

To execute this statement, you must have the VIEW privilege on the pipeline and on any table the pipeline targets. For details, see docid\ asr8r6xqiyofgaz5qnbiw

Syntax

EXPORT PIPELINE pipeline_name

Parameters

pipeline_name (string): The name of the specified data pipeline.

Example

Export an existing pipeline in the database named ad_data_pipeline.

EXPORT PIPELINE ad_data_pipeline;

CREATE PIPELINE FUNCTION

CREATE PIPELINE FUNCTION enables you to define a function for loading data. Define the function behavior using the {{groovy}} language. For details about this language, see https://groovy-lang.org/index.html

Function arguments and output are strongly typed and immutable. You can test the execution of your function using the PREVIEW PIPELINE SQL statement. The Ocient system does not support the overloading of function names.

Syntax

CREATE [ OR REPLACE ] PIPELINE FUNCTION [ IF NOT EXISTS ] function_name( input_argument [, ... ] )
LANGUAGE GROOVY
RETURNS output_argument_definition
IMPORTS [ library_name [, ... ] ]
AS $$
groovy_declaration
$$

Parameters

function_name (string): A unique identifier for the data pipeline function.
input_argument (string): The name of one or more input arguments of the function. Specify data types for input arguments according to the supported data types defined in docid 7s5nztl8lolpwt2pcnjyz. For the data type declaration, use NOT NULL where applicable for maximum performance.
output_argument_definition (string): The type definition of the output from the function.
library_name (string): The name of one or more {{java}} libraries. You can include libraries by using the IMPORTS clause or by specifying the fully qualified class path (e.g., java.lang.Integer) in the source definition.
groovy_declaration (string): The Groovy definition of the function.

Install and Enable Third-Party Libraries

You can use the default list of supported third-party libraries or additional third-party libraries that you install.

Supported Libraries

Data pipeline functions can import classes from the default list of supported third-party libraries. This list provides the resource for each supported library package.

java.lang: https://docs.oracle.com/javase/8/docs/api/java/lang/package-summary.html
java.util: https://docs.oracle.com/javase/8/docs/api/java/util/package-summary.html
java.nio.ByteBuffer: https://docs.oracle.com/javase/8/docs/api/java/nio/ByteBuffer.html
groovy.json: https://docs.groovy-lang.org/latest/html/gapi/groovy/json/package-summary.html
groovy.xml: https://docs.groovy-lang.org/latest/html/gapi/groovy/xml/package-summary.html
groovy.yaml: https://docs.groovy-lang.org/latest/html/api/groovy/yaml/package-summary.html
org.apache.groovy.datetime.extensions: https://docs.groovy-lang.org/latest/html/api/org/apache/groovy/datetime/extensions/package-summary.html
org.apache.groovy.dateutil.extensions: https://docs.groovy-lang.org/latest/html/api/org/apache/groovy/dateutil/extensions/package-summary.html
com.ocient.streaming.data.types: docid\ vk8kyybwfton5ax4e q1i
Additional Libraries

You can install and enable additional third-party libraries to import for use in your data pipeline functions. You must install the JAR package on all loader nodes in the /opt/ocient/current/lib/extractorengine udt folder. Then, add the fully qualified class name to the function import list as part of the library_name parameter.

For example, to reference the ByteBuffer class from the com.fastbuffer package, specify com.fastbuffer.ByteBuffer in the library_name parameter and use the class in the Groovy definition as var x = new com.fastbuffer.ByteBuffer().

Groovy Data Type Mapping

For the Groovy definition, the Ocient system maps each SQL data type to the corresponding Groovy data type. Your Groovy code should use the Groovy data type defined in this table for any input arguments and output.

SQL data type: Groovy data type
BIGINT: java.lang.Long
BINARY(n) or HASH(n): byte[]
BOOLEAN: java.lang.Boolean
CHAR(n) or VARCHAR(n): java.lang.String
DATE: java.time.LocalDate
DECIMAL(p,s): com.ocient.streaming.data.types.Decimal
DOUBLE: java.lang.Double
INT: java.lang.Integer
IPV4: java.net.Inet4Address
IP: java.net.Inet6Address
ST_POINT: com.ocient.streaming.data.types.gis.STPoint
ST_LINESTRING: com.ocient.streaming.data.types.gis.STLinestring
ST_POLYGON: com.ocient.streaming.data.types.gis.STPolygon
FLOAT: java.lang.Float
SMALLINT: java.lang.Short
TIME: com.ocient.streaming.data.types.Time
TIMESTAMP: com.ocient.streaming.data.types.Timestamp
BYTE: java.lang.Byte
TUPLE<<type1, type2, ...>>: com.ocient.streaming.data.types.OcientTuple
TYPE[]: java.util.List<TYPE>
UUID: java.util.UUID
VARBINARY(n): byte[]
VARCHAR(n): java.lang.String
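As a short illustration of this mapping, the following sketch defines a hypothetical function named label_count that receives a VARCHAR and a BIGINT; inside the Groovy body these arguments arrive as java.lang.String and java.lang.Long. The function name and logic are illustrative only and follow the CREATE PIPELINE FUNCTION syntax shown earlier.

CREATE PIPELINE FUNCTION label_count(
    name VARCHAR NOT NULL,
    total BIGINT NOT NULL)
LANGUAGE GROOVY
RETURNS VARCHAR NOT NULL
IMPORTS [ "java.lang.String", "java.lang.Long" ]
AS $$
/* name maps to java.lang.String and total maps to java.lang.Long */
return name + "=" + Long.toString(total);
$$;

Because function arguments are immutable, the function builds and returns a new string rather than modifying its inputs.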
Example

Create the sort_function data pipeline function to sort an array of integers. The function has two input arguments: value, a non-null array of integers, and ascending, the sort order. The function returns a non-null array of integers.

The function imports these Java libraries:
- java.lang.Integer
- java.util.ArrayList
- java.util.Collections
- java.util.Comparator
- java.util.List

Define the Groovy code. Because the input arguments do not change, the example Groovy code first copies the value argument, sorts the copied list according to the sort order, and returns the sorted array.

CREATE PIPELINE FUNCTION sort_function(
    value INT[] NOT NULL,
    ascending BOOLEAN NOT NULL)
LANGUAGE GROOVY
RETURNS INT[] NOT NULL
IMPORTS [
    "java.lang.Integer",
    "java.util.ArrayList",
    "java.util.Collections",
    "java.util.Comparator",
    "java.util.List" ]
AS $$
/* Make a copy of the list */
var sorted = new ArrayList<Integer>((List<Integer>)value);
/* Sort the array elements according to the specified order */
Collections.sort(sorted, ascending ? Comparator.naturalOrder() : Comparator.reverseOrder());
/* Return the sorted array */
return sorted;
$$;

View the creation information about the sort_function function using the sys.pipeline_functions system catalog table. This statement returns the function name, return type, argument names, data types of the arguments, and the list of imported libraries.

SELECT name, return_type, argument_names, argument_types, imported_libraries
FROM sys.pipeline_functions;

DROP PIPELINE FUNCTION

DROP PIPELINE FUNCTION removes an existing pipeline function.

You must have the DROP privilege on the pipeline function to execute this SQL statement. For details, see docid\ asr8r6xqiyofgaz5qnbiw

Syntax

DROP PIPELINE FUNCTION [ IF EXISTS ] function_name [, ... ]

Parameters

function_name (string): The name of the specified data pipeline function to remove. You can drop multiple pipeline functions by specifying additional function names and separating each with commas.

Examples

Remove the existing pipeline function

Remove an existing pipeline function named sort_function.

DROP PIPELINE FUNCTION sort_function;

Remove an existing pipeline function by checking for existence

Remove an existing pipeline function named sort_function, or return a warning if the Ocient system does not find the function in the database.

DROP PIPELINE FUNCTION IF EXISTS sort_function;

Related Links

docid\ xq0tg7yph vn62uwufibu
docid 5xxtimfhjnlyxs48wxsxs
docid\ yqk wibdyxiq8dxewhxhf
docid\ aimcafoydn2xf fgqssys
docid\ a3n4wkcawrpo1gtefetmm
docid\ asr8r6xqiyofgaz5qnbiw
docid 6gxqsgtokm p3roskqmam