Load Data
Data Formats for Data Pipelines
Loading in Ocient differs in subtle ways that depend on the data format of the source. Loading runs with a strict interpretation of source data to allow pipelines to achieve maximum performance. For text-based formats like JSON and delimited, the Ocient system performs no preemptive casting on the data when you use a source field selector. For example, with the JSON string `{ "my_field": 1234 }`, the selector `$my_field` returns the string "1234", not the integer 1234. When you use transformation functions, keep in mind that the Ocient system treats all data in the JSON and delimited formats as text data.

When the Ocient system sends data you select to a final target column, the system automatically casts the data in the final step to ensure that the data is compatible with the target column type. See docid\ jfqu osagg5enkvmeesnl for supported automatic conversion rules. Format-specific differences also appear in the pipelines.

Load ASN.1 Data

You can load data in ASN.1 (Abstract Syntax Notation One) format from binary-encoded ASN.1 files using DER-encoded or BER-encoded files. ASN.1 provides a flexible, schema-driven format commonly used in telecommunications, security, and standardized protocols. This format allows you to extract structured records and map them to relational tables using SQL.

The Ocient system requires BER and DER files to contain one or more concatenated DER-encoded or BER-encoded values with the specified record type. The system decodes each record independently and maps it into a record.

ASN.1 Type Mapping

The system automatically converts all decoded ASN.1 fields to their JSON-equivalent representations. The ASN.1 schema must consistently use implicit or explicit tagging. You must specify clear tagging so that the system can resolve field names during extraction.

[Table: ASN.1 to JSON type mapping. The table content could not be recovered from the source.]
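To make the record layout concrete, this minimal sketch decodes the tag-length-value (TLV) structure that DER uses, showing how a stream of concatenated values splits into independent records. It uses only the Python standard library and illustrates the encoding itself, not the Ocient implementation.

```python
# Minimal sketch of DER TLV (tag-length-value) decoding, illustrating why a
# file of concatenated DER values can be split into independent records.
# This is an illustration of the encoding, not the Ocient implementation.

def parse_tlv(buf: bytes, pos: int = 0):
    """Parse one DER TLV starting at pos; return (tag, value, next_pos)."""
    tag = buf[pos]
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:  # long form: low 7 bits give the byte count of the length
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    return tag, buf[pos:pos + length], pos + length

# Two concatenated DER INTEGERs (tag 0x02): 1234 and 7
stream = b"\x02\x02\x04\xd2" + b"\x02\x01\x07"
records, pos = [], 0
while pos < len(stream):
    tag, value, pos = parse_tlv(stream, pos)
    records.append(int.from_bytes(value, "big", signed=True))

print(records)  # [1234, 7]
```

Each TLV is self-delimiting, which is what lets the system decode each record independently.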
To access fields, use dot notation: `$sequence.fieldName`. For arrays (such as SEQUENCE OF), use bracket notation: `$sequenceOf[].fieldElement`. Default values and optional fields follow standard ASN.1 rules. If you omit a field, the system evaluates it as NULL.

ASN.1 Loading Example

Assume you have the `personnel_records.asn` file in the ASN.1 data format. The file contains the definition of a personnel record `Example.PersonnelRecord`. The ASN.1 file contains data for the personnel record, child, personnel name, employee number, and date.

```
Example DEFINITIONS IMPLICIT TAGS ::= BEGIN
PersonnelRecord ::= [APPLICATION 0] SET {
    name         [0] Name,
    title        [1] VisibleString,
    number       [2] EmployeeNumber,
    dateOfHire   [3] Date,
    nameOfSpouse [4] Name,
    children     [5] SEQUENCE OF ChildInformation DEFAULT {}
}
ChildInformation ::= SET {
    name        [0] Name,
    dateOfBirth [1] Date
}
Name ::= [APPLICATION 1] SEQUENCE {
    givenName  [0] VisibleString,
    initial    [1] VisibleString,
    familyName [2] VisibleString
}
EmployeeNumber ::= [APPLICATION 2] INTEGER
Date ::= [APPLICATION 3] VisibleString
END
```

Create a table to contain the personnel record. The table contains a subset of the data:

- first_name — First name
- initial — Initial of the middle name
- family_name — Last name
- title — Job title
- number — Employee number
- date_of_hire — Hire date

```sql
CREATE TABLE personnel_records(
  first_name VARCHAR NOT NULL,
  initial VARCHAR NOT NULL,
  family_name VARCHAR NOT NULL,
  title VARCHAR NOT NULL,
  number BIGINT NOT NULL,
  date_of_hire DATE NOT NULL
);
```

Create the data pipeline `personnel_pipeline` to load the personnel record into the `personnel_records` table using an AWS S3 bucket. Specify the bucket, endpoint, access key identifier, secret access key, and filter options to find the ASN.1 DER-encoded file `personnel_records.der` in the specified file path. Use the URL file path `http://cos/filepath/asn1/personnel_records.asn` and record type `Example.PersonnelRecord`. Access the `name` sequence using dot notation for the first name, middle initial, and last name fields. The pipeline definition transforms the hire date to the 'YYYYMMDD' format.

```sql
CREATE PIPELINE personnel_pipeline
SOURCE S3
  BUCKET 'misc'
  ENDPOINT 'http://cos'
  ACCESS_KEY_ID ''
  SECRET_ACCESS_KEY ''
  FILTER GLOB 'user/asn1/personnel_records.der'
EXTRACT FORMAT 'asn.1'
  SCHEMA {
    URL 'http://cos/filepath/asn1/personnel_records.asn'
    RECORD_TYPE 'Example.PersonnelRecord'
  }
INTO personnel_records SELECT
  $name.givenName AS first_name,
  $name.initial AS initial,
  $name.familyName AS family_name,
  $title AS title,
  $number AS number,
  TO_DATE($dateOfHire, 'YYYYMMDD') AS date_of_hire;
```

Load Avro Data

The Ocient system enables you to load data in the Avro format. You can use a file-based source only. Provide an inline schema definition, or use a schema configuration with schema inference from files that have embedded schemas. Field selectors in Avro follow the same format as selectors in the JSON and Parquet formats.

Inline Schema

Specify a JSON string in the Avro schema format in the INLINE option of the schema definition of the CREATE PIPELINE SQL statement. Inline schemas assume that all records follow the defined schema exactly. These schemas do not support schema evolution.

Schema Inference from a File

The system can infer the schema from a file that has embedded schemas. The file is a named Object Container File. Use the INFER_FROM option in the CREATE PIPELINE SQL statement to specify sampling one file.

Schema Evolution

When you create a data pipeline, the pipeline has a fixed target schema (specified by the SELECT clause). Individual files might have different schemas. The target schema must be backward transitive compatible with the other schemas. If the other schemas change, the system automatically attempts to fit data into the target schema. In this case, the other schemas must be forward compatible with the target schema. The system ignores changes to any unused fields from the target schema. Multiple schemas impact the performance of the data pipeline execution. For best performance, use a single schema for all data.

Avro Type Mapping

The Ocient system converts these Avro data types to Ocient SQL types. This table shows the respective conversions.

[Table: Avro to Ocient SQL type conversions. The table content could not be recovered from the source.]
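The backward and forward compatibility rules described under Schema Evolution can be sketched with a toy resolver: fields missing from a record fall back to the reader's default, and fields that the target schema does not use are ignored. The helper and schemas here are hypothetical, not the Avro library's or Ocient's resolution logic.

```python
# Toy sketch of reader/writer schema resolution: missing fields take the
# reader's default, and unknown fields are ignored. Hypothetical helper, not
# the Avro or Ocient implementation.

READER_SCHEMA = [            # (field name, default)
    ("id", None),
    ("firstname", None),
    ("birthyear", 0),        # field added later, with a default
]

def resolve(record: dict, reader_schema) -> dict:
    return {name: record.get(name, default) for name, default in reader_schema}

old = {"id": "u1", "firstname": "Ada"}                    # written before "birthyear" existed
new = {"id": "u2", "firstname": "Bob", "birthyear": 1990,
       "nickname": "bobby"}                               # extra field the reader ignores

print(resolve(old, READER_SCHEMA))  # {'id': 'u1', 'firstname': 'Ada', 'birthyear': 0}
print(resolve(new, READER_SCHEMA))  # {'id': 'u2', 'firstname': 'Bob', 'birthyear': 1990}
```

This is why a fixed target schema can accept records from evolving writer schemas, as long as the compatibility rules hold.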
Avro Loading Examples

Create the `users` table with these columns:

- id — Universally unique identifier (UUID) of the user
- firstname — First name of the user
- lastname — Last name of the user
- birthyear — Year of birth
- groups — List of groups where the user belongs

```sql
CREATE TABLE users(
  id UUID NOT NULL,
  firstname VARCHAR(255) NOT NULL,
  lastname VARCHAR(255) NOT NULL,
  birthyear INT,
  groups VARCHAR(255)[] NOT NULL DEFAULT 'char[]'
);
```

These examples use this table as the target table for the load.

Load Avro Data from Files

Assume you have user data in Avro format in multiple `.avro` files in the `/data/users` directory. Create the `users_pipeline` data pipeline for the `.avro` files containing user data. The schema configuration instructs the system to infer from one file using the INFER_FROM option. Access the array of strings for the `groups` column.

```sql
CREATE PIPELINE users_pipeline
SOURCE FILESYSTEM
  FILTER '/data/users/*.avro'
EXTRACT FORMAT AVRO
  SCHEMA { INFER_FROM 'sample_file' }
INTO users SELECT
  $id AS id,
  $firstname AS firstname,
  $lastname AS lastname,
  $birthyear AS birthyear,
  $groups[] AS groups;
```

Load Avro Data from an Inline Schema Definition

Assume you have user data in Avro format in multiple `.avro` files in the `/data/users` directory. Create the `users_pipeline` data pipeline for the `.avro` files containing user data. The schema configuration instructs the system to use an inline schema definition with the INLINE option. Access the array of strings for the `groups` column.

```sql
CREATE PIPELINE users_pipeline
SOURCE FILESYSTEM
  FILTER '/data/users/*.avro'
EXTRACT FORMAT AVRO
  SCHEMA {
    INLINE '{
      "type": "record",
      "name": "user",
      "namespace": "test.users",
      "fields": [
        { "name": "id", "type": { "type": "string", "logicalType": "uuid" } },
        { "name": "firstname", "type": "string" },
        { "name": "lastname", "type": "string" },
        { "name": "birthyear", "type": ["null", "int"], "default": null },
        { "name": "groups", "type": { "type": "array", "items": "string" } }
      ]
    }'
  }
INTO users SELECT
  $id AS id,
  $firstname AS firstname,
  $lastname AS lastname,
  $birthyear AS birthyear,
  $groups[] AS groups;
```

Load Binary Data

The Ocient system loads the binary data format using a fixed record length to split a binary stream into chunks that represent records. Each record is available in the SELECT portion of a pipeline definition using a special binary extract syntax: `$"[5,8]"`. This operates similarly to a substring function, beginning at byte 5 and taking 8 bytes from that location. The starting index is a 1-based offset, consistent with other SQL arrays and offsets. You can use this syntax to select specific bytes within a record to parse together as a unit.

Binary Selector

[Table: binary selector syntax. The table content could not be recovered from the source.]

Example: this binary selector takes 8 bytes starting at offset 11 of the fixed-width binary record. Consistent with SQL functions in the Ocient system, the first argument value 11 is the 1-based offset into the byte array.

```sql
$"[11, 8]"
```

The binary selector returns binary data, not VARCHAR. Special binary transformation functions can operate on this binary data. However, if you cast data to the VARCHAR type by using CHAR(), then functions like INT operate on this data as VARCHAR data, not binary data. When you load binary data into VARCHAR columns, the Ocient system automatically converts from binary to character data using the configured CHARSET_NAME before final loading.
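As a rough model of these semantics, the sketch below emulates the 1-based binary selector as a byte-level substring and decodes one field with the CP500 codec, mirroring the CHARSET_NAME option. The helper name and record layout are illustrative, not Ocient syntax.

```python
# Sketch of the 1-based binary selector: $"[start, length]" behaves like a
# byte-level substring. Python slices are 0-based, so the offset shifts by one.
# The helper and record layout are illustrative, not Ocient syntax.

def binary_select(record: bytes, start: int, length: int) -> bytes:
    """Emulate $"[start, length]": start is a 1-based byte offset."""
    return record[start - 1:start - 1 + length]

# A toy 24-byte record: a 20-byte CP500 (EBCDIC) name, then a 4-byte integer.
record = "John                ".encode("cp500") + b"\x00\x00\x00\x2a"

first_name = binary_select(record, 1, 20).decode("cp500").rstrip()
age = int.from_bytes(binary_select(record, 21, 4), "big", signed=True)

print(first_name, age)  # John 42
```

Note that the raw selector yields bytes; the character decode is a separate step, just as the pipeline decodes binary data only when it targets a VARCHAR column.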
The Ocient system supports special transformation functions that operate uniquely on binary data. With these functions, you can convert binary representations from mainframe systems, such as packed decimals, zoned decimals, big-endian and little-endian integers (signed and unsigned), and floating-point values. For more details, see docid\ ti3mdibvgmuudmlqu9xpl. For a complete list of supported options for the delimited and CSV data formats, see docid\ pbyszqvu5wonpgoso qto.

Binary Loading Example

If each record in your fixed-width binary schema includes these fields, you can use the substring function and the transforms shown in this example.

[Table: fixed-width binary schema fields. The table content could not be recovered from the source.]

Each record includes 62 bytes, so the RECORD_LENGTH is 62. The encoding of this file is CP500 instead of the default IBM1047 code page. The CREATE PIPELINE SQL statement specifies this encoding.

```sql
CREATE PIPELINE binary_users_pipeline
SOURCE S3
FORMAT BINARY
  RECORD_LENGTH 62
  CHARSET_NAME 'cp500'
INTO public.users SELECT
  $"[1, 20]" AS first_name,
  $"[21, 20]" AS last_name,
  INT($"[41, 4]") AS age,
  DECIMAL($"[45, 10]", 'packed', 2) AS total_spent,
  BIGINT($"[55, 8]", 'unsigned', 'little') AS user_id;
```

This SQL statement uses the binary selector to extract names and load them into the respective columns. The Ocient system automatically decodes the values using CP500 and loads them into a VARCHAR column. An explicit cast such as `CHAR($"[1, 20]") AS first_name` works equivalently.

`INT($"[41, 4]")` indicates the extraction of the four bytes that represent `age` from bytes 41-44 and casts these bytes as an integer. The INT function uses the default endianness (big) and treats the bytes as signed. Unsigned values can overflow target columns because integral types are all signed.

`DECIMAL($"[45, 10]", 'packed', 2)` extracts the 10 bytes for `total_spent` using the binary selector and converts the values using the packed decimal option for the DECIMAL cast. The cast requires specifying the number of decimal places in the source data. In this case, there are 2 decimal places, which match the number in the target column.

`BIGINT($"[55, 8]", 'unsigned', 'little')` extracts the 8 bytes that represent `user_id` using the binary selector and casts these bytes to a BIGINT while interpreting the bytes as unsigned with the little-endian representation.

Load Delimited and CSV Data

When you load data from delimited or CSV files, the Ocient system tokenizes the data during loading. The system detects records and fields in the input data during pipeline execution. You can reference fields and use them in combination with transformation functions before the system stores values in the column of a target table.

You reference fields of the source data for these formats by using a field index. The index is a number that follows the dollar sign `$`. To maintain consistency with SQL array semantics, the field indexes start at 1. Reference the first field of tokenized source records for the delimited and CSV formats as `$1`. For the binary format, `$0` represents the entire record. In this case, you must specify `$0` in combination with the substring function to extract specific bytes from the source data. For a complete list of supported options for the delimited and CSV data formats, see docid\ pbyszqvu5wonpgoso qto.
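A small sketch of this tokenization model: records split on the delimiter, and `$1` through `$N` reference tokens with 1-based indexes. The helper functions are hypothetical, not pipeline syntax.

```python
# Sketch of delimited tokenization with 1-based field references: $1 is the
# first token, matching SQL array semantics. Hypothetical helpers, not
# pipeline syntax.

def tokenize(line: str, delimiter: str = "|"):
    return line.rstrip("\n").split(delimiter)

def field(tokens, index: int) -> str:
    """Return the value of $index for a tokenized record (1-based)."""
    return tokens[index - 1]

tokens = tokenize("iphone|60607|viewed|502|293.99|[shopping,news]")
print(field(tokens, 1))  # iphone
print(field(tokens, 5))  # 293.99
print(field(tokens, 6))  # [shopping,news]
```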
Delimited Loading Example

Use this example delimited data:

```
iphone|60607|viewed|502|293.99|[shopping,news]
```

For this example row, these are the field references for each value in the row:

- $1 — iphone
- $2 — 60607
- $3 — viewed
- $4 — 502
- $5 — 293.99
- $6 — [shopping,news]

To load this data in a pipeline with the delimited data format, this CREATE PIPELINE statement specifies the `|` character for the field delimiter. This statement loads data from AWS S3. The SELECT statement uses fields 1, 2, 3, 5, and 6 of the source data. The statement specifies that the system should not load field 4 to the target table. Field 6 is an array of data matching the default array settings for delimited data. You can indicate this with the array brackets, like `$6[]`, to load into a CHAR[] typed column. The outer casting functions in this example are optional and shown for completeness. If they are omitted, the pipeline automatically casts the source fields to the target column type.

```sql
CREATE PIPELINE delimited_pipeline
SOURCE S3
FORMAT DELIMITED
  FIELD_DELIMITERS ['|']
SELECT
  CHAR($1) AS device_model,
  INT($2) AS zip,
  INT($3) AS amount,
  DOUBLE($5) AS price,
  CHAR[]($6[]) AS categories;
```

Load JSON Data

The data pipeline syntax enables the load of JSON data, including nested scalars, arrays, and points (ST_POINT).

Strict Loading and Transformations

When you use transformation functions, remember that the Ocient system treats all data in the JSON and delimited formats as text data, not the logical data type. For example, if you specify the JSON string `{ "my_timestamp": 1709208000000 }`, the selector `$my_timestamp` returns the string "1709208000000", not the integer 1709208000000. As a result, if you cast this data into a timestamp column, such as `TIMESTAMP($my_timestamp) AS created_at`, the Ocient system returns an error. The conversion fails because the cast function assumes you are specifying TIMESTAMP(VARCHAR), which expects a format like `YYYY-MM-DD HH:MM:SS[.SSSSSSSSS]`. To correct this issue, cast the value explicitly to make use of the TIMESTAMP(BIGINT) function that treats the argument as milliseconds after the epoch, as in `TIMESTAMP(BIGINT($my_timestamp)) AS created_at`.

Supported JSON Selectors

JSON selectors consist of `$` followed by a dot-separated list of JSON keys. If a key refers to an array, it is followed by a set of brackets `[]` to correspond to its dimensionality. If the square brackets contain an index, like `[1]`, then the selector refers to an array element.

The Ocient system treats JSON selectors as lowercase. To use case-sensitive selectors, you must enclose the selector in double quotation marks, for example, `$"testSelector"`. With case-sensitive selectors having multiple JSON keys, each key needs double quotation marks, for example, `$"testData"."responses"."successResponse"`.

For special characters (any identifier that starts with any character other than a letter, or contains any character that is not a letter, number, or underscore) or reserved SQL keywords (such as SELECT), you must enclose such selectors in double quotation marks. For example, if you have a JSON document `{ "test field": 123 }`, then the selector for the query should be `$"test field"`. The Ocient system does not support identifiers with a backslash as the last character in the key name.

This table shows each selector and provides its description. The cells in the last column of the table show an example for each selector: first, the cell shows example data in JSON format; then, the cell shows the use of the selector and its output after the arrow.

[Table: supported JSON selectors. The table content could not be recovered from the source.]
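The lowercase and quoting rules can be modeled as follows: unquoted keys match case-insensitively (folded to lowercase), while double-quoted keys match exactly. This simplified resolver is a sketch of the rules above, not the pipeline's parser.

```python
# Sketch of JSON selector resolution: unquoted keys are folded to lowercase,
# while double-quoted keys stay case sensitive. A simplified model, not the
# pipeline's parser.

def resolve(doc: dict, *keys: str):
    """Each key is either 'name' (folded to lowercase) or '"Name"' (exact)."""
    node = doc
    for key in keys:
        if key.startswith('"') and key.endswith('"'):
            node = node[key[1:-1]]                           # case-sensitive lookup
        else:
            node = {k.lower(): v for k, v in node.items()}[key.lower()]
    return node

doc = {"TestData": {"Responses": {"SuccessResponse": 200}}}

print(resolve(doc, '"TestData"', '"Responses"', '"SuccessResponse"'))  # 200
print(resolve(doc, "testdata", "RESPONSES", "successresponse"))        # 200
```

The second call succeeds only because of the lowercase fold; without quoting, a selector cannot distinguish keys that differ only by case.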
For more examples of using JSON selectors in data pipelines, see docid\ hlhxjwwhmbdtuouffmzkl.

Null and Empty Handling for JSON Scalars

The Ocient system handles all JSON null, empty, and missing values in the same way. The system loads these values as NULL. These values fail to load into non-nullable columns. Provide an explicit default in the pipeline using IFNULL or COALESCE, or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL values.

Null and Empty Handling for JSON Arrays

The Ocient system handles null, empty, and missing values the same way for arrays as for scalars. The system converts a value that is null, empty, or missing to NULL and loads it as NULL. Provide an explicit default in the pipeline, or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL.

Null and Empty Handling for JSON Tuples

All the rules for handling null, empty, and missing elements that apply to scalars and arrays also apply to tuples. If any part of the selector is null, empty, or missing, data pipeline loading converts that value to NULL. Additionally, because you can apply functions to tuple elements (and not array elements), you can use the NULLIF function to convert a tuple element to NULL. For example, `TUPLE<<CHAR,VARCHAR>>($a.name, NULLIF($a.hometown, 'N/A'))` indicates to the pipeline that the string 'N/A' signifies NULL for the hometown element, but not for the name element.

Load Parquet Data

The data pipeline functionality enables loading Parquet files with this configuration.

File Configuration

Files should have row groups of less than 128 MB. Larger row groups can impact memory usage during loading, and row groups of 512 MB can cause loading failures on data sets of 1 TB or more. Encoding fields in a Parquet file reduces the space of the file on disk but can impact memory usage during loading. Enable encoding on fields that you expect to have fewer than 256 unique values and for fields that contain short strings. You do not have to encode other fields.

Multiple Files

You can load row groups of multiple Parquet files in parallel. For large data sets, load the data set as multiple files. Loading files with differing schemas is not supported.

Selectors

Use selectors as you do when loading JSON data to specify the data to load. You must select a leaf element, an array, or a tuple with your selector. This is stricter than using JSON selectors, which can directly select array fields and JSON object fields.

For example, with `{"a": [1,2,3], "b": {"c": 1}}`, you can extract with any of these selectors in JSON: `$a`, `$a[]`, `$b`, `$b.c`. However, Parquet only allows the selectors `$a[]` and `$b.c`.

This example assumes this schema:

```
// List<String> (list non-null, elements nullable)
required group my_list (LIST) {
  repeated group list {
    optional binary element (utf8);
  }
}
```

The selector must be `$my_list[]`, which includes the array syntax. For details, see docid\ nc0nxlblmgajnzdzne0jp.

When you use the FORMAT PARQUET option with an AWS S3 source, the ENDPOINT option is required in the CREATE PIPELINE SQL statement.

Auto-Casting

Parquet does not support the automatic conversion to VARCHAR columns. You must explicitly cast data to the CHAR data type when you convert Parquet data that is not string data to a VARCHAR column or VARCHAR function argument.

Schema Evolution

The Ocient system supports schema evolution when you load a set of Parquet files. Specifically, if the pipeline selects a set of Parquet files where an individual file might have more or fewer columns than another, the system attempts to merge those schemas together to support loading without requiring you to create the pipeline again.

For example, the `test_table` table has three columns.

```sql
CREATE TABLE test_table (
  col_a INT NULL,
  col_b VARCHAR NULL,
  col_c VARCHAR NULL
);
```

You have two Parquet files with these schemas.

```
message file1_schema {
  optional int32 col_a;
  optional byte_array col_b (utf8);
}
message file2_schema {
  optional byte_array col_b (utf8);
  optional byte_array col_c (utf8);
}
```

However, you must specify how to handle the schema evolution within the EXTRACT SQL statement. You can choose to sample the first file only for its schema, or sample the entire data set to merge the schemas together. This DDL statement samples one file:

```sql
EXTRACT FORMAT PARQUET SCHEMA (INFER_FROM 'sample_file')
```

The disadvantage of sampling multiple files is that it can potentially take a long time (scaling with the number of files in the data set) when you execute the CREATE PIPELINE and START PIPELINE SQL statements. If you know that all of the Parquet files have the same schema, use the one-file syntax. The Ocient system does not support the case where a column within the schema changes type, for example, if `col_a` is an INT type in one file and a VARCHAR type in another. The default behavior of schema evolution infers the schema from one file. Use this syntax to infer from one file:

```sql
EXTRACT FORMAT PARQUET SCHEMA (INFER_FROM 'sample_file')
```
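The merge behavior in the `file1`/`file2` example can be sketched as a union of columns that rejects a type change, which the system does not support. This is a hypothetical helper, not the Ocient merge algorithm.

```python
# Sketch of merging per-file Parquet schemas during schema evolution: take the
# union of columns, and reject a column whose type changes between files.
# Hypothetical helper, not the Ocient merge algorithm.

def merge_schemas(*schemas: dict) -> dict:
    merged = {}
    for schema in schemas:
        for column, col_type in schema.items():
            if column in merged and merged[column] != col_type:
                raise ValueError(f"type change for {column} is not supported")
            merged[column] = col_type
    return merged

file1 = {"col_a": "int32", "col_b": "utf8"}
file2 = {"col_b": "utf8", "col_c": "utf8"}

print(merge_schemas(file1, file2))
# {'col_a': 'int32', 'col_b': 'utf8', 'col_c': 'utf8'}
```

Rows from `file1` then load NULL into `col_c`, and rows from `file2` load NULL into `col_a`, matching the nullable target table.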
Parquet Type Mapping

Parquet data types are separated into primitive and logical types. The Ocient system converts these types to Ocient SQL types. See these tables for the respective conversions.

[Table: Parquet primitive type conversions. The table content could not be recovered from the source.]

[Table: Parquet logical type conversions. The table content could not be recovered from the source.]

The INTERVAL data type is not supported. The UINT64 data type can overflow the BIGINT conversion. The DURATION data type conversion to BIGINT preserves the underlying units; for example, the number of microseconds stays as microseconds in the BIGINT data type.
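The UINT64 overflow note can be demonstrated directly: a value above 2^63 - 1 has no signed 64-bit representation, so reinterpreting the same eight bytes as a signed BIGINT flips the sign.

```python
# Demonstration of why UINT64 can overflow BIGINT: values above 2**63 - 1 have
# no signed 64-bit representation, so the same bytes read back with a sign bit.

import struct

value = 2**64 - 1                           # largest UINT64
raw = struct.pack("<Q", value)              # 8 bytes, unsigned little endian
as_signed = struct.unpack("<q", raw)[0]     # same bytes read as signed 64-bit

print(as_signed)  # -1
```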
Parquet also contains nested types that the Ocient system converts to SQL types, as shown in this table.

[Table: Parquet nested type conversions. The table content could not be recovered from the source.]

Parquet Loading Example

Create a data pipeline that loads Parquet files using an AWS S3 bucket. Specify the bucket, endpoint, access key identifier, secret access key, and filter options to find all Parquet files in the specified file path. Use the `parquet_base_table` table to store the loaded data. Retrieve the INT32, UTF8, FLOAT, DOUBLE, INT64, JSON, and BSON fields.

```sql
CREATE PIPELINE testpipeline
SOURCE S3
  BUCKET 'testbucket'
  ENDPOINT 'https://endpoint'
  ACCESS_KEY_ID ''
  SECRET_ACCESS_KEY ''
  FILTER GLOB '/data/*/2024/*/*.parquet' PREFIX '/data/orders/2024/11/'
EXTRACT FORMAT PARQUET
INTO parquet_base_table SELECT
  $int32_field AS int32_field,
  $utf8_field AS utf8_field,
  $float_field AS float_field,
  $double_field AS double_field,
  $int64_field AS int64_field,
  $json_field AS json_field,
  $bson_field AS bson_field;
```

Parquet File-Partitioned Data

With Parquet, you can load file-partitioned data from Parquet files using the file path structure. Use the filter set in the file path using the Hive naming standards. Assume files with these file paths:

```
s3://data/orders/2024/11/dt=2024-11-24/file.parquet
s3://data/orders/2024/11/dt=2024-11-25/file.parquet
s3://data/orders/2024/11/dt=2024-11-26/file.parquet
```

Load the data values in the Parquet file partitions using the METADATA function with the Hive partition syntax and the specified partition key `dt` from the file paths.

```sql
SELECT
  int32_field,
  utf8_field,
  float_field,
  double_field,
  int64_field,
  json_field,
  bson_field,
  METADATA('hive_partition', 'dt') AS file_date;
```

For details about this syntax, see docid\ vqvrmdyk8josxmkfsyprc.

Load XML Data

You can load data in XML format into the Ocient system. The system supports XML tags, basic elements, nested elements, and CDATA, but it does not support XML arrays and attributes.
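As a rough illustration of the supported constructs (nested elements and CDATA, no attributes), this sketch parses a small document similar to the XML loading example that follows, using Python's standard library rather than the pipeline's XML engine. The 1-based `substring` helper mirrors the SQL SUBSTRING function.

```python
# Illustration of nested-element and CDATA handling with the standard library.
# Not the pipeline's XML engine; the document and helper are illustrative.

import xml.etree.ElementTree as ET

doc = """<root><person>
  <name>Barbara Smith</name>
  <address><city>New York</city><zip>12345</zip></address>
  <note><![CDATA[Personal IP <b>127.0.0.1</b>]]></note>
</person></root>"""

root = ET.fromstring(doc)
city = root.findtext("person/address/city")
note = root.findtext("person/note")     # CDATA arrives as the literal string

def substring(s: str, start: int, length: int) -> str:
    """1-based substring, like the SQL SUBSTRING function."""
    return s[start - 1:start - 1 + length]

print(city, substring(note, 16, 9))  # New York 127.0.0.1
```

Because the parser returns the CDATA section as a literal string, a substring over that string can carve out the embedded IP address, which is exactly what the SQL example below does.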
Supported XML Selectors

Like JSON selectors, you can select XML data using the `$` symbol followed by a list of dot-separated keys. The system treats the selectors as lowercase. To use case-sensitive selectors, you must enclose the selector in double quotation marks, such as `$"testSelector"`. With case-sensitive selectors having multiple keys, each key needs double quotation marks, such as `$"testData"."responses"."successResponse"`. The system does not support JSON array and tuple selectors for XML.

XML Loading Example

Assume an XML file with this data:

```xml
<root>
  <person>
    <name>Barbara Smith</name>
    <address>
      <city>New York</city>
      <zip>12345</zip>
    </address>
    <note><![CDATA[Personal IP <b>127.0.0.1</b>]]></note>
  </person>
</root>
```

Create a table to contain the IP address record:

- name — Name
- city — Name of the city
- zip — ZIP code
- personal_ip — IP address

```sql
CREATE TABLE example_xml(
  name VARCHAR NOT NULL,
  city VARCHAR NOT NULL,
  zip VARCHAR NOT NULL,
  personal_ip IPV4 NOT NULL
);
```

Create a data pipeline `xml_pipeline` that loads the XML file `test.xml` using an AWS S3 bucket. Specify the bucket, endpoint, access key identifier, secret access key, and filter options to find the XML file in the specified path. Use the `example_xml` table to store the loaded data. Use JSON selectors to parse the file and the data in each XML tag. The system parses the CDATA section in the `note` tag as the literal string `Personal IP <b>127.0.0.1</b>`. Use the SUBSTRING function to extract the IP address, and then transform it into an IPV4 type with the IPV4 function. For details, see the docid\ ja8cont33tonx ktruedj and docid\ ezupx17nbsz6c4g5e7o o functions.

```sql
CREATE PIPELINE xml_pipeline
SOURCE S3
  BUCKET 'testbucket'
  ENDPOINT 'https://endpoint'
  ACCESS_KEY_ID ''
  SECRET_ACCESS_KEY ''
  FILTER GLOB '/data/text.xml'
EXTRACT FORMAT XML
INTO example_xml SELECT
  $"root"."person"."name" AS name,
  $"root"."person"."address"."city" AS city,
  $"root"."person"."address"."zip" AS zip,
  IPV4(SUBSTRING($"root"."person"."note", 16, 9)) AS personal_ip;
```

Related Links

- docid\ hlhxjwwhmbdtuouffmzkl
- docid\ pbyszqvu5wonpgoso qto
- docid\ vqvrmdyk8josxmkfsyprc