Load Data
Data Formats for Data Pipelines
Loading in Ocient differs in subtle ways that depend on the data format of the source. Loading runs with a strict interpretation of source data to allow pipelines to achieve maximum performance. For text-based formats like JSON and delimited, the Ocient system performs no preemptive casting on the data when you use a source field selector. For example, with the JSON string `{ "my_field": 1234 }`, the selector `$my_field` returns the string "1234", not the integer 1234. When you use transformation functions, keep in mind that the Ocient system treats all data in the JSON and delimited formats as text data.

When the Ocient system sends data you select to a final target column, the system automatically casts the data in the final step to ensure that the data is compatible with the target column type. See docid\ jfqu osagg5enkvmeesnl for supported automatic conversion rules. Format-specific differences also appear in the pipelines.

Load ASN.1 Data

You can load data in ASN.1 (Abstract Syntax Notation One) format from binary-encoded ASN.1 files using DER-encoded or BER-encoded files. ASN.1 provides a flexible, schema-driven format commonly used in telecommunications, security, and standardized protocols. This format allows you to extract structured records and map them to relational tables using SQL.

The Ocient system requires BER and DER files to contain one or more concatenated DER-encoded or BER-encoded values with the specified record type. The system decodes each record independently and maps it into a record.

ASN.1 Type Mapping

The system automatically converts all decoded ASN.1 fields to their JSON-equivalent representations. The ASN.1 schema must consistently use implicit or explicit tagging. You must specify clear tagging so that the system can resolve field names during extraction.

[Table: ASN.1 to JSON type mapping. The table content could not be recovered from the source.]
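To make the record layout concrete, this minimal sketch decodes the tag-length-value (TLV) structure that DER uses, showing how a stream of concatenated values splits into independent records. It uses only the Python standard library and illustrates the encoding itself, not the Ocient implementation.

```python
# Minimal sketch of DER TLV (tag-length-value) decoding, illustrating why a
# file of concatenated DER values can be split into independent records.
# This is an illustration of the encoding, not the Ocient implementation.

def parse_tlv(buf: bytes, pos: int = 0):
    """Parse one DER TLV starting at pos; return (tag, value, next_pos)."""
    tag = buf[pos]
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:  # long form: low 7 bits give the byte count of the length
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    return tag, buf[pos:pos + length], pos + length

# Two concatenated DER INTEGERs (tag 0x02): 1234 and 7
stream = b"\x02\x02\x04\xd2" + b"\x02\x01\x07"
records, pos = [], 0
while pos < len(stream):
    tag, value, pos = parse_tlv(stream, pos)
    records.append(int.from_bytes(value, "big", signed=True))

print(records)  # [1234, 7]
```

Each TLV is self-delimiting, which is what lets the system decode each record independently.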
To access fields, use dot notation: `$sequence.fieldName`. For arrays (such as SEQUENCE OF), use bracket notation: `$sequenceOf[].fieldElement`. Default values and optional fields follow standard ASN.1 rules. If you omit a field, the system evaluates it as NULL.

ASN.1 Loading Example

Assume you have the `personnel_records.asn` file in the ASN.1 data format. The file contains the definition of a personnel record `Example.PersonnelRecord`. The ASN.1 file contains data for the personnel record, child, personnel name, employee number, and date.

```
Example DEFINITIONS IMPLICIT TAGS ::= BEGIN
PersonnelRecord ::= [APPLICATION 0] SET {
    name         [0] Name,
    title        [1] VisibleString,
    number       [2] EmployeeNumber,
    dateOfHire   [3] Date,
    nameOfSpouse [4] Name,
    children     [5] SEQUENCE OF ChildInformation DEFAULT {}
}
ChildInformation ::= SET {
    name        [0] Name,
    dateOfBirth [1] Date
}
Name ::= [APPLICATION 1] SEQUENCE {
    givenName  [0] VisibleString,
    initial    [1] VisibleString,
    familyName [2] VisibleString
}
EmployeeNumber ::= [APPLICATION 2] INTEGER
Date ::= [APPLICATION 3] VisibleString
END
```

Create a table to contain the personnel record. The table contains a subset of the data:

- first_name — First name
- initial — Initial of the middle name
- family_name — Last name
- title — Job title
- number — Employee number
- date_of_hire — Hire date

```sql
CREATE TABLE personnel_records(
  first_name VARCHAR NOT NULL,
  initial VARCHAR NOT NULL,
  family_name VARCHAR NOT NULL,
  title VARCHAR NOT NULL,
  number BIGINT NOT NULL,
  date_of_hire DATE NOT NULL
);
```

Create the data pipeline `personnel_pipeline` to load the personnel record into the `personnel_records` table using an AWS S3 bucket. Specify the bucket, endpoint, access key identifier, secret access key, and filter options to find the ASN.1 DER-encoded file `personnel_records.der` in the specified file path. Use the URL file path `http://cos/filepath/asn1/personnel_records.asn` and record type `Example.PersonnelRecord`. Access the `name` sequence using dot notation for the first name, middle initial, and last name fields. The pipeline definition transforms the hire date to the 'YYYYMMDD' format.

```sql
CREATE PIPELINE personnel_pipeline
SOURCE S3
  BUCKET 'misc'
  ENDPOINT 'http://cos'
  ACCESS_KEY_ID ''
  SECRET_ACCESS_KEY ''
  FILTER GLOB 'user/asn1/personnel_records.der'
EXTRACT FORMAT 'asn.1'
  SCHEMA {
    URL 'http://cos/filepath/asn1/personnel_records.asn'
    RECORD_TYPE 'Example.PersonnelRecord'
  }
INTO personnel_records SELECT
  $name.givenName AS first_name,
  $name.initial AS initial,
  $name.familyName AS family_name,
  $title AS title,
  $number AS number,
  TO_DATE($dateOfHire, 'YYYYMMDD') AS date_of_hire;
```

Load Avro Data

The Ocient system enables you to load data in the Avro format. You can use a file-based source only. Provide an inline schema definition, or use a schema configuration with schema inference from files that have embedded schemas. Field selectors in Avro follow the same format as selectors in the JSON and Parquet formats.

Inline Schema

Specify a JSON string in the Avro schema format in the INLINE option of the schema definition of the CREATE PIPELINE SQL statement. Inline schemas assume that all records follow the defined schema exactly. These schemas do not support schema evolution.

Schema Inference from a File

The system can infer the schema from a file that has embedded schemas. The file is a named Object Container File. Use the INFER_FROM option in the CREATE PIPELINE SQL statement to specify sampling one file.

Schema Evolution

When you create a data pipeline, the pipeline has a fixed target schema (specified by the SELECT clause). Individual files might have different schemas. The target schema must be backward transitive compatible with the other schemas. If the other schemas change, the system automatically attempts to fit data into the target schema. In this case, the other schemas must be forward compatible with the target schema. The system ignores changes to any unused fields from the target schema. Multiple schemas impact the performance of the data pipeline execution. For best performance, use a single schema for all data.

Avro Type Mapping

The Ocient system converts these Avro data types to Ocient SQL types. This table shows the respective conversions.

[Table: Avro to Ocient SQL type conversions. The table content could not be recovered from the source.]
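The backward and forward compatibility rules described under Schema Evolution can be sketched with a toy resolver: fields missing from a record fall back to the reader's default, and fields that the target schema does not use are ignored. The helper and schemas here are hypothetical, not the Avro library's or Ocient's resolution logic.

```python
# Toy sketch of reader/writer schema resolution: missing fields take the
# reader's default, and unknown fields are ignored. Hypothetical helper, not
# the Avro or Ocient implementation.

READER_SCHEMA = [            # (field name, default)
    ("id", None),
    ("firstname", None),
    ("birthyear", 0),        # field added later, with a default
]

def resolve(record: dict, reader_schema) -> dict:
    return {name: record.get(name, default) for name, default in reader_schema}

old = {"id": "u1", "firstname": "Ada"}                    # written before "birthyear" existed
new = {"id": "u2", "firstname": "Bob", "birthyear": 1990,
       "nickname": "bobby"}                               # extra field the reader ignores

print(resolve(old, READER_SCHEMA))  # {'id': 'u1', 'firstname': 'Ada', 'birthyear': 0}
print(resolve(new, READER_SCHEMA))  # {'id': 'u2', 'firstname': 'Bob', 'birthyear': 1990}
```

This is why a fixed target schema can accept records from evolving writer schemas, as long as the compatibility rules hold.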
Avro Loading Examples

Create the `users` table with these columns:

- id — Universally unique identifier (UUID) of the user
- firstname — First name of the user
- lastname — Last name of the user
- birthyear — Year of birth
- groups — List of groups where the user belongs

```sql
CREATE TABLE users(
  id UUID NOT NULL,
  firstname VARCHAR(255) NOT NULL,
  lastname VARCHAR(255) NOT NULL,
  birthyear INT,
  groups VARCHAR(255)[] NOT NULL DEFAULT 'char[]'
);
```

These examples use this table as the target table for the load.

Load Avro Data from Files

Assume you have user data in Avro format in multiple `.avro` files in the `/data/users` directory. Create the `users_pipeline` data pipeline for the `.avro` files containing user data. The schema configuration instructs the system to infer from one file using the INFER_FROM option. Access the array of strings for the `groups` column.

```sql
CREATE PIPELINE users_pipeline
SOURCE FILESYSTEM
  FILTER '/data/users/*.avro'
EXTRACT FORMAT AVRO
  SCHEMA { INFER_FROM 'sample_file' }
INTO users SELECT
  $id AS id,
  $firstname AS firstname,
  $lastname AS lastname,
  $birthyear AS birthyear,
  $groups[] AS groups;
```

Load Avro Data from an Inline Schema Definition

Assume you have user data in Avro format in multiple `.avro` files in the `/data/users` directory. Create the `users_pipeline` data pipeline for the `.avro` files containing user data. The schema configuration instructs the system to use an inline schema definition with the INLINE option. Access the array of strings for the `groups` column.

```sql
CREATE PIPELINE users_pipeline
SOURCE FILESYSTEM
  FILTER '/data/users/*.avro'
EXTRACT FORMAT AVRO
  SCHEMA {
    INLINE '{
      "type": "record",
      "name": "user",
      "namespace": "test.users",
      "fields": [
        { "name": "id", "type": { "type": "string", "logicalType": "uuid" } },
        { "name": "firstname", "type": "string" },
        { "name": "lastname", "type": "string" },
        { "name": "birthyear", "type": ["null", "int"], "default": null },
        { "name": "groups", "type": { "type": "array", "items": "string" } }
      ]
    }'
  }
INTO users SELECT
  $id AS id,
  $firstname AS firstname,
  $lastname AS lastname,
  $birthyear AS birthyear,
  $groups[] AS groups;
```

Load Binary Data

The Ocient system loads the binary data format using a fixed record length to split a binary stream into chunks that represent records. Each record is available in the SELECT portion of a pipeline definition using a special binary extract syntax: `$"[5,8]"`. This operates similarly to a substring function, beginning at byte 5 and taking 8 bytes from that location. The starting index is a 1-based offset, consistent with other SQL arrays and offsets. You can use this syntax to select specific bytes within a record to parse together as a unit.

Binary Selector

[Table: binary selector syntax. The table content could not be recovered from the source.]

Example: this binary selector takes 8 bytes starting at offset 11 of the fixed-width binary record. Consistent with SQL functions in the Ocient system, the first argument value 11 is the 1-based offset into the byte array.

```sql
$"[11, 8]"
```

The binary selector returns binary data, not VARCHAR. Special binary transformation functions can operate on this binary data. However, if you cast data to the VARCHAR type by using CHAR(), then functions like INT operate on this data as VARCHAR data, not binary data. When you load binary data into VARCHAR columns, the Ocient system automatically converts from binary to character data using the configured CHARSET_NAME before final loading.
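As a rough model of these semantics, the sketch below emulates the 1-based binary selector as a byte-level substring and decodes one field with the CP500 codec, mirroring the CHARSET_NAME option. The helper name and record layout are illustrative, not Ocient syntax.

```python
# Sketch of the 1-based binary selector: $"[start, length]" behaves like a
# byte-level substring. Python slices are 0-based, so the offset shifts by one.
# The helper and record layout are illustrative, not Ocient syntax.

def binary_select(record: bytes, start: int, length: int) -> bytes:
    """Emulate $"[start, length]": start is a 1-based byte offset."""
    return record[start - 1:start - 1 + length]

# A toy 24-byte record: a 20-byte CP500 (EBCDIC) name, then a 4-byte integer.
record = "John                ".encode("cp500") + b"\x00\x00\x00\x2a"

first_name = binary_select(record, 1, 20).decode("cp500").rstrip()
age = int.from_bytes(binary_select(record, 21, 4), "big", signed=True)

print(first_name, age)  # John 42
```

Note that the raw selector yields bytes; the character decode is a separate step, just as the pipeline decodes binary data only when it targets a VARCHAR column.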
The Ocient system supports special transformation functions that operate uniquely on binary data. With these functions, you can convert binary representations from mainframe systems, such as packed decimals, zoned decimals, big-endian and little-endian integers (signed and unsigned), and floating-point values. For more details, see docid\ ti3mdibvgmuudmlqu9xpl. For a complete list of supported options for the delimited and CSV data formats, see docid\ pbyszqvu5wonpgoso qto.

Binary Loading Example

If each record in your fixed-width binary schema includes these fields, you can use the substring function and the transforms shown in this example.

[Table: fixed-width binary schema fields. The table content could not be recovered from the source.]

Each record includes 62 bytes, so the RECORD_LENGTH is 62. The encoding of this file is CP500 instead of the default IBM1047 code page. The CREATE PIPELINE SQL statement specifies this encoding.

```sql
CREATE PIPELINE binary_users_pipeline
SOURCE S3
FORMAT BINARY
  RECORD_LENGTH 62
  CHARSET_NAME 'cp500'
INTO public.users SELECT
  $"[1, 20]" AS first_name,
  $"[21, 20]" AS last_name,
  INT($"[41, 4]") AS age,
  DECIMAL($"[45, 10]", 'packed', 2) AS total_spent,
  BIGINT($"[55, 8]", 'unsigned', 'little') AS user_id;
```

This SQL statement uses the binary selector to extract names and load them into the respective columns. The Ocient system automatically decodes the values using CP500 and loads them into a VARCHAR column. An explicit cast such as `CHAR($"[1, 20]") AS first_name` works equivalently.

`INT($"[41, 4]")` indicates the extraction of the four bytes that represent `age` from bytes 41-44 and casts these bytes as an integer. The INT function uses the default endianness (big) and treats the bytes as signed. Unsigned values can overflow target columns because integral types are all signed.

`DECIMAL($"[45, 10]", 'packed', 2)` extracts the 10 bytes for `total_spent` using the binary selector and converts the values using the packed decimal option for the DECIMAL cast. The cast requires specifying the number of decimal places in the source data. In this case, there are 2 decimal places, which match the number in the target column.

`BIGINT($"[55, 8]", 'unsigned', 'little')` extracts the 8 bytes that represent `user_id` using the binary selector and casts these bytes to a BIGINT while interpreting the bytes as unsigned with the little-endian representation.

Load Delimited and CSV Data

When you load data from delimited or CSV files, the Ocient system tokenizes the data during loading. The system detects records and fields in the input data during pipeline execution. You can reference fields and use them in combination with transformation functions before the system stores values in the column of a target table.

You reference fields of the source data for these formats by using a field index. The index is a number that follows the dollar sign `$`. To maintain consistency with SQL array semantics, the field indexes start at 1. Reference the first field of tokenized source records for the delimited and CSV formats as `$1`. For the binary format, `$0` represents the entire record. In this case, you must specify `$0` in combination with the substring function to extract specific bytes from the source data. For a complete list of supported options for the delimited and CSV data formats, see docid\ pbyszqvu5wonpgoso qto.
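A small sketch of this tokenization model: records split on the delimiter, and `$1` through `$N` reference tokens with 1-based indexes. The helper functions are hypothetical, not pipeline syntax.

```python
# Sketch of delimited tokenization with 1-based field references: $1 is the
# first token, matching SQL array semantics. Hypothetical helpers, not
# pipeline syntax.

def tokenize(line: str, delimiter: str = "|"):
    return line.rstrip("\n").split(delimiter)

def field(tokens, index: int) -> str:
    """Return the value of $index for a tokenized record (1-based)."""
    return tokens[index - 1]

tokens = tokenize("iphone|60607|viewed|502|293.99|[shopping,news]")
print(field(tokens, 1))  # iphone
print(field(tokens, 5))  # 293.99
print(field(tokens, 6))  # [shopping,news]
```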
Delimited Loading Example

Use this example delimited data:

```
iphone|60607|viewed|502|293.99|[shopping,news]
```

For this example row, these are the field references for each value in the row:

- $1 — iphone
- $2 — 60607
- $3 — viewed
- $4 — 502
- $5 — 293.99
- $6 — [shopping,news]

To load this data in a pipeline with the delimited data format, this CREATE PIPELINE statement specifies the `|` character for the field delimiter. This statement loads data from AWS S3. The SELECT statement uses fields 1, 2, 3, 5, and 6 of the source data. The statement specifies that the system should not load field 4 to the target table. Field 6 is an array of data matching the default array settings for delimited data. You can indicate this with the array brackets, like `$6[]`, to load into a CHAR[] typed column. The outer casting functions in this example are optional and shown for completeness. If they are omitted, the pipeline automatically casts the source fields to the target column type.

```sql
CREATE PIPELINE delimited_pipeline
SOURCE S3
FORMAT DELIMITED
  FIELD_DELIMITERS ['|']
SELECT
  CHAR($1) AS device_model,
  INT($2) AS zip,
  INT($3) AS amount,
  DOUBLE($5) AS price,
  CHAR[]($6[]) AS categories;
```

Load JSON Data

The data pipeline syntax enables the load of JSON data, including nested scalars, arrays, and points (ST_POINT).

Strict Loading and Transformations

When you use transformation functions, remember that the Ocient system treats all data in the JSON and delimited formats as text data, not the logical data type. For example, if you specify the JSON string `{ "my_timestamp": 1709208000000 }`, the selector `$my_timestamp` returns the string "1709208000000", not the integer 1709208000000. As a result, if you cast this data into a timestamp column, such as `TIMESTAMP($my_timestamp) AS created_at`, the Ocient system returns an error. The conversion fails because the cast function assumes you are specifying TIMESTAMP(VARCHAR), which expects a format like `YYYY-MM-DD HH:MM:SS[.SSSSSSSSS]`. To correct this issue, cast the value explicitly to make use of the TIMESTAMP(BIGINT) function that treats the argument as milliseconds after the epoch, as in `TIMESTAMP(BIGINT($my_timestamp)) AS created_at`.

Supported JSON Selectors

JSON selectors consist of `$` followed by a dot-separated list of JSON keys. If a key refers to an array, it is followed by a set of brackets `[]` to correspond to its dimensionality. If the square brackets contain an index, like `[1]`, then the selector refers to an array element.

The Ocient system treats JSON selectors as lowercase. To use case-sensitive selectors, you must enclose the selector in double quotation marks, for example, `$"testSelector"`. With case-sensitive selectors having multiple JSON keys, each key needs double quotation marks, for example, `$"testData"."responses"."successResponse"`.

For special characters (any identifier that starts with any character other than a letter, or contains any character that is not a letter, number, or underscore) or reserved SQL keywords (such as SELECT), you must enclose such selectors in double quotation marks. For example, if you have a JSON document `{ "test field": 123 }`, then the selector for the query should be `$"test field"`. The Ocient system does not support identifiers with a backslash as the last character in the key name.

This table shows each selector and provides its description. The cells in the last column of the table show an example for each selector: first, the cell shows example data in JSON format; then, the cell shows the use of the selector and its output after the arrow.

[Table: supported JSON selectors. The table content could not be recovered from the source.]
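The lowercase and quoting rules can be modeled as follows: unquoted keys match case-insensitively (folded to lowercase), while double-quoted keys match exactly. This simplified resolver is a sketch of the rules above, not the pipeline's parser.

```python
# Sketch of JSON selector resolution: unquoted keys are folded to lowercase,
# while double-quoted keys stay case sensitive. A simplified model, not the
# pipeline's parser.

def resolve(doc: dict, *keys: str):
    """Each key is either 'name' (folded to lowercase) or '"Name"' (exact)."""
    node = doc
    for key in keys:
        if key.startswith('"') and key.endswith('"'):
            node = node[key[1:-1]]                           # case-sensitive lookup
        else:
            node = {k.lower(): v for k, v in node.items()}[key.lower()]
    return node

doc = {"TestData": {"Responses": {"SuccessResponse": 200}}}

print(resolve(doc, '"TestData"', '"Responses"', '"SuccessResponse"'))  # 200
print(resolve(doc, "testdata", "RESPONSES", "successresponse"))        # 200
```

The second call succeeds only because of the lowercase fold; without quoting, a selector cannot distinguish keys that differ only by case.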
For more examples of using JSON selectors in data pipelines, see docid\ hlhxjwwhmbdtuouffmzkl.

Null and Empty Handling for JSON Scalars

The Ocient system handles all JSON null, empty, and missing values in the same way. The system loads these values as NULL. These values fail to load into non-nullable columns. Provide an explicit default in the pipeline using IFNULL or COALESCE, or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL values.

Null and Empty Handling for JSON Arrays

The Ocient system handles null, empty, and missing values the same way for arrays as for scalars. The system converts a value that is null, empty, or missing to NULL and loads it as NULL. Provide an explicit default in the pipeline, or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL.

Null and Empty Handling for JSON Tuples

All the rules for handling null, empty, and missing elements that apply to scalars and arrays also apply to tuples. If any part of the selector is null, empty, or missing, data pipeline loading converts that value to NULL. Additionally, because you can apply functions to tuple elements (and not array elements), you can use the NULLIF function to convert a tuple element to NULL. For example, `TUPLE<<CHAR,VARCHAR>>($a.name, NULLIF($a.hometown, 'N/A'))` indicates to the pipeline that the string 'N/A' signifies NULL for the hometown element, but not for the name element.

Load Parquet Data

The data pipeline functionality enables loading Parquet files with this configuration.

File Configuration

Files should have row groups of less than 128 MB. Larger row groups can impact memory usage during loading, and row groups of 512 MB can cause loading failures on data sets of 1 TB or more. Encoding fields in a Parquet file reduces the space of the file on disk but can impact memory usage during loading. Enable encoding on fields that you expect to have fewer than 256 unique values and for fields that contain short strings. You do not have to encode other fields.

Multiple Files

You can load row groups of multiple Parquet files in parallel. For large data sets, load the data set as multiple files. Loading files with differing schemas is not supported.

Selectors

Use selectors as you do when loading JSON data to specify the data to load. You must select a leaf element, an array, or a tuple with your selector. This is stricter than using JSON selectors, which can directly select array fields and JSON object fields.

For example, with `{"a": [1,2,3], "b": {"c": 1}}`, you can extract with any of these selectors in JSON: `$a`, `$a[]`, `$b`, `$b.c`. However, Parquet only allows the selectors `$a[]` and `$b.c`.

This example assumes this schema:

```
// List<String> (list non-null, elements nullable)
required group my_list (LIST) {
  repeated group list {
    optional binary element (utf8);
  }
}
```

The selector must be `$my_list[]`, which includes the array syntax. For details, see docid\ nc0nxlblmgajnzdzne0jp.

When you use the FORMAT PARQUET option with an AWS S3 source, the ENDPOINT option is required in the CREATE PIPELINE SQL statement.

Auto-Casting

Parquet does not support the automatic conversion to VARCHAR columns. You must explicitly cast data to the CHAR data type when you convert Parquet data that is not string data to a VARCHAR column or VARCHAR function argument.

Schema Evolution

The Ocient system supports schema evolution when you load a set of Parquet files. Specifically, if the pipeline selects a set of Parquet files where an individual file might have more or fewer columns than another, the system attempts to merge those schemas together to support loading without requiring you to create the pipeline again.

For example, the `test_table` table has three columns.

```sql
CREATE TABLE test_table (
  col_a INT NULL,
  col_b VARCHAR NULL,
  col_c VARCHAR NULL
);
```

You have two Parquet files with these schemas.

```
message file1_schema {
  optional int32 col_a;
  optional byte_array col_b (utf8);
}
message file2_schema {
  optional byte_array col_b (utf8);
  optional byte_array col_c (utf8);
}
```

However, you must specify how to handle the schema evolution within the EXTRACT SQL statement. You can choose to sample the first file only for its schema, or sample the entire data set to merge the schemas together. This DDL statement samples one file:

```sql
EXTRACT FORMAT PARQUET SCHEMA (INFER_FROM 'sample_file')
```

The disadvantage of sampling multiple files is that it can potentially take a long time (scaling with the number of files in the data set) when you execute the CREATE PIPELINE and START PIPELINE SQL statements. If you know that all of the Parquet files have the same schema, use the one-file syntax. The Ocient system does not support the case where a column within the schema changes type, for example, if `col_a` is an INT type in one file and a VARCHAR type in another. The default behavior of schema evolution infers the schema from one file. Use this syntax to infer from one file:

```sql
EXTRACT FORMAT PARQUET SCHEMA (INFER_FROM 'sample_file')
```
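The merge behavior in the `file1`/`file2` example can be sketched as a union of columns that rejects a type change, which the system does not support. This is a hypothetical helper, not the Ocient merge algorithm.

```python
# Sketch of merging per-file Parquet schemas during schema evolution: take the
# union of columns, and reject a column whose type changes between files.
# Hypothetical helper, not the Ocient merge algorithm.

def merge_schemas(*schemas: dict) -> dict:
    merged = {}
    for schema in schemas:
        for column, col_type in schema.items():
            if column in merged and merged[column] != col_type:
                raise ValueError(f"type change for {column} is not supported")
            merged[column] = col_type
    return merged

file1 = {"col_a": "int32", "col_b": "utf8"}
file2 = {"col_b": "utf8", "col_c": "utf8"}

print(merge_schemas(file1, file2))
# {'col_a': 'int32', 'col_b': 'utf8', 'col_c': 'utf8'}
```

Rows from `file1` then load NULL into `col_c`, and rows from `file2` load NULL into `col_a`, matching the nullable target table.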
Parquet Type Mapping

Parquet data types are separated into primitive and logical types. The Ocient system converts these types to Ocient SQL types. See these tables for the respective conversions.

[Table: Parquet primitive type conversions. The table content could not be recovered from the source.]

[Table: Parquet logical type conversions. The table content could not be recovered from the source.]

The INTERVAL data type is not supported. The UINT64 data type can overflow the BIGINT conversion. The DURATION data type conversion to BIGINT preserves the underlying units; for example, the number of microseconds stays as microseconds in the BIGINT data type.
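The UINT64 overflow note can be demonstrated directly: a value above 2^63 - 1 has no signed 64-bit representation, so reinterpreting the same eight bytes as a signed BIGINT flips the sign.

```python
# Demonstration of why UINT64 can overflow BIGINT: values above 2**63 - 1 have
# no signed 64-bit representation, so the same bytes read back with a sign bit.

import struct

value = 2**64 - 1                           # largest UINT64
raw = struct.pack("<Q", value)              # 8 bytes, unsigned little endian
as_signed = struct.unpack("<q", raw)[0]     # same bytes read as signed 64-bit

print(as_signed)  # -1
```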
Parquet also contains nested types that the Ocient system converts to SQL types, as shown in this table.

[Table: Parquet nested type conversions. The table content could not be recovered from the source.]

Parquet Loading Example

Create a data pipeline that loads Parquet files using an AWS S3 bucket. Specify the bucket, endpoint, access key identifier, secret access key, and filter options to find all Parquet files in the specified file path. Use the `parquet_base_table` table to store the loaded data. Retrieve the INT32, UTF8, FLOAT, DOUBLE, INT64, JSON, and BSON fields.

```sql
CREATE PIPELINE testpipeline
SOURCE S3
  BUCKET 'testbucket'
  ENDPOINT 'https://endpoint'
  ACCESS_KEY_ID ''
  SECRET_ACCESS_KEY ''
  FILTER GLOB '/data/*/2024/*/*.parquet' PREFIX '/data/orders/2024/11/'
EXTRACT FORMAT PARQUET
INTO parquet_base_table SELECT
  $int32_field AS int32_field,
  $utf8_field AS utf8_field,
  $float_field AS float_field,
  $double_field AS double_field,
  $int64_field AS int64_field,
  $json_field AS json_field,
  $bson_field AS bson_field;
```

Parquet File-Partitioned Data

With Parquet, you can load file-partitioned data from Parquet files using the file path structure. Use the filter set in the file path using the Hive naming standards. Assume files with these file paths:

```
s3://data/orders/2024/11/dt=2024-11-24/file.parquet
s3://data/orders/2024/11/dt=2024-11-25/file.parquet
s3://data/orders/2024/11/dt=2024-11-26/file.parquet
```

Load the data values in the Parquet file partitions using the METADATA function with the Hive partition syntax and the specified partition key `dt` from the file paths.

```sql
SELECT
  int32_field,
  utf8_field,
  float_field,
  double_field,
  int64_field,
  json_field,
  bson_field,
  METADATA('hive_partition', 'dt') AS file_date;
```

For details about this syntax, see docid\ vqvrmdyk8josxmkfsyprc.

Load XML Data

You can load data in XML format into the Ocient system. The system supports XML tags, basic elements, nested elements, and CDATA, but it does not support XML arrays and attributes.
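As a rough illustration of the supported constructs (nested elements and CDATA, no attributes), this sketch parses a small document similar to the XML loading example that follows, using Python's standard library rather than the pipeline's XML engine. The 1-based `substring` helper mirrors the SQL SUBSTRING function.

```python
# Illustration of nested-element and CDATA handling with the standard library.
# Not the pipeline's XML engine; the document and helper are illustrative.

import xml.etree.ElementTree as ET

doc = """<root><person>
  <name>Barbara Smith</name>
  <address><city>New York</city><zip>12345</zip></address>
  <note><![CDATA[Personal IP <b>127.0.0.1</b>]]></note>
</person></root>"""

root = ET.fromstring(doc)
city = root.findtext("person/address/city")
note = root.findtext("person/note")     # CDATA arrives as the literal string

def substring(s: str, start: int, length: int) -> str:
    """1-based substring, like the SQL SUBSTRING function."""
    return s[start - 1:start - 1 + length]

print(city, substring(note, 16, 9))  # New York 127.0.0.1
```

Because the parser returns the CDATA section as a literal string, a substring over that string can carve out the embedded IP address, which is exactly what the SQL example below does.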
Supported XML Selectors

Like JSON selectors, you can select XML data using the `$` symbol followed by a list of dot-separated keys. The system treats the selectors as lowercase. To use case-sensitive selectors, you must enclose the selector in double quotation marks, such as `$"testSelector"`. With case-sensitive selectors having multiple keys, each key needs double quotation marks, such as `$"testData"."responses"."successResponse"`. The system does not support JSON array and tuple selectors for XML.

XML Loading Example

Assume an XML file with this data:

```xml
<root>
  <person>
    <name>Barbara Smith</name>
    <address>
      <city>New York</city>
      <zip>12345</zip>
    </address>
    <note><![CDATA[Personal IP <b>127.0.0.1</b>]]></note>
  </person>
</root>
```

Create a table to contain the IP address record:

- name — Name
- city — Name of the city
- zip — ZIP code
- personal_ip — IP address

```sql
CREATE TABLE example_xml(
  name VARCHAR NOT NULL,
  city VARCHAR NOT NULL,
  zip VARCHAR NOT NULL,
  personal_ip IPV4 NOT NULL
);
```

Create a data pipeline `xml_pipeline` that loads the XML file `test.xml` using an AWS S3 bucket. Specify the bucket, endpoint, access key identifier, secret access key, and filter options to find the XML file in the specified path. Use the `example_xml` table to store the loaded data. Use JSON selectors to parse the file and the data in each XML tag. The system parses the CDATA section in the `note` tag as the literal string `Personal IP <b>127.0.0.1</b>`. Use the SUBSTRING function to extract the IP address, and then transform it into an IPV4 type with the IPV4 function. For details, see the docid\ ja8cont33tonx ktruedj and docid\ ezupx17nbsz6c4g5e7o o functions.

```sql
CREATE PIPELINE xml_pipeline
SOURCE S3
  BUCKET 'testbucket'
  ENDPOINT 'https://endpoint'
  ACCESS_KEY_ID ''
  SECRET_ACCESS_KEY ''
  FILTER GLOB '/data/text.xml'
EXTRACT FORMAT XML
INTO example_xml SELECT
  $"root"."person"."name" AS name,
  $"root"."person"."address"."city" AS city,
  $"root"."person"."address"."zip" AS zip,
  IPV4(SUBSTRING($"root"."person"."note", 16, 9)) AS personal_ip;
```

Related Links

- docid\ hlhxjwwhmbdtuouffmzkl
- docid\ pbyszqvu5wonpgoso qto
- docid\ vqvrmdyk8josxmkfsyprc