# JSON Selectors Examples in Data Pipelines
The data pipeline functionality enables you to load data in the structured JSON data format. Use these examples to see how you can access the structure in different ways. For an overview of loading JSON data, see Data Formats for Data Pipelines and Data Pipeline Load of JSON Data from Kafka.

## Scalar Extraction from JSON

These examples load a scalar into the VARCHAR column named customer_name. Use the CREATE TABLE SQL statement to create the table with this column:

```sql
CREATE TABLE my_schema.my_table (customer_name VARCHAR NOT NULL);
```

For a JSON file with the value at the top level, the selector can refer to the key directly. Load a file with this data:

```json
{"name": "john"}
```

Use the selector $name:

```sql
CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $name AS customer_name;
```

List all top-level JSON keys used in the SELECT statement in JSON_FIELDS of the CREATE PIPELINE SQL statement.

For a JSON file that has a nested value, you need a complex selector:

```json
{"a": {"b": {"name": "john"}}}
```

In this case, the same CREATE PIPELINE statement for this file has $a.b.name instead of $name:

```sql
CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a.b.name AS customer_name;
```
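The same pattern extends to more than one target column. The following is a minimal sketch, not part of the official examples: it assumes the source records also carry a hometown key and that the pipeline SELECT list accepts multiple column expressions, as in standard SQL. The customer_hometown column is a hypothetical addition.

```sql
-- Hypothetical two-column variant of the nested-selector example.
-- Assumed input records: {"a": {"b": {"name": "john", "hometown": "chicago"}}}
CREATE TABLE my_schema.my_table (
  customer_name     VARCHAR NOT NULL,
  customer_hometown VARCHAR NOT NULL
);

CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a.b.name     AS customer_name,      -- one selector per output column
  $a.b.hometown AS customer_hometown;
```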
### Null and Empty Handling for JSON Scalars

The Ocient system handles all JSON null, empty, and missing values in the same way: the system loads these values as NULL. These values fail to load into non-nullable columns. Provide an explicit default in the pipeline or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL values. (The examples in this section assume the customer_name column is nullable.)

The JSON data contains this information:

```json
{"a": {"b": {"name": "john"}}}
{"blah": {}}
```

In this case, the output table has two rows because there are two records in the JSON source file.

```sql
CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a.b.name AS customer_name;
```

Output:

```
ocient> SELECT * FROM my_schema.my_table;
customer_name
-------------
john
NULL

Fetched 2 rows
```

When you load each of these JSON files with the same pipeline definition, the Ocient system creates a table with the same two rows.

The middle value is empty:

```json
{"a": {"b": {"name": "john"}}}
{"a": {"b": {}}}
```

The middle value is null:

```json
{"a": {"b": {"name": "john"}}}
{"a": {"b": null}}
```

The leaf value is null:

```json
{"a": {"b": {"name": "john"}}}
{"a": {"b": {"name": null}}}
```

## Array Extraction from JSON

Supported array extraction scenarios are:

- One-dimensional arrays
- Multi-dimensional arrays
- Array projection, which is the nested application of JSON selectors over the elements of arrays, for both one-dimensional and multi-dimensional arrays
- Arbitrary nesting of arrays within JSON objects, including splitting array dimensions across the JSON path
- Loading individual array elements, which can also be arrays

Mapping a transformation function over array elements is not supported.

### One-Dimensional Arrays

The next few examples load the same data into the my_table table, both directly and using array projection.

```sql
CREATE TABLE my_schema.my_table (customer_names VARCHAR[] NOT NULL);
```

Load data directly from a JSON array of strings. The JSON data contains this information:

```json
{"a": {"b": {"names": ["john", "bob", "rajiv"]}}}
```

```sql
CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a.b.names[] AS customer_names;
```

Output:

```
ocient> SELECT * FROM my_schema.my_table;
customer_names
--------------
[john, bob, rajiv]

Fetched 1 row
```

### Array Projection

Array projection is the application of the specified selector over all elements in an array, similar to a for loop. Load data using array projection into the same table using this JSON data:

```json
{
  "a": [
    {"name": "john", "hometown": "chicago"},
    {"name": "bob", "hometown": "cucamonga"},
    {"name": "rajiv", "hometown": "glendale heights"}
  ]
}
```

```sql
CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a[].name AS customer_names;
```

Output:

```
ocient> SELECT * FROM my_schema.my_table;
customer_names
--------------
[john, bob, rajiv]

Fetched 1 row
```

The output of both loading operations is the same.

Repeat this example with data in an array that contains nested objects. The JSON file contains this information:

```json
{
  "a": [
    {"b": {"name": "john", "hometown": "chicago"}},
    {"b": {"name": "bob", "hometown": "cucamonga"}},
    {"b": {"name": "rajiv", "hometown": "glendale heights"}}
  ]
}
```

```sql
CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a[].b.name AS customer_names;
```

Output:

```
ocient> SELECT * FROM my_schema.my_table;
customer_names
--------------
[john, bob, rajiv]

Fetched 1 row
```

The output is still the same.
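Projection is not limited to a single field of the array elements. The following is a sketch, not part of the official examples: it assumes a second, hypothetical array column, customer_hometowns, and applies a different nested selector to the same projected array in each SELECT expression.

```sql
-- Hypothetical variant of the projection example above: project both
-- the "name" and "hometown" keys of each element of $a into two
-- VARCHAR[] columns in the same row.
CREATE TABLE my_schema.my_table (
  customer_names     VARCHAR[] NOT NULL,
  customer_hometowns VARCHAR[] NOT NULL
);

CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a[].name     AS customer_names,
  $a[].hometown AS customer_hometowns;
```

Against the JSON data in the projection example above, this would be expected to produce one row containing [john, bob, rajiv] and [chicago, cucamonga, glendale heights].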
### Multi-Dimensional Arrays

Multi-dimensional arrays work similarly to one-dimensional arrays. These examples load the same data into the my_table table, both directly and then using array projection.

```sql
CREATE TABLE my_schema.my_table (customer_names VARCHAR[][][] NOT NULL);
```

Load data directly from a JSON array of strings. The JSON file contains this information:

```json
{
  "a": {
    "b": {
      "names": [
        [["john", "bob", "rajiv"], ["anna", "hanna", "vanna", "rosanna"]],
        [["masha", "natasha", "sasha"], ["chad", "thad", "brad"]]
      ]
    }
  }
}
```

```sql
CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a.b.names[][][] AS customer_names;
```

Load data using array projection. The JSON file contains this information:

```json
{
  "a": [
    [
      [
        {"b": {"name": "john", "hometown": "chicago"}},
        {"b": {"name": "bob", "hometown": "cucamonga"}},
        {"b": {"name": "rajiv", "hometown": "glendale heights"}}
      ],
      [
        {"b": {"name": "anna", "hometown": "chicago"}},
        {"b": {"name": "hanna", "hometown": "cucamonga"}},
        {"b": {"name": "vanna", "hometown": "glendale heights"}},
        {"b": {"name": "rosanna", "hometown": "mahwah"}}
      ]
    ],
    [
      [
        {"b": {"name": "masha", "hometown": "boston"}},
        {"b": {"name": "natasha", "hometown": "cambridge"}},
        {"b": {"name": "sasha", "hometown": "summerville"}}
      ],
      [
        {"b": {"name": "chad", "hometown": "champaign"}},
        {"b": {"name": "thad", "hometown": "urbana"}},
        {"b": {"name": "brad", "hometown": "mahomet"}}
      ]
    ]
  ]
}
```

$a[][][].b.name accesses each object, such as {"b": {"name": "john", "hometown": "chicago"}}, and uses b.name to apply this nested attribute selector on all of the elements in the inner arrays.

```sql
CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a[][][].b.name AS customer_names;
```

Output:

```
ocient> SELECT * FROM my_schema.my_table;
customer_names
--------------
[[[john, bob, rajiv], [anna, hanna, vanna, rosanna]], [[masha, natasha, sasha], [chad, thad, brad]]]

Fetched 1 row
```

The output of both loading operations is the same.

### Multi-Dimensional Arrays with Dimensions Split Across the JSON Path

To load a two-dimensional array, for example, use a one-dimensional array that contains JSON objects, each of which contains another one-dimensional array. Load the relevant parts of this JSON object into a two-dimensional array. In this example, the column has the VARCHAR[][] type and the SELECT expression is $x.a[].b.c[].name. The loaded data is a two-dimensional array of names. The JSON file contains this information:

```json
{
  "x": {
    "a": [
      {
        "b": {
          "c": [
            {"name": "john", "hometown": "chicago"},
            {"name": "bob", "hometown": "cucamonga"},
            {"name": "rajiv", "hometown": "glendale heights"}
          ]
        }
      },
      {
        "b": {
          "c": [
            {"name": "anna", "hometown": "chicago"},
            {"name": "hanna", "hometown": "cucamonga"},
            {"name": "vanna", "hometown": "glendale heights"},
            {"name": "rosanna", "hometown": "mahwah"}
          ]
        }
      }
    ]
  }
}
```

```sql
CREATE TABLE my_schema.my_table (customer_names VARCHAR[][]);

CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $x.a[].b.c[].name AS customer_names;
```

Output:

```
ocient> SELECT * FROM my_schema.my_table;
customer_names
--------------
[[john, bob, rajiv], [anna, hanna, vanna, rosanna]]

Fetched 1 row
```

The output is a two-dimensional array of names.

### Extract Data from Individual Array Elements

You can extract data from an element of a JSON array, and that element can itself be an array. In this example, extract the second element of a, which is an array. JSON array indexes start at 1 so that all array indexing in pipelines is consistent with the SQL standard. The JSON file contains this information:

```json
{
  "a": [
    {"b": [{"name": "john", "hometown": "chicago"}]},
    {
      "b": [
        {"name": "bob", "hometown": "cucamonga"},
        {"name": "rajiv", "hometown": "glendale heights"}
      ]
    },
    {"b": [{"name": "rajiv", "hometown": "glendale heights"}]}
  ]
}
```

```sql
CREATE TABLE my_schema.my_table (customer_names VARCHAR[]);

CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a[2].b[].name AS customer_names;
```

Output:

```
ocient> SELECT * FROM my_schema.my_table;
customer_names
--------------
[bob, rajiv]

Fetched 1 row
```

### Null and Empty Handling for JSON Arrays

The Ocient system handles null, empty, and missing values the same way for arrays as for scalars. The system converts a value that is null, empty, or missing to NULL and loads it as NULL. Provide an explicit default in the pipeline or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL.

## Tuples of Scalars

Data pipeline loading supports tuples of scalars in the JSON source. Tuples of arrays and tuples of tuples are not supported. The specification of tuple elements using curly braces is supported only for the basic case (i.e., $a.{b,c}); more complex selectors are not supported. The recommended way to specify tuples is to use an individual JSON selector for each tuple element, such as $a.b or $a.c. You can apply functions to tuple elements.

### Basic Tuple Construction

Load two strings into a tuple:

```json
{"users_hometowns": {"a": {"name": "bob", "hometown": "cucamonga"}}}
```

```sql
CREATE TABLE my_schema.my_table (users_hometowns TUPLE<<VARCHAR,VARCHAR>>);

CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  TUPLE<<CHAR,CHAR>>($users_hometowns.a.name, $users_hometowns.a.hometown) AS users_hometowns;
```

Output:

```
ocient> SELECT * FROM my_schema.my_table;
users_hometowns
---------------
<<bob, cucamonga>>

Fetched 1 row
```

### Applying Functions (Transformations) to Tuple Elements

In this example, load a tuple where the elements have different types, VARCHAR and TIMESTAMP, and where the second element has to be converted from a JSON string into an Ocient TIMESTAMP. The JSON file contains this information:

```json
{"event_timestamp": {"event": "my_event", "timestamp": "1980-02-03 15:16:17.12345"}}
```

```sql
CREATE TABLE my_schema.my_table (event_timestamp TUPLE<<VARCHAR,TIMESTAMP>> NOT NULL);

CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  TUPLE<<CHAR,TIMESTAMP>>($event_timestamp.event,
    TO_TIMESTAMP(CHAR($event_timestamp.timestamp), 'YYYY-MM-DD HH:MM:SS.SSSSSS')) AS event_timestamp;
```

Output:

```
ocient> SELECT * FROM my_schema.my_table;
event_timestamp
---------------
<<my_event, 1980-02-03 15:16:17.123450000>>

Fetched 1 row
```

The output table has one row of tuples.

### Null and Empty Handling for JSON Tuples

All the rules for handling null, empty, and missing elements that apply to scalars and arrays also apply to tuples. If any part of the selector is null, empty, or missing, data pipeline loading converts that value to NULL. Additionally, because you can apply functions to tuple elements (but not to array elements), you can use the NULL_IF function to convert a tuple element to NULL. For example, TUPLE<<CHAR,VARCHAR>>($a.name, NULL_IF($a.hometown, 'n/a')) indicates to the pipeline that the string 'n/a' signifies NULL for the hometown element, but not for the name element.

## Special Characters in JSON Keys

This example shows loading data where the JSON key names contain a special character. Create a table for the load using the customer_name column:

```sql
CREATE TABLE my_schema.my_table (customer_name VARCHAR NOT NULL);
```

Load a file with this data, where the JSON key name contains a special character (a space):

```json
{"first name": "john"}
```

Use double quotes for the selector $"first name" in the CREATE PIPELINE SQL statement:

```sql
CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $"first name" AS customer_name;
```

## Array Extraction Operations

Other operations exist in the source field selector syntax to flatten or compact arrays while extracting the data.

### Flatten Arrays

Flattening means that the extracted array does not increase in rank; instead, sub-arrays are concatenated. To flatten an array, use the underscore character in the array selector (e.g., $data_array[_]).

Example: This JSON data contains a nested array with two sub-arrays that hold the data [1,2,3] and [4,5,6]:

```json
{"a": {"b": [{"c": [1,2,3]}, {"c": [4,5,6]}]}}
```

Flatten the nested array with the selector $a.b[_].c[].

Output: [1,2,3,4,5,6]

This selected array has rank 1, not 2. Contrast this with the non-flattened selector $a.b[].c[], which would produce the output [[1,2,3],[4,5,6]]. You can also flatten arrays with the FLATTEN function. For details, see Transform Data in Data Pipelines. The functionality of the operator and the function is equivalent.
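As a complete sketch of the flatten operator in context, the pipeline below reuses the skeleton from the earlier examples with the JSON data shown above. The flattened_values column name and the INT[] column type are assumptions chosen for the numeric sample data, not part of the official example.

```sql
-- Sketch: load the flattened array from
-- {"a": {"b": [{"c": [1,2,3]}, {"c": [4,5,6]}]}}
-- into a single rank-1 array column.
CREATE TABLE my_schema.my_table (flattened_values INT[] NOT NULL);

CREATE PIPELINE my_pipeline
SOURCE FILESYSTEM
  FILTER 'data.json'
EXTRACT
  FORMAT JSON
INTO my_schema.my_table SELECT
  $a.b[_].c[] AS flattened_values;  -- [_] concatenates the sub-arrays
```

This row would be expected to load as [1,2,3,4,5,6]; with the non-flattened selector $a.b[].c[] and an INT[][] column, it would load as [[1,2,3],[4,5,6]]. The compact operator described in the next section slots into a pipeline in the same way.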
### Compact Arrays

Compaction eliminates NULL values from the output data. The exact bit pattern of null in the source data is source-type dependent: for JSON, null is a literal keyword that is unambiguous, while for CSV or other less defined types, the configuration determines which exact bits equate to null. To compact an array, use the exclamation mark character in the array selector (e.g., $data_array[!]).

Example: This JSON data contains a nested array with four null values:

```json
{"a": {"b": {"c": [1, null, 2, null, 3, null, null]}}}
```

Compact the nested array with the selector $a.b.c[!].

Output: [1,2,3]

You can also compact arrays with the ARRAY_COMPACT function. For details, see Transform Data in Data Pipelines. The functionality of the operator and the function is equivalent.

## Related Links

- Data Pipeline Load of Parquet Data from S3
- Data Types for Data Pipelines
- Data Formats for Data Pipelines
- Monitor Data Pipelines