Load Data
Transform Data in Data Pipelin...
Special Data Pipeline Transformation Functions
when you load data in the {{ocient}} system, you can use special loading transformation functions in the select sql statement to transform the data on load explode outer explode outer expands a one dimensional or multidimensional array into its elements with one element per row of output from the system the array can contain a maximum of 8,192 elements, which is the limit for the default settings the function takes an array of rank n and expands the elements, returning elements of rank n 1 for example, when you apply explode outer to an array of integers, the system returns each integer in its own row when you apply explode outer to an array of arrays of integers, the system returns each array of integers in its own row you can add further transforms to these elements by nesting the statement, e g , nanos to timestamp(explode outer($int array field\[])) you can use this function with multiple columns in the same table the system holds constant columns that are not expanded also, the system combines expanded columns together in array order if multiple expanded columns have different array lengths, the system expands the record into n records where n is the maximum length columns with array lengths less than n have null values after the system expands each value limitations when you use this function with multi dimensional arrays, you cannot apply it multiple times to match the number of dimensions, for example each record must have an associated 64 bit record number, which uniquely identifies a row within its partition configure the format of this number using the record number format advanced pipeline option to modify the maximum number of elements that the array to expand can contain for details, see docid\ l8tdfpfzzvzeyabc2h7bq the system uses record numbers for deduplication, so changing the record number format option of a pipeline can change its deduplication behavior always create a new pipeline when you change this option to make rows deduplicate properly to make rows deduplicate properly when you do not set this option, always create a new pipeline when adding an explode outer statement to your pipeline or removing all such statements from your pipeline syntax explode outer(array) argument data type description array array an array to expand example create the test table table to load data with three columns where the first two columns are integers, and the third is a string create table test table ( "i" int null, "j" int null, "v" varchar(536870912) compression dynamic null ); preview the load using this inline source data in json format {"id" 1, "int array" \[1, 2, 3], "string array" \["a", "b"]} use the explode outer function to expand the elements of the arrays in the second and third json data values preview pipeline p1 source inline '{"id" 1, "int array" \[1, 2, 3], "string array" \["a", "b"]}' extract format json into test table select $"id" as i, explode outer($"int array"\[]) as j, explode outer($"string array"\[]) as v; output i j v \ 1 1 a 1 2 b 1 3 null fetched 3 rows the output shows that the inline array values are expanded across the individual rows special functions for array transformations in pipelines the data pipeline functionality supports special transformation functions for arrays that use lambda functions or function expressions to apply sql operations to individual elements in an array these functions are not supported in standard ocient sql queries the supported special functions are transform — transform individual elements in an array apply a transformation expression to every element in the array filter — filter the elements in an array apply a filter function to each element in an array and return an array of elements matching the filter function zip with — combines multiple arrays using a zipping function apply a zipping function to n wise pairs of elements from n arrays and return an array with the results from the zipping function each of these functions takes one or more arrays as input arguments and a lambda or other reference function for applying to the elements in the array a lambda function is an anonymous function that must have a specific signature and return type for the array transformation function where you use the lambda function transform transforms an array based on the logic in a lambda expression the first argument is an expression that evaluates to an array that contains the elements to transform the second argument is a lambda expression or other reference function that defines how to transform each element the second argument must be of the form (x t) > u this specified form returns a value of type u for a specified element x the lambda function specifies the type t of the element x the second argument can also be another reference function for a single argument function transform returns an array of elements with the same length as the array argument the returned array has the type returned by the lambda function for multi dimensional arrays, the transform function applies to the outermost dimension you can nest transform operations to apply them over inner dimensions syntax transform(array, function) argument data type description array t\[] the array of type t elements can be single or multi dimensional the transform function only iterates over the first dimension function lambda function (x t) > u or reference function a lambda function (e g , (elem char) > left(elem, 2) ) that takes an element of type t and returns a value of type u , which can be the same type as t or a different type, or a reference function with the correct argument list (e g , upper ) examples extract the first initial from each element in an array using a lambda function with the specified data in the pipeline source record { "data" { "names" \["jane", "charles", "kim"] }} , create a data pipeline using the transform function in the select sql statement with the lambda function (el char) > left(el,1) this function returns the first character of each element in the input array $data names\[] the example focuses on the select statement and does not show the rest of the data pipeline definition create pipeline select transform($data names\[], (el char) > left(el,1)) as first initial array output \["j", "c", "k] convert all elements in an array to lowercase using a reference function with the specified data in the pipeline source record { "data" { "names" \["jane", "joe", "jill"] }} , create a data pipeline using the transform function in the select sql statement with the lower reference function this function converts the input strings in $data names\[] to lowercase the example focuses on the select statement and does not show the rest of the data pipeline definition create pipeline select transform($data names\[], lower) as lower case name array output \["jane", "joe", "jill"] filter filters elements in an array based on the logic in a lambda expression the first argument is an expression that evaluates to an array that contains the elements to filter the second argument is a lambda expression or reference function that returns true if the system should retain the element in the returned array the second argument must be of the form (x t) > bool this form returns a specified value for the specified element x the lambda function specifies the type t of the element x the filter function includes elements in the first argument array in the resulting array when the lambda function returns true otherwise, the function excludes the elements syntax filter(array, function) argument data type description array t\[] the array of type t elements can be single or multi dimensional the filter function only iterates over the first dimension function reference function or lambda function (x t) > bool a function that takes an element of type t and returns a bool the function can be a lambda function (e g , (elem char) > length(elem) > 10 ) or a function name that has the correct argument list (e g , upper ) example with the specified data in the pipeline source record { "data" { "names" \["jane", "charles", "kim"] }} , filter the names in the input array for elements that are greater than five characters long using the lambda function (name char) > length(name) > 5 the example focuses on the select statement and does not show the rest of the data pipeline definition create pipeline select filter($data names\[], (name char) > length(name) > 5) as long names output \["charles"] zip with merges two or more input arrays, element wise, into a single array using the trailing combining function if one array is shorter than the other, the function appends nulls at the end to match the length of the longest array at the start returns an array with elements that result from applying the function to each pair of elements from array1 and array2 if the arrays are not the same length, the function pads the shorter array with null values the function returns an array of length max(length(array1), length(array2)) containing elements of type v as returned by the lambda function function must be of the form (x t, y u) > v where x is the element from array1 and y is the element from array2 the lambda function specifies the types t and u when zipping more arrays together, use the format (x t1, y t2, z t3) > v where the arguments to the lambda function align with the order of the input arrays of the zip with function syntax zip with(array1, array2 \[, ], function) argument data type description array1 t1\[] the array of type t1 elements can be single or multi dimensional the zip with function only iterates over the first dimension array2 t2\[] the array of type t2 elements can be single or multi dimensional the zip with function only iterates over the first dimension function lambda function (x t) > bool or reference function a function that takes an element of type t and returns a bool the function can be a lambda function (e g , (elem char) > length(elem) > 10 ) or a function name that has the correct argument list (e g , upper ) example with the specified data in the pipeline source record { "line items" \[{ "product name" "suede shoes", "sku" "ft 1023 blu" }, { "product name" "cheap sunglasses", "sku" "zz 7821 k" }] } , concatenate the product name $line items\[] product name strings with the sku $line items\[] sku strings as the format "name(sku)" the example focuses on the select statement and does not show the rest of the data pipeline definition create pipeline select zip with( $line items\[] product name, $line items\[] sku, (name char, sku char) > concat(name, '(', sku, ')') ) output \["suede shoes (ft 1023 blu)", "cheap sunglasses (zz 7821 k)"] lambda functions a lambda function is a temporary function that takes zero or more named arguments as a parenthesis enclosed list and returns an expression the expression can reference objects defined within the closure of the expression, which includes locally named arguments that are locally defined within an enclosing scope, field references, literals, and so on syntax define lambda functions using this syntax where t1 represents an element of type t1 and u represents the return type of the function (t1, ,tn) > u when you use this function in a sql statement, the syntax requires an argument name and the explicit designation of the type of each argument, but the return type u can be inferred for example, define a lambda function that accepts two arguments of type char and int named x and y , respectively (x char, y int) > left(x, y) examples string manipulation using a lambda function convert x to uppercase and return the first character (x char) > left(upcase(x),1) concatenate using a lambda function concatenate two strings (e g , first and last name) with a space between them (x char, y char) > concat(x, ' ', y) reference functions a reference function points to an existing function in the ocient function list specifying the function name allows you to use that function in contexts where you would otherwise need to create a full lambda function the signature of the function must match the argument structure and return value requirements examples the reference function upcase is equivalent to the lambda function (x char) > upcase(x) because the syntax of upcase accepts one argument of type char and returns a value of type char use the upcase function with the transform function to capitalize all names in an array these two sql statements are equivalent transform($names\[], upcase) transform($names\[], (n char) > upcase(n)) related links docid\ l8tdfpfzzvzeyabc2h7bq docid\ uacarixqhe493vlhudb5b docid\ l8tdfpfzzvzeyabc2h7bq docid\ goic9irkky6b99wy3cyia docid t a7tkw6 vf54mueufas