Special Data Pipeline Transformation Functions
When you load data in the System, you can use special loading transformation functions in the SELECT SQL statement to transform the data on load.
EXPLODE_OUTER expands a one-dimensional or multidimensional array into its elements with one element per row of output from the system. The array can contain a maximum of 8,192 elements, which is the limit for the default settings.
The function takes an array of rank N and expands the elements, returning elements of rank N-1. For example, when you apply EXPLODE_OUTER to an array of integers, the system returns each integer in its own row. When you apply EXPLODE_OUTER to an array of arrays of integers, the system returns each array of integers in its own row. You can add further transforms to these elements by nesting the statement, e.g., NANOS_TO_TIMESTAMP(EXPLODE_OUTER($int_array_field[])).
You can use this function with multiple columns in the same table. The system holds constant columns that are not expanded. Also, the system combines expanded columns together in array order. If multiple expanded columns have different array lengths, the system expands the record into N records where N is the maximum length. Columns with array lengths less than N have NULL values after the system expands each value.
Limitations
- When you use this function with multi-dimensional arrays, you cannot apply it multiple times to match the number of dimensions, for example.
- Each record must have an associated 64-bit record number, which uniquely identifies a row within its partition. Configure the format of this number using the RECORD_NUMBER_FORMAT advanced pipeline option to modify the maximum number of elements that the array to expand can contain. For details, see Advanced Pipeline Tuning Options.
- The system uses record numbers for deduplication, so changing the RECORD_NUMBER_FORMAT option of a pipeline can change its deduplication behavior. Always create a new pipeline when you change this option to make rows deduplicate properly.
- To make rows deduplicate properly when you do not set this option, always create a new pipeline when adding an EXPLODE_OUTER statement to your pipeline or removing all such statements from your pipeline.
Syntax
Argument | Data Type | Description |
---|---|---|
array | Array | An array to expand. |
Example
Create the test_table table to load data with three columns where the first two columns are integers, and the third is a string.
Preview the load using this inline source data in JSON format: {"id": 1, "int_array": [1, 2, 3], "string_array": ["a", "b"]}. Use the EXPLODE_OUTER function to expand the elements of the arrays in the second and third JSON data values.
Output
The output shows that the inline array values are expanded across the individual rows.
The data pipeline functionality supports special transformation functions for arrays that use lambda functions or function expressions to apply SQL operations to individual elements in an array. These functions are not supported in standard Ocient SQL queries.
The supported special functions are:
- transform — Transform individual elements in an array. Apply a transformation expression to every element in the array.
- filter — Filter the elements in an array. Apply a filter function to each element in an array and return an array of elements matching the filter function.
- zip_with — Combines multiple arrays using a zipping function. Apply a zipping function to N-wise pairs of elements from N arrays and return an array with the results from the zipping function.
Each of these functions takes one or more arrays as input arguments and a lambda or other reference function for applying to the elements in the array. A lambda function is an anonymous function that must have a specific signature and return type for the array transformation function where you use the lambda function.
Transforms an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to transform. The second argument is a lambda expression or other reference function that defines how to transform each element.
The second argument must be of the form (x T) -> U. This specified form returns a value of type U for a specified element x. The lambda function specifies the type T of the element x.
The second argument can also be another reference function for a single argument function.
TRANSFORM returns an array of elements with the same length as the array argument. The returned array has the type returned by the lambda function. For multi-dimensional arrays, the TRANSFORM function applies to the outermost dimension. You can nest TRANSFORM operations to apply them over inner dimensions.
Syntax
Argument | Data Type | Description |
---|---|---|
array | T[] | The array of type T elements can be single or multi-dimensional. The TRANSFORM function only iterates over the first dimension. |
function | Lambda function (x T) -> U or Reference function | A lambda function (e.g., (elem CHAR) -> LEFT(elem, 2)) that takes an element of type T and returns a value of type U, which can be the same type as T or a different type, or a reference function with the correct argument list (e.g., UPPER). |
Examples
Extract the First Initial from Each Element in an Array Using a Lambda Function
With the specified data in the pipeline source record { "data": { "names": ["Jane", "Charles", "Kim"] }}, create a data pipeline using the TRANSFORM function in the SELECT SQL statement with the lambda function (el CHAR) -> LEFT(el,1). This function returns the first character of each element in the input array $data.names[]. The example focuses on the SELECT statement and does not show the rest of the data pipeline definition.
Output: ["J", "C", "K]
Convert All Elements in an Array to Lowercase Using a Reference Function
With the specified data in the pipeline source record { "data": { "names": ["Jane", "Joe", "Jill"] }}, create a data pipeline using the TRANSFORM function in the SELECT SQL statement with the LOWER reference function. This function converts the input strings in $data.names[] to lowercase. The example focuses on the SELECT statement and does not show the rest of the data pipeline definition.
Output: ["jane", "joe", "jill"]
Filters elements in an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to filter. The second argument is a lambda expression or reference function that returns true if the system should retain the element in the returned array.
The second argument must be of the form (x T) -> BOOL. This form returns a specified value for the specified element x. The lambda function specifies the type T of the element x.
The FILTER function includes elements in the first argument array in the resulting array when the lambda function returns true. Otherwise, the function excludes the elements.
Syntax
Argument | Data Type | Description |
---|---|---|
array | T[] | The array of type T elements can be single or multi-dimensional. The FILTER function only iterates over the first dimension. |
function | Reference function or Lambda function (x T) -> BOOL | A function that takes an element of type T and returns a BOOL. The function can be a lambda function (e.g., (elem CHAR) -> length(elem) > 10) or a function name that has the correct argument list (e.g., UPPER). |
Example
With the specified data in the pipeline source record { "data": { "names": ["Jane", "Charles", "Kim"] }}, filter the names in the input array for elements that are greater than five characters long using the Lambda function (name CHAR) -> length(name) > 5. The example focuses on the SELECT statement and does not show the rest of the data pipeline definition.
Output: ["Charles"]
Merges two or more input arrays, element-wise, into a single array using the trailing combining function. If one array is shorter than the other, the function appends NULLs at the end to match the length of the longest array at the start.
Returns an array with elements that result from applying the function to each pair of elements from array1 and array2. If the arrays are not the same length, the function pads the shorter array with NULL values. The function returns an array of length max(length(array1), length(array2)) containing elements of type V as returned by the lambda function.
function must be of the form (x T, y U) -> V where x is the element from array1 and y is the element from array2. The lambda function specifies the types T and U. When zipping more arrays together, use the format (x T1, y T2, z T3) -> V where the arguments to the lambda function align with the order of the input arrays of the ZIP_WITH function.
Syntax
Argument | Data Type | Description |
---|---|---|
array1 | T1[] | The array of type T1 elements can be single or multi-dimensional. The ZIP_WITH function only iterates over the first dimension. |
array2 | T2[] | The array of type T2 elements can be single or multi-dimensional. The ZIP_WITH function only iterates over the first dimension. |
function | Lambda function (x T) -> BOOL or Reference function | A function that takes an element of type T and returns a BOOL. The function can be a lambda function (e.g., (elem CHAR) -> length(elem) > 10) or a function name that has the correct argument list (e.g., UPPER). |
Example
With the specified data in the pipeline source record { "line_items": [{ "product_name": "suede shoes", "sku": "FT-1023-BLU" }, { "product_name": "cheap sunglasses", "sku": "ZZ-7821-K" }] }, concatenate the product name $line_items[].product_name strings with the SKU $line_items[].sku strings as the format "name(sku)". The example focuses on the SELECT statement and does not show the rest of the data pipeline definition.
Output: ["suede shoes (FT-1023-BLU)", "cheap sunglasses (ZZ-7821-K)"]
A lambda function is a temporary function that takes zero or more named arguments as a parenthesis-enclosed list and returns an expression. The expression can reference objects defined within the closure of the expression, which includes locally named arguments that are locally defined within an enclosing scope, field references, literals, and so on.
Syntax
Define lambda functions using this syntax where T1 represents an element of type T1 and U represents the return type of the function.
When you use this function in a SQL statement, the syntax requires an argument name and the explicit designation of the type of each argument, but the return type U can be inferred. For example, define a lambda function that accepts two arguments of type CHAR and INT named x and y, respectively.
Examples
String Manipulation Using a Lambda Function
Convert x to uppercase and return the first character.
Concatenate Using a Lambda Function
Concatenate two strings (e.g., first and last name) with a space between them.
A reference function points to an existing function in the Ocient function list. Specifying the function name allows you to use that function in contexts where you would otherwise need to create a full lambda function.
The signature of the function must match the argument structure and return value requirements.
Examples
The reference function UPCASE is equivalent to the lambda function (x CHAR) -> UPCASE(x) because the syntax of UPCASE accepts one argument of type CHAR and returns a value of type CHAR.
Use the UPCASE function with the TRANSFORM function to capitalize all names in an array. These two SQL statements are equivalent.
TRANSFORM($names[], UPCASE)
TRANSFORM($names[], (n CHAR) -> UPCASE(n))