# Transform Data in Data Pipelines
Data pipeline functionality supports data transformation during loading to efficiently convert and clean source data for your chosen table schema. You can use two broad classes of data transformation: source field references or SQL transformation functions.

## Source Field References

When you load data using a data pipeline, you reference extracted data using a special syntax named a source field reference. These expressions are either a named reference such as `$my_field` or a numeric field reference such as `$1`. In structured data formats like JSON, more complex source field references allow you to transform the structured data into the required form. For example, you can extract an array of values from an array of nested objects with an expression such as `$data.items[].price`.

To learn more about transforming data with source field references, see Data Formats for Data Pipelines (docid: 1 jhzrhblgnqucl6skiaq) and Data Types for Data Pipelines (docid: 7s5nztl8lolpwt2pcnjyz).

## Supported Transformation Functions in Data Pipelines

Data pipeline loading supports numerous SQL functions for use in transforming data during loading. You can use these transformation functions in the `INTO ... SELECT` portion of a pipeline. Transformation functions generally match the syntax and behavior of the function you use when you query in Ocient. The behavior might differ from querying behavior in two ways:

- Some additional functions exist that provide behavior that does not exist in queries.
- Some existing functions have extended behavior during loading that provides important capabilities that are not in queries.

The types of transformation functions supported in loading include:

- Scalar data conversion and casting of data types
- Special loading transformation functions
- Date and time transformations
- Character data transformations
- Binary data transformations
- Array data transformations
- Logical operations transformations
- Network type functions
- Other data transformations
- Custom user-defined functions written in Groovy (see Data Pipelines (docid: l8tdfpfzzvzeyabc2h7bq))
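For orientation, the following fragment sketches how source field references and transformation functions might appear together in the `INTO ... SELECT` portion of a pipeline. It is illustrative only: the table name and field references are hypothetical, and the surrounding `CREATE PIPELINE`, source, and format clauses are omitted.

```sql
-- Hypothetical INTO ... SELECT fragment of a CREATE PIPELINE statement.
-- Table and field names are invented; source and format clauses are omitted.
INTO orders SELECT
    $1                   AS order_id,     -- numeric field reference (delimited data)
    $customer_name       AS customer,     -- named field reference
    $address.city        AS city,         -- nested JSON object reference
    $data.items[].price  AS item_prices   -- array extracted from nested JSON objects
```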
### Scalar Transformation Functions and Casting

Scalar transformation functions include constructors, scalar data conversions, and data casting functions.

| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
| --- | --- | --- | --- | --- |
| `t[](x)` | Casts a string representation of an array of values, or the result of a JSON array extraction expression, to the corresponding array with data type `t`, where `t` is a SQL data type. Multi-dimensional arrays are supported; to indicate an n-dimensional array cast, use n sets of square brackets in the casting function. All transformation functions can only apply to an array as a whole. To apply functions on an element-wise basis, see `transform` in Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu). | CHAR | `t[]` | CSV: `decimal(8,4)[][]($21) AS decimal_2d_array_col`<br>JSON: `char[]($a.b.names[]) AS names`<br>`decimal(8,2)[][]($a[].b.c[].prices) AS prices`<br>`char[][][]($names[][][]) AS names` |
| `bigint(x)` | Converts the specified type to a 64-bit signed integer. | CHAR | BIGINT | CSV: `bigint($22) AS int64_col`<br>JSON: `bigint($a.b.c) AS int64_col` |
| `binary(x)` | Translates a string representation of bytes and converts them to real bytes. The string representation expects hexadecimal representations that convert to the matching bytes with this function. Hexadecimal data is not case-sensitive. Example source data: `'A891'`, `'F1D890'`. | CHAR | BINARY or VARBINARY | CSV: `binary($30) AS varbinary_col`<br>JSON: `binary($a.b.c) AS varbinary_col` |
| `boolean(x)` | Handles textual representations of Booleans and converts them to the Boolean equivalent. You can represent true as any of these values: 't', 'true', 'yes', '1', 'y', 'on'. You can represent false as any of these values: 'f', 'false', 'no', '0', 'n', 'off'. | CHAR | BOOL | CSV: `boolean($23) AS bool_col`<br>JSON: `boolean($a.b.c) AS bool_col` |
| `byte(x)` | Supports characters and number or byte array representations. Interprets character data as the ASCII numeric representation of a byte. If you specify a byte array, the Ocient system uses the first element of the array. When the system converts text data to a byte, the function throws an error if the value is > 127 or < -128. Examples: `'101'` → 101, `'-23'` → -23. | CHAR | BYTE | CSV: `byte($24) AS byte_col`<br>JSON: `byte($a.b.c) AS byte_col` |
| `char(x)` | Converts the specified input type to its string representation. | CHAR, BINARY, FLOAT, DOUBLE, INT, BYTE, SMALLINT, DATE, BOOLEAN, UUID, ARRAY | CHAR or VARCHAR | CSV: `char($25) AS varchar_col`<br>`char($26) AS char_col`<br>JSON: `char($a.b.c) AS varchar_col`<br>`char($a.b.c) AS char_col`<br>`char($array_field[]) AS char_col` |
| `decimal(x,p,s)` | `x` is input data to convert to a decimal, `p` is the precision of the decimal (number of significant digits), and `s` is the scale (number of digits after the decimal point). You can omit `p` and `s`, or only `s`; in that case, the Ocient system defaults `p` to 31 and `s` to 16. Example: `decimal(30.0, 4, 2)` → 30.00. | CHAR | DECIMAL | CSV: `decimal($27, 8, 4) AS decimal_col`<br>JSON: `decimal($a.b.c, 8, 4) AS decimal_col` |
| `double(x)` | Converts the specified type to its double representation. | BINARY, CHAR | DOUBLE | CSV: `double($28) AS double_col`<br>JSON: `double($a.b.c) AS double_col` |
| `float(x)` | Converts the specified type to its float representation. | BINARY, CHAR | FLOAT | CSV: `float($29) AS float_col`<br>JSON: `float($a.b.c) AS float_col` |
| `hash(x,n)` | Translates a string representation of bytes and converts them to bytes. The string representation expects hexadecimal representations that convert to the matching bytes with this function. Hexadecimal data is not case-sensitive. `x` is the source data, such as `'A891'` or `'F1D890'`. `n` is the fixed number of bytes in the binary array. | CHAR | HASH | CSV: `hash($23, 9) AS fixed_hash_binary_col`<br>JSON: `hash($a.b.c, 9) AS fixed_hash_binary_col` |
| `int(x)` | Converts the specified type to a 32-bit signed integer. | DECIMAL, CHAR, SMALLINT | INT | CSV: `int($31) AS int32_col`<br>JSON: `int($a.b.c) AS int32_col` |
| `ip(x)` | Converts the specified string to an IPv6 address equivalent. `x` is the source data, such as `2001:0db8:1234:5678:abcd:ef01:2345:6789`, `2001:db8:0:0:0:0:0:2`, or `192.0.2.1`. | CHAR | IP | CSV: `ip($32) AS ip_col`<br>JSON: `ip($a.b.c) AS ip_col` |
| `ipv4(x)` | Converts the specified string to an IPv4 address equivalent. `x` is the source data, such as `192.0.2.1`. | CHAR | IPV4 | CSV: `ipv4($32) AS ipv4_col`<br>JSON: `ipv4($a.b.c) AS ipv4_col` |
| `smallint(x)` | Converts the specified value to its 16-bit signed integer representation. | DECIMAL, CHAR, BYTE | SMALLINT | CSV: `smallint($33) AS int16_col`<br>JSON: `smallint($a.b.c) AS int16_col` |
| `st_point(long, lat)` | Accepts a longitude and latitude coordinate for a geospatial point. Arguments must be of type DOUBLE or string values that can convert to DOUBLE. For WKT representation, see Point Constructors (docid: c aaugo19qr znvjihgsy). | both DOUBLE or both CHAR | ST_POINT | CSV: `st_point($1, $2) AS st_point_col`<br>JSON: `st_point($a.b.c, $d.e.f) AS st_point_col` |
| `st_pointfromtext(x)` | Accepts the WKT text representation for ST_POINT with X and Y coordinates only. Example WKT representation: `'POINT(-40.5318 11.8532)'`. | CHAR | ST_POINT | `st_pointfromtext($1)`<br>`st_pointfromtext($a.b.c)` |
| `tinyint(x)` | Converts the specified value to an 8-bit signed integer. | CHAR | TINYINT | CSV: `tinyint($34) AS int8_col`<br>JSON: `tinyint($a.b.c) AS int8_col` |
| `tuple<<t[,t...]>>(x[,y...])` | Creates a tuple from the specified types. Tuple supports arrays of tuples, but does not support tuples within tuples. The arguments to the tuple casting function can be either a single source value or must match the number of types defined in the tuple. If you specify a single value, the Ocient system infers individual elements of the tuple using the tuple format `(value1,value2,value3)`. Example: source data `(123,12.75)` is valid text data for a `tuple<<int,decimal(7,2)>>`. | CHAR | TUPLE of the specified types | CSV: `tuple<<int,decimal(7,2),timestamp,binary(7)>>($37) AS int_decimal_timestamp_binary_tuple_col`<br>Multi-input: `tuple<<int,decimal(7,2),timestamp>>($4, $8, $1) AS int_decimal_timestamp_tuple_col`<br>JSON: `tuple<<int,decimal(7,2),timestamp>>($a, $b, $c) AS int_decimal_timestamp_tuple_col` |
| `uuid(x)` | Creates a unique identifier from the specified value. Example source data: `'237e9877-e79b-12d4-a765-321741963000'`. | CHAR | UUID | CSV: `uuid($34) AS uuid_col`<br>JSON: `uuid($a.b.c) AS uuid_col` |
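A pipeline commonly combines several of these casts in one SELECT list. The following fragment is a minimal sketch: the table name, field positions, and column names are invented, and the surrounding pipeline clauses are omitted.

```sql
-- Hypothetical SELECT list combining scalar casts from the table above.
INTO readings SELECT
    int($1)              AS reading_id,
    decimal($2, 8, 4)    AS reading_value,  -- precision 8, scale 4
    boolean($3)          AS is_valid,       -- accepts 't', 'yes', '1', 'y', 'on', ...
    binary($4)           AS payload,        -- hex text such as 'A891' -> bytes
    char[]($names[])     AS name_array,     -- cast a JSON array to CHAR[]
    st_point($5, $6)     AS location        -- longitude, latitude
```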
### Special Data Pipeline Transformation Functions

Special data pipeline transformation functions include functions that only exist in loading, such as the `metadata` function to load special data during pipeline execution.

| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
| --- | --- | --- | --- | --- |
| `explode_outer(array)` | Expands the contents of the specified one-dimensional array. By default, the array can contain a maximum of 8,191 elements. For supported options, see Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu). | ARRAY | Elements of the array, each in its own row | `explode_outer($data.response.responses[].own) AS own` |
| `filter(t[], (t) -> boolean)` | Filters elements in an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to filter. The second argument is a lambda expression (or function reference) that returns true if an element should be retained. The second argument must be of the form `(x T) -> BOOL`; this returns a Boolean value for a specified element `x`. The type `T` of the element `x` is specified in the lambda function. Elements in the first argument array are included in the resulting array when the lambda function returns true and excluded otherwise. For supported options and examples, see Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu). | ARRAY; lambda function `(x T) -> BOOL` or function reference | The elements of the array that matched the filter criteria | `filter($data.names, (name CHAR) -> length(name) > 5)` |
| `metadata(key)` | Extracts the metadata value for the specified key from available metadata for the pipeline. For supported key values, see Load Metadata Values in Data Pipelines (docid: a3n4wkcawrpo1gtefetmm). | CHAR | Varies based on the metadata key | `metadata('filename')` → `'/path/to/file/orders_1830374.json'` |
| `parse_delimited_array(text, rank, array_element_delimiter, bracket_spec)` | Converts a string of text data representing an array into a CHAR[]. For example, with data in the form `[1;2;3]`, executing `parse_delimited_array('[1;2;3]', 1, ';', '[]')` returns a CHAR[] with the value `['1','2','3']`. `text` is CHAR data that represents the text to convert to an array. `rank` is an optional argument that represents the numeric rank of the output array; the default value is 1. You can only specify `rank` as a literal value because you cannot retrieve it dynamically from a source field selector. `array_element_delimiter` is an optional argument that contains the control character for delimiting array elements; for example, ',' indicates that a comma separates each element in the text character data. The default value comes from the ARRAY_ELEMENT_DELIMITER value of the data pipeline. `bracket_spec` is an optional argument that is the pair of control characters for the open and close array brackets; for example, '[]' indicates that each rank of the array is enclosed in square brackets. The default comes from the OPEN_ARRAY and CLOSE_ARRAY characters of the data pipeline. Valid options are `''`, `'[]'`, `'{}'`, `'()'`, and `'<>'`. | CHAR | CHAR[] | `parse_delimited_array($text_array)`<br>`parse_delimited_array($text_array, 1, ';', '[]')`<br>`parse_delimited_array($two_dim_text_array, 2, ',', '[]')` |
| `transform(t[], (t) -> u)` | Transforms an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to transform. The second argument is a lambda expression (or function reference) that defines how to transform each element. `transform` returns an array of elements of the same length as the array argument, of the type returned by the lambda function. The second argument must be of the form `(x T) -> U`; this returns a value of type `U` for a specified element `x`. The type `T` of the element `x` is specified in the lambda function. The second argument can also be a function reference for a single-argument function. For supported options and examples, see Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu). | ARRAY; lambda function `(x) -> U` or function reference | Elements of the array transformed to the return value and type of the anonymous function | `transform($data.names[], (el CHAR) -> lower(el)) AS lower_case_name_array`<br>`transform($data.names[], lower) AS lower_case_name_array` |
| `width_bucket(x,min,max,n)` | Creates `n` equal-width buckets in the range [min,max) as a histogram. If `x` is less than `min`, the function returns 0. If `x` is greater than or equal to the value of `max`, the function returns n+1. If `x` is less than the value of `max` and greater than or equal to the value of `min`, the function returns the number of the bucket corresponding to where the input value belongs in the histogram. The values are inclusive on the lower bound and exclusive on the upper bound, so all buckets are the same width. Requesting 100 buckets retrieves 102 possible buckets, bucket 0 through 101, where 0 captures all values below the minimum and 101 captures all values at the maximum and beyond. `x` is the value for bucket determination, `min` is the minimum value for the histogram, `max` is the maximum value for the histogram, and `n` is the number of buckets in the histogram. | Numeric value | INTEGER | `width_bucket($2, 1, 5, 2) AS output` |
| `zip_with(array1, array2, ..., func)` | Merges two or more input arrays, element-wise, into a single array using the trailing combining function. If one array is shorter than the others, the system appends NULL values at the end to match the length of the longest array before applying the function. Returns an array with elements that result from applying the function `func` to each pair of elements from `array1` and `array2`. `func` must be of the form `(x T, y U) -> V`, where `x` is the element from `array1` and `y` is the element from `array2`. Types `T` and `U` are specified in the lambda function. For supported options and examples, see Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu). | ARRAY, ARRAY; lambda function `(x T, y U) -> V` or function reference | An array of length max(length(array1), length(array2)) containing elements of type `V` as returned by the lambda function | `zip_with($line_items[].product_name, $line_items[].sku, (name CHAR, sku CHAR) -> concat(name, '[', sku, ']'))` |
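The lambda-based functions compose naturally in one SELECT list. The sketch below combines the examples from the table above; the source fields are hypothetical, and each lambda follows the `(x T) -> expression` form described there.

```sql
-- Hypothetical SELECT list using lambda-based array functions.
INTO products SELECT
    filter($data.names[], (name CHAR) -> length(name) > 5)  AS long_names,
    transform($data.names[], (el CHAR) -> lower(el))         AS lower_names,
    zip_with($line_items[].product_name, $line_items[].sku,
             (name CHAR, sku CHAR) -> concat(name, '[', sku, ']')) AS labeled_items
```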
### Date and Time Transformation Functions

Date and time transformation functions provide utilities to parse, convert, or transform DATE, TIME, and TIMESTAMP values in pipelines.

| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
| --- | --- | --- | --- | --- |
| `date(x)` | Converts the specified value to a DATE value. This function always creates dates from a timestamp in UTC time. If the argument is a CHAR type, the function assumes that the value has a textual representation in the `YYYY-MM-DD` format. Supports any date separator character; for example, `YYYY:MM:DD` is also valid. | CHAR, TIMESTAMP | DATE | CSV: `date($24) AS date_col`<br>`date(timestamp($24)) AS date_col`<br>JSON: `date($a.b.c) AS date_col`<br>`date(timestamp($a.b.c)) AS date_col` |
| `int_to_date(arg)` | Converts the specified value to a DATE. This function creates a DATE type from an INT type representing the number of days after the Unix epoch (January 1, 1970). | INT | DATE | `int_to_date($1) AS col_date_1` |
| `micros_to_time(arg)` | Converts the specified value to the TIME type, treating the argument with microsecond precision. | BIGINT | TIME | `micros_to_time($1) AS col_time` |
| `micros_to_timestamp(x)` | Converts the specified value to a TIMESTAMP. This function creates a timestamp from a BIGINT that represents the number of microseconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | `micros_to_timestamp($35) AS timestamp_col` |
| `millis_to_time(arg)` | Converts the specified value to the TIME type, treating the argument with millisecond precision. Alias for `time(x)` where `x` is the BIGINT type. | BIGINT | TIME | `millis_to_time($1) AS col_time` |
| `millis_to_timestamp(x)` | Converts the specified value to a TIMESTAMP. This function creates a timestamp from a BIGINT that represents the number of milliseconds after the Unix epoch (January 1, 1970). Alias for `timestamp(x)` where `x` is the BIGINT type. | BIGINT | TIMESTAMP | `millis_to_timestamp($35) AS timestamp_col` |
| `nanos_to_time(x)` | Converts the specified value to a TIME value. If the argument has the BIGINT type, the function treats the number with nanosecond precision. | BIGINT | TIME | CSV: `nanos_to_time($34) AS time_col`<br>`nanos_to_time(bigint($34)) AS time_col`<br>JSON: `nanos_to_time($a.b.c) AS time_col`<br>`nanos_to_time(bigint($a.b.c)) AS time_col` |
| `nanos_to_timestamp(x)` | Converts the specified value to a TIMESTAMP. This function creates a timestamp from a BIGINT that represents the number of nanoseconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | CSV: `nanos_to_timestamp($35) AS timestamp_col`<br>JSON: `nanos_to_timestamp($a.b.c) AS timestamp_col`<br>`nanos_to_timestamp(bigint(1651681788123456789))` → `2022-05-04T16:29:48.123456789` |
| `now()` | Returns the current timestamp. | None | TIMESTAMP | `now() AS col_load_timestamp` |
| `seconds_to_time(arg)` | Converts the specified value to the TIME type, treating the argument with second precision. | BIGINT | TIME | `seconds_to_time($1) AS col_time` |
| `seconds_to_timestamp(x)` | Converts the specified value to a TIMESTAMP. This function creates a timestamp from a BIGINT that represents the number of seconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | CSV: `seconds_to_timestamp($35) AS timestamp_col`<br>JSON: `seconds_to_timestamp($a.b.c) AS timestamp_col`<br>`seconds_to_timestamp(bigint(1651681788))` → `2022-05-04T16:29:48` |
| `time(x)` | Converts the specified value to a TIME value. If the argument has the BIGINT type, the function treats the number with millisecond precision. If the argument is a CHAR type, the function assumes that the value has a textual representation in the `HH:MM:SS` format. Supports any separation character; for example, `HH/MM/SS` is also valid. | CHAR, BIGINT, TIMESTAMP | TIME | CSV: `time($34) AS time_col`<br>`time(bigint($34)) AS time_col`<br>JSON: `time($a.b.c) AS time_col`<br>`time(bigint($a.b.c)) AS time_col` |
| `timestamp(x)` | Converts the specified value to a TIMESTAMP. This function creates a timestamp from a BIGINT that represents the number of milliseconds after the Unix epoch (January 1, 1970). When you convert from the DATE type, the function defaults the timestamp value to the start of the day in UTC time. If the argument is a CHAR type, the function assumes that the value has a textual representation in the `YYYY-MM-DD HH:MM:SS[.SSSSSSSSS]` format; in this case, the nanoseconds are optional. The delimiters in the date-time value cannot be in a different format than the default value. | CHAR, BIGINT, DATE | TIMESTAMP | CSV: `timestamp($35) AS timestamp_col`<br>JSON: `timestamp($a.b.c) AS timestamp_col`<br>`timestamp('2023-03-01 18:22:58.826')` → `2023-03-01T18:22:58.826`<br>`timestamp(bigint(1651681788123))` → `2022-05-04T16:29:48.123` |
| `to_date(x,f)` | Converts the specified value to a DATE. `x` is the value to convert. `f` (optional) is the format string for the interpretation of the textual representation of `x`; the default format is `YYYY-MM-DD`. For details, see Formatting Functions (docid: u47hjysi1oojk6eg2kjut). | CHAR | DATE | `to_date($3, 'YYYY/MM/DD')`<br>`to_date($a.b.c, 'DD-MM-YYYY')` |
| `to_time(x,f)` | Converts the specified value to a TIME value. `x` is the value to convert. `f` (optional) is the format string for the interpretation of the textual representation of `x`; the default format is `HH:MI:SS`. | CHAR | TIME | `to_time($n, 'HH:MI:SS')`<br>`to_time($n, 'HHMISS')` |
| `to_timestamp(x,f,ft)` | Converts the specified value to a TIMESTAMP. `x` is the value to convert. `f` (optional) is the format string for the interpretation of the textual representation of `x`; the default format is `YYYY-MM-DD HH:MI:SS`. `ft` (optional) is the format type. There are two format types: "ocient" and "java". If you use the "ocient" type, the function uses the Ocient-based formatting function to parse the timestamp. If you use the "java" type, the function uses the DateTimeFormatter in Java to parse the timestamp; for details, see [DateTimeFormatter (Java Platform SE 8)](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html). The default is the "ocient" type. If you specify "java", you must specify a valid DateTimeFormatter format string, as there is no default. For details, see Formatting Functions (docid: u47hjysi1oojk6eg2kjut). | CHAR | TIMESTAMP | `to_timestamp($n, 'YYYY-MM-DD HH:MI:SS.US')`<br>`to_timestamp($n, 'DD-MM-YYYY HH:MI:SS.NS')`<br>`to_timestamp($n, 'uuuu-MM-dd HH:mm:ss.SSSSSS', 'java')` |

For the `to_timestamp` function, the Ocient formatter currently does not support variable-length sections of timestamps (i.e., the data might have a different number of fractions of seconds across records). For example, assume this source data: `2024-01-01 12:00:00.123456` and `2024-01-01 12:00:00.1234`. Then, you must specify the format type as "java" and enter the DateTimeFormatter format string to parse the timestamps.
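As a quick illustration, a hypothetical SELECT list might mix several of these conversions; the field names below are invented, and the format strings mirror those shown in the table.

```sql
-- Hypothetical SELECT list mixing date and time conversions.
INTO events SELECT
    to_timestamp($event_time, 'YYYY-MM-DD HH:MI:SS.US') AS event_ts,    -- microsecond text
    millis_to_timestamp(bigint($epoch_ms))              AS created_ts,  -- epoch milliseconds
    date(timestamp($created_at))                        AS created_date,
    now()                                               AS load_ts      -- load-time timestamp
```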
### Character Data Transformation Functions

Character data transformation functions operate on VARCHAR data to transform a string of data.
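As a minimal sketch, the fragment below applies string functions named in the note that follows (`substring`, `length`) to source fields; the argument conventions, table name, and field names are illustrative. Casting with `char()` first ensures multi-byte text is handled correctly, as the note explains.

```sql
-- Minimal sketch: cast to CHAR before applying string functions
-- so multi-byte (e.g., UTF-8) input is handled correctly.
INTO customers SELECT
    substring(char($description), 1, 100) AS short_description,
    length(char($full_name))              AS name_length
```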
When you use the `substring`, `right`, and `left` functions with delimited or JSON-formatted data sources, the Ocient system does not treat a source field reference (e.g., `$your_field`) as multi-byte character data; each byte is treated as a single-byte ASCII character. To handle multi-byte character data like UTF-8 in a source field, you must cast the source field using `char($your_field)` or use another function that returns CHAR data.

When you use the `length`, `char_length`, and `character_length` functions with the `substring`, `right`, and `left` functions, you might experience unintended results for multi-byte characters, such as Arabic in UTF-8, if you do not cast the source field reference to CHAR.

### Binary Data Transformation Functions

Binary data transformation functions operate on bytes provided when you use the binary format type in a pipeline. These functions are typically used in processing mainframe data outputs.
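As a minimal, hypothetical sketch of byte-oriented shaping, the fragment below uses the `binary()` and `hash()` casts from the scalar table on binary-oriented fields; all table and field names are invented.

```sql
-- Hypothetical sketch using scalar byte casts on binary-oriented fields.
INTO raw_records SELECT
    binary($payload_hex)  AS payload,    -- hex text such as 'F1D890' -> bytes
    hash($key_hex, 9)     AS record_key  -- fixed 9-byte binary array
```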
### Array Data Transformation Functions

### Logical Operations Transformation Functions

These transformation functions execute logical operations on the data in the data pipeline.

### Network Type Functions

Network type functions are also supported in loading. For example, `subnet(ip($9), 13) AS ipv6_subnet_form1` in the CREATE PIPELINE SELECT SQL statement converts the input data in the column to an IPv6-mapped IPv4 address and then computes the prefix of size 13 for that address. For the supported network type functions, see Network Type Functions (docid: bjcnqtthbmpj thiwmdlw).

### Other Data Transformation Functions

Other data transformation functions include the extraction of values from the JSON format and miscellaneous functions such as `coalesce`, which are commonly used when working with data pipelines.
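Tying the last two groups together, this hypothetical fragment reuses the `subnet` example from the network discussion above alongside `coalesce`; the field references and fallback value are invented.

```sql
-- Hypothetical fragment combining a network function with coalesce.
INTO network_events SELECT
    subnet(ip($9), 13)                 AS ipv6_subnet_form1,  -- prefix of size 13
    coalesce($optional_tag, 'unknown') AS tag                 -- fall back when source is missing
```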
## DEFAULT Keyword

You can use the DEFAULT keyword to include the configured default value of a column in transformation functions. Example: `coalesce($my_field, DEFAULT) AS my_column`.

## Lambda Expressions

Lambda expressions of the form `(x T) -> expression` appear as arguments to array functions such as `filter`, `transform`, and `zip_with`. For details, see Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu).

## User-Defined Data Pipeline Functions

You can create a user-defined data pipeline function using the CREATE PIPELINE FUNCTION SQL statement and Groovy. After you create data pipeline functions, you can execute them in data pipelines like any other function to transform records. For more details, see Data Pipelines (docid: l8tdfpfzzvzeyabc2h7bq) and Data Types for User-Defined Data Pipeline Functions (docid: vk8kyybwfton5ax4e q1i).

## Related Links

- Formatting Functions (docid: u47hjysi1oojk6eg2kjut)
- Data Types for Data Pipelines (docid: 7s5nztl8lolpwt2pcnjyz)
- Data Formats for Data Pipelines (docid: 1 jhzrhblgnqucl6skiaq)
- Load Metadata Values in Data Pipelines (docid: a3n4wkcawrpo1gtefetmm)
- Data Pipeline Load of JSON Data from Kafka (docid: 2ua4kqafcqplu gi6oorz)
- Data Pipeline Load of Parquet Data from S3 (docid: yqk wibdyxiq8dxewhxhf)
- Data Pipeline Load of CSV Data from S3 (docid: 5xxtimfhjnlyxs48wxsxs)