Source Field References
When you load data using a data pipeline, you reference extracted data using a special syntax named a source field reference. These expressions are either a named reference such as$my_field or a numeric field reference such as $1.
In structured data formats like JSON, more complex source field references allow you to transform the structured data into the required form. For example, you can extract an array of values from an array of nested objects with an expression such as $data.items[].price.
To learn more about transforming data with source field references, see:
Supported Transformation Functions in Data Pipelines
Data pipeline loading supports numerous SQL functions for use in transforming data during loading. You can use these transformation functions in theINTO ... SELECT portion of a pipeline. Transformation functions generally match the syntax and behavior of the function you use when you query in . The behavior might differ from querying behavior in two ways:
- Some additional functions exist that provide behavior that does not exist in queries.
- Some existing functions have extended behavior during loading that provides important capabilities that are not in queries.
- Scalar data conversion and casting of data types
- Special loading transformation functions
- Date and time transformations
- Character data transformations
- Binary data transformations
- Array data transformations
- Logical operations transformations
- Network type functions
- Other data transformations
- Custom user-defined Pipeline Functions using
Scalar Transformation Functions and Casting
Scalar transformation functions include constructors, scalar data conversions, and data-casting functions.| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
|---|---|---|---|---|
T[]\(X) | Casts a string representation of an array of values or the result of a JSON array extraction expression to the corresponding array with data type T where T is a SQL data type.Multi-dimensional arrays are supported. To indicate an N-dimensional array cast, use N sets of square brackets in the casting function. All transformation functions can only apply to an array as a whole. To apply functions on an element-wise basis, see TRANSFORM in Special Data Pipeline Transformation Functions. | CHAR | ARRAY[] | CSV:DECIMAL(8,4)[][] ($21) as decimal_2d_array_colJSON: CHAR[]\($a.b.names[]) as namesDECIMAL(8,2)[][]\($a[].b.c[].prices) as pricesCHAR[][][]\($names[][][]) as names |
BIGINT(X) | Converts the specified type to a 64-bit signed integer. | BIGINT CHAR | BIGINT | CSV:BIGINT($22) as int64_colJSON: BIGINT($a.b.c) as int64_col |
BINARY(X) | Translates a string representation of bytes and converts them to real bytes. The string representation expects hexadecimal representations that convert to the matching bytes with this function. Hexadecimal data is not case-sensitive. Example source data: a891, F1D890 | CHAR | BINARY or VARBINARY | CSV:BINARY($30) as varbinary_colJSON: BINARY($a.b.c) as varbinary_col |
BOOLEAN(X) | Handles textual representation of Booleans and converts them to its Boolean equivalent. You can represent true as any of these values:“t”, “true”, “yes”, “1”, “y”, “on”false representation as any of these values:“f”, “false”, “no”, “0”, “n”, “off” | CHAR | BOOL | CSV:BOOLEAN($23) as bool_colJSON: BOOLEAN($a.b.c) as bool_col |
BYTE(X) | Supports characters and number or byte array representations. Interprets character data as the ASCII numeric representation of a byte. If you specify a byte array, the Ocient System uses the first element of the array. When the system converts text data to a byte, the function throws an error if the value is > 127 or < -128.Example: '101' -> 101'-23' -> -23 | CHAR | BYTE | CSV:BYTE($24) as byte_colJSON: BYTE($a.b.c) as byte_col |
CHAR(X) | Converts the specified input type to its string representation. | CHAR BINARY FLOAT DOUBLE INT BYTE SMALLINT DATE BOOLEAN UUID ARRAY | CHAR or VARCHAR | CSV:CHAR($25) as varchar_colCHAR($26) as char_colJSON: CHAR($a.b.c) as varchar_colCHAR($a.b.c) as char_colCHAR($array_field[]) as char_col |
DECIMAL(X,P,S) | X is input data to convert to a decimal.P is the precision of the decimal (number of significant digits) and S is the scale (number of digits after the decimal point).You can omit P and S or only S. In that case, the Ocient System defaults P to 31 and S to 16.Example: DECIMAL(30.0,4,2) -> 30.00 | CHAR | DECIMAL | CSV:DECIMAL($27, 8, 4) as decimal_colJSON: DECIMAL($a.b.c, 8, 4) as decimal_col |
DOUBLE(X) | Converts the specified type to its double representation. | BINARY CHAR | DOUBLE | CSV:DOUBLE($28) as double_colJSON: DOUBLE($a.b.c) as double_col |
FLOAT(X) | Converts the specified type to its float representation. | BINARY CHAR | FLOAT | CSV:FLOAT($29) as float_colJSON: FLOAT($a.b.c) as float_col |
HASH(X,N) | Translates a string representation of bytes and converts them to bytes. The string representation expects hexadecimal representations that convert to the matching bytes with this function. Hexadecimal data is not case-sensitive. X is the source data, such as:a891, F1D890N is the fixed number of bytes in the binary array. | CHAR | HASH | CSV:HASH($23, 9) as fixed_hash_binary_colJSON: HASH($a.b.c, 9) as fixed_hash_binary_col |
INT(X) | Converts the specified type to a 32-bit signed integer. | DECIMAL CHAR SMALLINT | INT | CSV:INT($31) as int32_colJSON: INT($a.b.c) as int32_col |
IP(X) | Converts the specified string to an IPv6 address equivalent.X is the source data, such as:2001:0db8:1234:5678:abcd:ef01:2345:67892001:db8:0:0:0:0:0:2192.0.2.1 | CHAR | IP | CSV:IP($32) as ip_colJSON: IP($a.b.c) as ip_col |
IPV4(X) | Converts the specified string to an IPv4 address equivalent.X is the source data, such as:192.0.2.1 | CHAR | IPV4 | CSV:IPV4($32) as ipv4_colJSON: IPV4($a.b.c) as ipv4_col |
SMALLINT(X) | Converts the specified value to its 16-bit signed integer representation. | DECIMAL CHAR BYTE | SMALLINT | CSV:SMALLINT($33) as int16_colJSON: SMALLINT($a.b.c) as int16_col |
ST_POINT(Long, Lat) | Accepts a longitude and latitude coordinate for a geospatial point. Arguments must be of type double or string values that can convert to double.For WKT representation, see ST_POINTFROMTEXT. | Both DOUBLE Both CHAR | ST_POINT | CSV:ST_POINT($1, $2) as st_point_colJSON: ST_POINT($a.b.c, $d.e.f) as st_point_col |
ST_POINTFROMTEXT(X) | Accepts the WKT text representation for ST_POINT with X and Y coordinates only.Example WKT representation: 'POINT(-40.5318 11.8532)' | CHAR | ST_POINT | st_pointfromtext($1)st_pointfromtext($a.b.c) |
TINYINT(X) | Converts the specified value to an 8-bit signed integer. | CHAR | TINYINT | CSV:TINYINT($34) as int8_colJSON: TINYINT($a.b.c) as int8_col |
TUPLE <<T,[T]*>>(X,[Y]*) | Creates a tuple from the specified types. TUPLE supports arrays of tuples, but does not support tuples within tuples. The arguments to the TUPLE casting function can be either a single source value or must match the number of types defined in the TUPLE. If you specify a single value, the Ocient System infers individual elements of the tuple using the TUPLE format (value1,value2,value3).Example source data: (123,12.75) is valid text data for a TUPLE<<INT,DECIMAL(7,2)>>. | CHAR | TUPLE of type | CSV:TUPLE<<INT,DECIMAL(7,2),TIMESTAMP,BINARY(7)>>($37) as int_decimal_timestamp_binary_tuple_colTUPLE<<INT,DECIMAL(7,2),TIMESTAMP,BINARY(7)>>($4,$8,$1) as int_decimal_timestamp_binary_tuple_col_multi_inputJSON: TUPLE<<INT,DECIMAL(7,2),TIMESTAMP,BINARY(7)>>($a, $b, $c)) as int_decimal_timestamp_binary_tuple_col |
UUID(X) | Creates a unique identifier from the specified value. Example source data: '237e9877-e79b-12d4-a765-321741963000' | CHAR | UUID | CSV:UUID($34) as uuid_colJSON: UUID($a.b.c) as uuid_col |
Special Data Pipeline Transformation Functions
Special data pipeline transformation functions include functions that only exist in loading, such as theMETADATA function to load special data during pipeline execution.
| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
|---|---|---|---|---|
EXPLODE_OUTER(array) | Expands the contents of the specified one-dimensional array. By default, the array can contain a maximum of 8,191 elements. For supported options, see Special Data Pipeline Transformation Functions. | ARRAY | Elements of the array in their own row. | EXPLODE_OUTER($"data.response.responses[].own") AS own |
FILTER(T[], (T) -> Boolean) | Filters elements in an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to filter. The second argument is a lambda expression (or function reference) that returns true if an element should be retained.The second argument must be of the form (x T) -> BOOL. This returns a boolean value for a specified element x. The type T of the element x is specified in the lambda function.Elements in the first argument array are included in the resulting array when the lambda function returns true and excluded otherwise.For supported options and examples, see Special Data Pipeline Transformation Functions. | ARRAY Lambda Function (x T) -> BOOL or Function Reference | The elements of the array that matched the filter criteria function. | FILTER($data.names, (name CHAR) -> length(name) > 5) |
LOOKUP(lookup_source_name, value, join_column_name, return_column_name) | Look up and load data in an external data source. For details, see Load Data from External Sources in Data Pipelines . | Varies based on the value argument. | Varies based on the returned data in the return_column_name argument. | LOOKUP('sqlite_db', $2, 'firstname', 'id') |
METADATA(key) | Extracts the metadata value for the specified key from available metadata for the pipeline. For supported key values, see Load Metadata and File-Based Partitioned Data in Data Pipelines. | CHAR | Varies based on the metadata key. | METADATA('filename') → '/path/to/file/orders_1830374.json' |
PARSE_DELIMITED_ARRAY(text, rank, array_element_delimiter, bracket_spec) | Converts a string of text data representing an array into a CHAR[].For example, with data in the form [1;2;3], executing PARSE_DELIMITED_ARRAY('[1;2;3]', 1, ';', '[]') returns a CHAR[] with the value ['1','2','3']. | text is CHAR data that represents the text to convert to an array.rank is an optional argument that represents the numeric rank of the output array. The default value is 1. You can only specify rank as a literal value because you cannot retrieve it dynamically from a source field selector.array_element_delimiter is an optional argument that contains the control character for delimiting array elements. For example, ',' indicates that a comma separates each element in the text character data. The default value definition comes from the ARRAY_ELEMENT_DELIMITER value of the data pipeline.bracket_spec is an optional argument that is the pair of control characters for the open and close array brackets. For example, '[]' indicates that each rank of the array is enclosed in square brackets. The default definition comes from the OPEN_ARRAY and CLOSE_ARRAY characters of the data pipeline. Valid options are: '', '[]', '{}', '()', and '<>'. | CHAR[] | PARSE_DELIMITED_ARRAY($text_array)PARSE_DELIMITED_ARRAY($text_array, 1, ';', '[]')PARSE_DELIMITED_ARRAY($two_dim_text_array, 2, ',', '[]') |
TRANSFORM(T[], (T) -> U) | Transforms an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to transform. The second argument is a lambda expression (or function reference) that defines how to transform each element.TRANSFORM returns an array of elements of the same length as the array argument of the type returned by the lambda function.The second argument must be of the form (x T) -> U . This returns a value of type U for a specified element x. The type T of the element x is specified in the lambda function.The second argument can also be a function reference for a single argument function. For supported options and examples, see Special Data Pipeline Transformation Functions. | ARRAY Lambda Function (x) -> U or Function Reference | Elements of the array transformed to the return value and type of the anonymous function. | TRANSFORM($data.names[], (el CHAR) -> LOWER(el)) as lower_case_name_array TRANSFORM($data.names[], LOWER) as lower_case_name_array |
WIDTH_BUCKET(X,min,max,N) | Creates N equal-width buckets in the range [min,max) as a histogram. If X is less than min, then the function returns 0. If X is greater than or equal to the value of max, the function returns N+1 buckets. If X is less than the value of max and greater than or equal to the value of min, the function returns the number of the bucket corresponding to where the input value belongs in the histogram.The values are inclusive on the lower bound and exclusive on the upper bound, so all buckets are the same width. Requesting 100 buckets retrieves 102 possible buckets, bucket 0 through 101, where 0 captures all values below the minimum and 101 captures all values at the maximum and beyond. | X is the value for bucket determination.min is the minimum value for the histogram.max is the maximum value for the histogram.N is the number of buckets in the histogram. | Integer | WIDTH_BUCKET('$2', 1, 5, 2) AS output |
ZIP_WITH(array1, ..., array2, func) | Merges two or more input arrays, element-wise, into a single array using the trailing combining function. If one array is shorter than the others, the system appends NULL values at the end to match the length of the longest array before applying.Returns an array with elements that result from applying the function func to each pair of elements from array1 and array2. If the arrays are not the same length, the system pads the shorter array with NULL values.func must be of the form (x T, y U) -> V, where x is the element from array1 and y is the element from array2. Types T and U are specified in the lambda function.For supported options and examples, see Special Data Pipeline Transformation Functions. | ARRAY ARRAY Lambda Function (x T, y U) -> V or Function Reference | An array of length max(length(array1), length(array2)) containing elements of type V as returned by the lambda function. | ZIP_WITH($line_items[].product_name, $line_items[].sku, (name CHAR, sku CHAR) -> concat(name, '[', sku, ']')) |
Date and Time Transformation Functions
Date and time transformation functions provide utilities to parse, convert, or transformDATE, TIME, and TIMESTAMP values in pipelines.
| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
|---|---|---|---|---|
DATE(X) | If the input type is a CHAR, the DATE function parses the string into a DATE type according to the format YYYY[-/]MM[-/]DD.... The format is in this order:4-digit year dash or backlash 2-digit month a dash or backslash 2-digit day any number of trailing characters If the input type is a TIMESTAMP, the function parses the date portion of the timestamp (in UTC) into a DATE type. | CHAR TIMESTAMP | DATE | CSV:DATE($24) as date_colDATE(TIMESTAMP($24)) as date_colJSON: DATE($a.b.c) as date_colDATE(TIMESTAMP($a.b.c)) as date_col |
TO_DATE(X,F,FT) | Converts the specified value to a DATE. X — The value to convert.F — (optional) The format string for the interpretation of the textual representation of X.FT — (optional) The format type. There are two format types: ocient and java. If you use the ocient type, the function uses the Ocient-based formatting function to parse the date. For details, see TO_DATE. If you use the java type, the function uses the DateTimeFormatter in to parse the timestamp. For details, see DateTimeFormatter (Java Platform SE 8 ). The default is the ocient format type with format string YYYY-MM-DD. If you specify the java format type, you must specify a valid DateTimeFormatter format string, as there is no default. | CHAR | DATE | to_date($3, 'YYYY/MM/DD')to_date($a.b.c, 'DD.MM.YYYY') |
INT_TO_DATE(arg) | Creates a DATE type from an INT type representing the number of days after the epoch (January 1, 1970). | INT | DATE | INT_TO_DATE($1) AS col_date_1 |
TIME(X) | If the input type is a CHAR, the TIME function parses the string into a TIME according to the format HH:MI[:SS[.NS]]. The format is a 2-digit hour between 00 and 23, a colon, a 2-digit minute, an optional colon and 2-digit second, and an optional period and 1-9 digits for fractional seconds. The value can end optionally in Z to denote UTC (Zulu). The value can also end optionally in ±TZH[:TZM] to denote the timezone offset. This format is a plus or minus sign, a 2-digit hour, and an optional colon and 2-digit minute. The system resolves all values to UTC, so Z, +00:00, and -00 are equivalent.If the input type is a BIGINT, the function parses the number with millisecond precision.If the input type is a TIMESTAMP, the function parses the date portion of the timestamp (in UTC) into a TIME type. | CHAR BIGINT TIMESTAMP | TIME | CSV:TIME($34) as time_colTIME(BIGINT($34)) as time_colJSON: TIME($a.b.c) as time_colTIME(BIGINT($a.b.c) as time_col |
TO_TIME(X,F,FT) | Converts the specified value to a TIME value. X — The value to convert.F — (optional) The format string for the interpretation of the textual representation of X.FT — (optional) The format type. There are two format types: ocient and java. If you use the ocient type, the function uses the Ocient-based formatting function to parse the time. For details, see TO_TIMESTAMP. If you use the java type, the function uses the DateTimeFormatter in Java to parse the timestamp. For details, see DateTimeFormatter (Java Platform SE 8 ). The default is the ocient format type with format string HH:MI:SS. If you specify the java format type, you must specify a valid DateTimeFormatter format string, as there is no default. | CHAR | TIME | to_time($n, 'HH:MI:SS')to_time($n, 'HHMISS') |
SECONDS_TO_TIME(arg) | Converts the specified value to the TIME type, treating the argument with second precision. | BIGINT | TIME | SECONDS_TO_TIME($1) AS col_time |
MILLIS_TO_TIME(arg) | Converts the specified value to the TIME type, treating the argument with millisecond precision.Alias for TIME(X) where X is the BIGINT type. | BIGINT | TIME | MILLIS_TO_TIME($1) AS col_time |
MICROS_TO_TIME(arg) | Converts the specified value to the TIME type, treating the argument with microsecond precision. | BIGINT | TIME | MICROS_TO_TIME($1) AS col_time |
NANOS_TO_TIME(X) | Converts the specified value to a TIME value. If the argument has the BIGINT type, the function treats the number with nanosecond precision. | BIGINT | TIME | CSV:NANOS_TO_TIME($34) as time_colNANOS_TO_TIME(BIGINT($34)) as time_colJSON: NANOS_TO_TIME($a.b.c) as time_colNANOS_TO_TIME(BIGINT($a.b.c) as time_col |
TIMESTAMP(X) | If the input type is a CHAR, the TIMESTAMP function parses the string into a TIMESTAMP according to the format YYYY[-/]MM[-/]DD[T ]HH:MI[:SS[.NS]]. For the format, see the DATE and TIME functions. The input can have either the letter T or a space separating the date and time portions. The input can also end in Z to denote UTC or ±TZH[:TZM] to denote the time zone offset.If the input type is a BIGINT, the function parses the number with millisecond precision.If the input type is a DATE, the function converts the date into a TIMESTAMP with the value of the start of the day in UTC. | CHAR BIGINT DATE | TIMESTAMP | CSV:TIMESTAMP($35) as timestamp_colJSON: TIMESTAMP($a.b.c) as timestamp_colTIMESTAMP('2023-03-01 18:22:58.826') -> 2023-03-01T18:22:58.826TIMESTAMP(bigint(1651681788123456789)) -> 2022-05-04T16:29:48.123 |
TO_TIMESTAMP(X,F,FT) | Converts the specified value to a TIMESTAMP. X — The value to convert.F — (optional) The format string for the interpretation of the textual representation of X.FT — (optional) The format type. There are two format types: ocient and java. If you use the ocient type, the function uses the Ocient-based formatting function to parse the timestamp. For details, see TO_TIMESTAMP. If you use the java type, the function uses the DateTimeFormatter in Java to parse the timestamp. For details, see DateTimeFormatter (Java Platform SE 8 ). The default is the ocient format type with format string YYYY-MM-DD HH:MI:SS. If you specify the java format type, you must specify a valid DateTimeFormatter format string, as there is no default. | CHAR | TIMESTAMP | to_timestamp($n, 'YYYY-MM-DD HH:MI:SS.US')to_timestamp($n, 'DD.MM.YYYY HH:MI:SS.NS')to_timestamp($n, 'uuuu-mm-dd HH:mm:ss.SSSSSS', 'java') |
SECONDS_TO_TIMESTAMP(X) | Converts the specified value to a TIMESTAMP. This function creates a TIMESTAMP from a BIGINT that represents the number of nanoseconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | CSV:SECONDS_TO_TIMESTAMP($35) as timestamp_colJSON: SECONDS_TO_TIMESTAMP($a.b.c) as timestamp_colNANOS_TO_TIMESTAMP(bigint(1651681788123456789)) -> 2022-05-04T16:29:48 |
MILLIS_TO_TIMESTAMP(X) | Converts the specified value to a TIMESTAMP. This function creates a TIMESTAMP from a BIGINT that represents the number of milliseconds after the Unix epoch (January 1, 1970). Alias for TIMESTAMP(X) where X is the BIGINT type. | BIGINT | TIMESTAMP | MILLIS_TO_TIMESTAMP($35) as timestamp_col |
MICROS_TO_TIMESTAMP(X) | Converts the specified value to a TIMESTAMP. This function creates a TIMESTAMP from a BIGINT that represents the number of microseconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | MICROS_TO_TIMESTAMP($35) as timestamp_col |
NANOS_TO_TIMESTAMP(X) | Converts the specified value to a TIMESTAMP. This function creates a TIMESTAMP from a BIGINT that represents the number of nanoseconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | CSV:NANOS_TO_TIMESTAMP($35) as timestamp_colJSON: NANOS_TO_TIMESTAMP($a.b.c) as timestamp_colNANOS_TO_TIMESTAMP(bigint(1651681788123456789)) -> 2022-05-04T16:29:48.123456789 |
NOW() | Returns the current TIMESTAMP. | TIMESTAMP | now() as col_load_timestamp |
For the TO_TIMESTAMP function, the Ocient formatter currently does not support variable-length sections of timestamps (i.e., the data might have a different number of fractions of seconds across records).For example, assume this source data.
2024-01-01 12:00:00.1234562024-01-01 12:00:00.1234Then, you must specify the format_type as java and enter the DateTimeFormatter format string to parse the timestamps.Character Data Transformation Functions
Character data transformation functions operate onVARCHAR data to transform a string of data.
| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
|---|---|---|---|---|
CONCAT(arg1, arg2[, ...]) | Returns a string that is the result of concatenating the arguments provided. Accepts two or more arguments to concatenate. If one of the arguments to CONCAT is NULL, then CONCAT returns NULL.ℹ️ CONCAT does not support HASH or BINARY arguments in loading. | arg1: CHARarg2: CHARarg3… argN (optional) CHAR | CHAR | CONCAT($1, $2)CONCAT(TRIM($first_name), '_', TRIM($last_name)) |
SUBSTRING(X, P, L) | Extract a substring from X beginning at the position offset P with a length L.X is the source data in CHAR or BINARY data type.P is the numeric byte position in the record that starts at index 1, which indicates the position to begin reading.L is the length of data to return starting from position P.For VARCHAR data, by default, the Ocient System treats each byte in a source field reference (e.g., $your_field) as an ASCII-encoded character, not multi-byte character data. To handle multi-byte character data such as UTF-8 in a source field, you must cast X as CHAR($your_field) or use another function that returns CHAR data.P and L represent the character offset and the number of characters to retrieve from the source data. | X: CHAR or BINARYP: INTL: INT | Return type matches the type of X:* CHAR * BINARY | substring($order.postcode, 1, 5) as col_postcode |
TRIM(arg1[, arg2])BTRIM(arg1[, arg2]) | Trims the leading and trailing space characters (or optionally, characters specified in the arg2 argument) from the arg1 argument. | arg1: CHARarg2 ( optional) must be a CHAR of length 1. | CHAR | TRIM('__testTrimFunction__', '_') -> 'testTrimFunction'BTRIM(' testBtrimFunction ') -> 'testBtrimFunction' |
REPLACE(arg1, arg2, arg3) | The arg1 argument is the input string for the replace operation.The arg2 argument is the pattern to replace or the target string.The arg3 argument is the pattern for the replacement or the replacement string.This function replaces occurrences of arg2 in arg1 with arg3. | arg1: CHARarg2: CHAR arg3: CHAR | CHAR | REPLACE('testReplaceFunctionXYZ', 'X', 'Z') -> 'testReplaceFunctionZYZ' |
LENGTH(arg1)CHAR_LENGTH(arg1)CHARACTER_LENGTH(arg1) | These functions perform the same operation of returning the character length of the input string. arg1 is the input string.⚠️ When you use these functions with the SUBSTRING, RIGHT, or LEFT function, you might experience unintended results for multi-byte characters, such as Arabic in UTF-8 if you do not cast the field reference to CHAR. | CHAR | INT | LENGTH('hello world') -> 11 |
UPPER(arg1)UCASE(arg1) | Converts the characters in the input string to all uppercase characters. | CHAR | CHAR | UPPER('Hello there') -> 'HELLO THERE' |
LOWER(arg1)LCASE(arg1) | Converts the characters in the input string to all lowercase characters | CHAR | CHAR | LOWER('Obi Wan Kenobi') -> 'obi wan kenobi' |
LOCATE(arg1, arg2[, arg3])POSITION(arg1, arg2[, arg3]) | Returns the index position of the first occurrence of the character value arg1 in character value arg2.Optionally, you can also include arg3 to offset the search by the specified number of spaces. | arg1: CHARarg2: CHARarg3: LONG | INT | LOCATE('F', 'testLocateFunction') -> 11POSITION('on', 'testPositionFunction', 12) -> 19 |
LEFT(arg1, arg2) | Returns the number of characters in the arg1 string equal to the arg2 integer value. If the integer is negative, the function returns all characters except the last number of characters as specified by the arg2 value.This function treats source field references (e.g., $your_field) as single-byte ASCII characters by default, similar to the SUBSTRING function. | arg1: CHARarg2: LONG | CHAR | LEFT('testLeftFunction', 4) -> 'test'LEFT('testLeftFunction', -4) -> 'testLeftFunc' |
RIGHT(arg1, arg2) | Returns the number of trailing characters in the arg1 string equal to the arg2 integer value. If the integer is negative, the function returns all characters except the first number of characters as specified by the arg2 value.This function treats source field references (e.g., $your_field) as single-byte ASCII characters by default, similar to the SUBSTRING function. | arg1: CHARarg2: LONG | CHAR | RIGHT('green tea', 3) -> 'tea'RIGHT('coffee beans', -7) -> 'beans |
TO_CHAR(value, format) | Converts the binary value specified by value into a character value by performing an encode or decoding using the format value specified by format. | value — BINARY(N) or VARBINARYformat — CHAR with these supported values:'hex' — Encodes into a hexadecimal string with no 0x prefix and lowercase letters. 'base64' — Encodes into a base64 string with padding. This encoding is not URL-safe and does not include line breaks. 'char' — Decodes into a string with the specified CHAR encoding, which can be values such as: 'utf-8', 'utf-16le', 'utf-16be', or iso-8859-1. | CHAR | TO_CHAR(BINARY('0x6f6369656e74'), 'hex') -> '6f6369656e74'TO_CHAR(BINARY('0x6f6369656e74'), 'utf-8') -> 'ocient' |
HEX(value) | Alias for TO_CHAR(value, 'hex') | value — BINARY(N) or VARBINARY | CHAR | HEX(BINARY('0x6f6369656e74')) -> '6f6369656e74' |
TO_HASH(value, length, format) | Converts the string value specified by value into a fixed length binary value with lengh specified by length by encoding or decoding using the format value specified by format. | value — CHARlength — INTformat — CHAR with these supported values:'hex' — Decodes from a hexadecimal string with optional 0x prefix and case-insensitivity.'base64' — Decodes from a base64 string with optional padding. This decoding is not URL-safe and does not expect line breaks. char — Encodes into a string with the specified CHAR encoding, which can be values such as: 'utf-8', 'utf-16le', 'utf-16be', or iso-8859-1. | HASH(arg2) | TO_HASH('6f6369656e74', 6, 'hex') -> 0x6f6369656e74TO_HASH('ocient', 6, 'utf-8') -> 0x6f6369656e74 |
TO_VARBINARY(value, format) | Converts the string value specified by value into a variable length binary value by encoding or decoding using the format value specified by format. | See TO_HASH description. | VARBINARY | TO_VARBINARY('6f6369656e74', 'hex') -> 0x6f6369656e74TO_VARBINARY('ocient', 'utf-8') -> 0x6f6369656e74 |
UNHEX(value) | Alias for TO_VARBINARY(value, 'hex') | value: CHAR | VARBINARY | UNHEX('6f6369656e74') -> 0x6f6369656e74 |
Binary Data Transformation Functions
Binary data transformation functions operate on bytes provided when you use theBINARY format type in a pipeline. These functions are typically used in processing mainframe data outputs.
| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
|---|---|---|---|---|
FLOAT(bytes [, endianness]) | With a byte array in the BINARY format, parse bytes in standard 754 binary32 (single-precision) format. | bytes is a 4-byte array extracted from BINARY data.endianness (optional) must be a literal string. Valid values include 'big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition. | FLOAT | BINARY:FLOAT($"[5, 4]") as float_col1FLOAT($"[5, 4]", 'little') as float_col2 |
DOUBLE(bytes [, endianness]) | With a byte array in the BINARY format, parse bytes in standard IEEE 754 binary64 (double-precision) format. | bytes is an 8-byte array extracted from BINARY data.endianness (optional) must be a literal string. Valid values include 'big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition. | DOUBLE | BINARY:DOUBLE(FLOAT($"[5, 8]")) as double_col1DOUBLE($"[5, 8]", 'big') as double_col2 |
TINYINT(bytes [[, signed], endianness]) | Converts a 1-byte array that represents a signed or unsigned number to the TINYINT format. ⚠️ If you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column, the function throws an error. | bytes is a single-byte array extracted from BINARY data.signed is a string literal that indicates if the bytes represent a signed or unsigned BINARY format. Valid values include 'signed' and 'unsigned'. The default value is 'signed'.endianness is a string literal that indicates whether the bytes are in the big or little endian format. Valid values include ‘big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition. | TINYINT | BINARY:TINYINT($"[9, 1]") as tiny_1TINYINT($"[9, 1]", 'unsigned') as tiny_2TINYINT($"[9, 1]", 'signed', 'little') as tiny_3 |
SMALLINT(bytes [[, signed], endianness]) | Converts a 2-byte array that represents a signed or unsigned number to the SMALLINT format. ⚠️ The function throws an error if you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column. | bytes is a 2-byte array extracted from BINARY data.signed is a string literal that indicates if the bytes represent a signed or unsigned BINARY format. Valid values include 'signed' and 'unsigned'. The default value is 'signed'.endianness is a string literal that indicates whether the bytes are big or little in endian format. Valid values include ‘big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition. | SMALLINT | BINARY:SMALLINT($"[9, 2]") as small_1SMALLINT($"[9, 2]", 'unsigned') as small_2SMALLINT($"[9, 2]", 'signed', 'little') as small_3 |
INT(bytes [[, signed], endianness]) | Converts a 4-byte array that represents a signed or unsigned number to the INT format. ⚠️ The function throws an error if you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column. | bytes is a 4-byte array extracted from BINARY data.signed is a string literal that indicates if the bytes represent a signed or unsigned BINARY format. Valid values include 'signed' and 'unsigned'. The default value is 'signed'.endianness is a string literal that indicates whether the bytes are big or little endian format. Valid values include ‘big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition. | INT | BINARY:INT($"[9, 4]") as int_1INT($"[9, 4]", 'unsigned') as int_2INT($"[9, 4]", 'signed', 'little') as int_3 |
BIGINT(bytes [[, signed], endianness]) | Converts an 8-byte array that represents a signed or unsigned number to the BIGINT format. ⚠️ The function throws an error if you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column. | bytes is an 8-byte array extracted from BINARY data.signed is a string literal that indicates if the bytes represent a signed or unsigned binary format. Valid values include 'signed' and 'unsigned'. The default value is 'signed'.endianness is a string literal that indicates whether the bytes are big or little endian format. Valid values include ‘big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition. | BIGINT | BINARY:BIGINT($"[9, 8]") as bigint_1BIGINT($"[9, 8]", 'unsigned') as bigint_2BIGINT($"[9, 8]", 'signed', 'little') as bigint_3 |
DECIMAL(bytes, format, precision, scale [, sign_format]) | Convert a byte array that represents packed decimal or zoned decimal BINARY data to the DECIMAL format.With this transformation, you can indicate the format of the BINARY data as either 'zoned' or 'packed'. | bytes is an array of bytes extracted from BINARY data.format is a string literal that indicates the data format. Valid values include 'packed' (meaning the array contains decimal data in each nibble for two digits per byte) and 'zoned' (meaning the array contains decimal data in each nibble for one digit per byte).precision is the total number of significant digits.scale is the number of digits after the decimal point.sign_format is a string literal that indicates how the transformation should extract sign information from the bytes. Valid values include 'leading' (means that the system stores the sign in the upper nibble of the first byte for both 'zoned' and 'packed' data formats), 'leading_separate' (means the system stores the sign as a separate new first byte from the digit data), 'trailing' (means that the system stores the sign in the upper nibble of the last byte when the data format is set to 'zoned' and in the lower nibble of the last byte when set to 'packed'), or 'trailing_separate' (means the system stores the sign as a separate new last byte from the digit data). For 'leading' and 'trailing' sign information, accepted sign values are 0x7, 0xB, and 0xD for negative signs and 0xA, 0xC, 0xE, and 0xF for positive. The accepted sign values for 'leading_separate' and 'trailing_separate' sign information are 0x2D for - negative signs and 0x2B for + positive. The default value for sign_format is 'trailing' for both format data formats. | DECIMAL | BINARY:DECIMAL($"[1, 10]", 'packed', 4, 2) as revenueDECIMAL($"[1, 10]", 'zoned', 4, 2, 'leading') as revenue |
Array Data Transformation Functions
| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
|---|---|---|---|---|
ARRAY_CAP(A,N) | Restrict the length of an array to a maximum number of elements. | A is an array of any type.N is the maximum number of elements for the length of the specified array. | Array of any type | ARRAY_CAP(A, 1000) |
ARRAY_COMPACT(array) | Removes NULL values from the array. | array is an array that contains any type of elements. The array can be single or multi-dimensional. | Array of any type | ARRAY_COMPACT($ints[]) |
ARRAY_CONTAINS(array, value) | Returns true if the array contains the specified value.If the specified value is NULL, then this transform function returns NULL. | array is an array that contains any type of elements. The array can be single or multi-dimensional.value is an integer. | Boolean or NULL | ARRAY_CONTAINS($ints[], 1) |
ARRAY_DISTINCT(array) | Return all unique values in an array. For details, see ARRAY_DISTINCT. | array is an array that contains any type of elements. The array can be single or multi-dimensional. | Array of any type | ARRAY_DISTINCT($ints[]) |
ARRAY_LENGTH(array,dim) | Returns the length of the array for the specified dimension dim, which starts at index 1. If you specify a dimension that is greater than the size of the array, the function returns NULL. | array is an array that contains any type of elements. The array can be single or multi-dimensional.dim is an integer. | Integer | With $arr containing [[1,2,3], [4,5,6]], ARRAY_LENGTH($arr,2) returns 3 for the second dimension of the array. |
ARRAY_SORT(array)ARRAY_SORT(array,function) | ARRAY_SORT(array) sorts and returns the input array based on the natural ordering of its elements.If one of the array elements is NULL, then this transform function sorts the NULL values to the end of the array. ARRAY_SORT(array,function) sorts and returns the input array based on the results of the specified Lambda function.The function should have two arguments representing two elements of the array. This function should return a negative integer, 0, or a positive integer if the first element is less than, equal to, or greater than the second element, respectively. For details, see ARRAY_SORT. | For the first syntax, array is an array of elements with these types: BIGINT, BOOLEAN, BYTE, DATE, DOUBLE, FLOAT, INT, SMALLINT, UUID, or VARCHAR.For the second syntax, array is an array that contains any type of elements. The array can be single or multi-dimensional. The function argument is a Lambda function with the format (x T, y T) -> INT or another SQL reference function. | Array of any type | ARRAY_SORT($"1d"[], (x int, y int) -> CASE WHEN x = y THEN 0 WHEN INT(COALESCE(x, 1000)) < INT(COALESCE(y, 1000)) THEN -1 ELSE 1 END) sorts the array in 1d column using the Lambda function with a CASE statement. The CASE statement returns 0 when x and y are equal, returns -1 when the integer cast of the non-NULL value x is less than the corresponding value for y, and returns 1 otherwise. |
ELEMENT_AT(array,I) | Returns the element value of the array or tuple at the specified index. You can specify a negative index value as long as the absolute value of the index does not exceed the length of the array or tuple. If the index is out of bounds, the function returns NULL. | array is an array or tuple.I is a 1-based index, specified as an integer. | Any type. | $a accesses the array [1,2,3].ELEMENT_AT($a,1) returns 1.ELEMENT_AT($a,-1) returns 3.ELEMENT_AT($a,-2) returns 2.ELEMENT_AT($a,-4) returns NULL. |
FLATTEN(array) | Transforms an N-dimensional array into an N-1-dimensional array. If one of the inner array elements is NULL, this transform function returns NULL. | array is a multi-dimensional array, for example T[][], that contains any type of elements. | Array of any type that can be single or multi-dimensional. | FLATTEN($ints[][]) |
REDUCE(array, start_value, merge_function, finish_function) | Applies a merge function to a starting value and all elements in the array, and then reduces the array to a single value. Optionally, specify a finish function for the returned single value. The function applies the merge function to each element of the array with the start value to return a final value. | array is an array of any type and can be a single or multi-dimensional array.start_value is the starting value of any type.merge_function is a Lambda function or any reference function. This Lambda function has the form (x U, y T) -> U.finish_function is a Lambda function or any reference function. This Lambda function has the form (x U) -> V. | This function returns the value of type V, which can be any type, from the finish function.If you do not specify the finish function, this function returns the type U, which can be any type, from the merge function. | Assuming ints[]is INT[]of [1, 2, 3],REDUCE($ints[], 0, (acc int, x int) -> acc + x)returns the value 6.Assuming ints[]is INT[]of [1, 2, 3],REDUCE($ints[], 0, (acc int, x int) -> acc + x, BOOLEAN)returns the true value. In this case, the finish function is the BOOLEAN reference function with the input value being the resulting value from the merge function: BOOLEAN(6). |
TO_ARRAY_LENGTH(V,N) | Create the specified number of copies of any JSON object in an array. | V is any JSON object to copy into the result array.N is the number of copies to make in the result array. | Array of any type | TO_ARRAY_LENGTH(V, 3) |
Logical Operations Transformation Functions
These transformation functions execute logical operations on the data in the data pipeline.| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
|---|---|---|---|---|
IF(X,T,F) | Returns T if the expression X evaluates to true, or the function returns F if X evaluates to false.IF and IFF are function aliases. | Any type | Any type | IF('$2', '1', '2') AS output |
CASE WHEN expression1 THEN result1WHEN expression2 THEN result2...ELSE resultEND | Returns the result value based on whether an expression is true. If none of the expressions evaluate to true, then the function returns the result value after the ELSE keyword. | Any type | Any type | CASE WHEN $x THEN 'True was input in' ELSE 'False was input in' END AS toutput |
NULL_IF(X, Y) | Returns the NULL value if the two input arguments are equal.If the arguments are not equal, the function returns X. | Any type | Any type | INT(NULL_IF($32, ''))INT(NULL_IF($a.b.c, '')) |
Network Type Functions
Network type functions are also supported in loading. For example,SUBNET(IP($9),13) AS ipv6_subnet_form1 in the CREATE PIPELINE SELECT SQL statement converts the input data in the column to an IPv6-mapped IPv4 address and then computes the prefix of size 13 for that address. For the supported network type functions, see Network Type Functions.
Other Data Transformation Functions
Other data transformation functions include the extraction of values from the JSON format and miscellaneous functions such asCOALESCE, which are commonly used when working with data pipelines.
| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
|---|---|---|---|---|
COALESCE(arg1, arg2) | Returns arg1 if it is non-NULL; otherwise, the function returns arg2.arg1 and arg2 can be any data type, but their types must match. | Any type | Matches arg1 and arg2 type | COALESCE($1, $2)COALESCE(int($1), int($2)) COALESCE(bigint($1), 13) COALESCE(substring(char($1), 1, 2), TRIM($2)) |
JSON_EXTRACT_PATH_TEXT(X, D) | Extracts a value from a JSON string. | X is the JSON string for extracting a value in CHAR type. D is the key of the value to extract in CHAR type. | CHAR | JSON_EXTRACT_PATH_TEXT(JSON_UNESCAPE($real_field_key), 'b') AS col_char |
MAP_KEYS(X) | Returns the keys in the specified JSON string. | X is the JSON string. | ARRAY OF STRINGS | Input JSON string a:'{"name":"John", "age":30, "car":null}'MAP_KEYS($a)Output: ["age", "car", "name"] |
MAP_VALUES(X) | Returns the values in the specified JSON string. | X is the JSON string. | ARRAY OF STRINGS | Input JSON string a:'{"name":"John", "age":30, "car":null}'MAP_VALUES($a)Output: ["John", "30", null] |
DEFAULT Keyword
You can use theDEFAULT keyword to include the configured default value of a column in transformation functions.
Example
SQL
Lambda Functions
Lambda functions are a transformative tool in data pipelines, enabling declarative and highly flexible data transformations. By supporting per-element operations on arrays and nested data structures, lambda functions streamline data ingestion and reduce the need for post-processing. Their adoption enables you to handle complex, real-time data engineering scenarios natively. Lambda functions enable you to apply custom logic to each element of an array during ingestion. They support advanced operations like filtering and aggregation. You can embed transformation logic directly in the pipeline definition. This embedding reduces the need for additional processing steps after the load.| Expression | Description |
|---|---|
| Lambda Function Expression (T1,…,Tn) -> U | A lambda function is a temporary function that takes 0 or more named arguments as a parenthesis-enclosed list and returns an expression. The expression can reference objects defined within the closure of the expression, which includes locally named arguments that are locally defined within an enclosing scope, field references, literals, etc. For examples, see Special Data Pipeline Transformation Functions. |
| Reference Function Expression UPCASE = UPCASE(x) -> CHAR tuple = tuple(T1,…,Tn) -> tuple<<T1,…,Tn>> | A reference function points to an existing function in the Ocient function list. Specifying the function name allows you to use that function in contexts where you would otherwise need to create a full lambda function. For examples, see Special Data Pipeline Transformation Functions. |
User-Defined Data Pipeline Functions
You can create a user-defined data pipeline function using theCREATE PIPELINE FUNCTION SQL statement and Groovy. After you create data pipeline functions, you can execute them in data pipelines like any other function to transform records.
For more details, see CREATE PIPELINE FUNCTION and Data Types for User-Defined Data Pipeline Functions.

