Transform Data in Data Pipelines
Data pipelines support data transformation during loading so that you can efficiently convert and clean source data to fit your chosen table schema. You can use two broad classes of data transformation: source field references and SQL transformation functions.
When you load data using a data pipeline, you reference extracted data using a special syntax called a source field reference. These expressions are either named references, such as $my_field, or numeric field references, such as $1.
In structured data formats like JSON, more complex source field references allow you to reshape the data into the required form. For example, you can extract an array of values from an array of nested objects with an expression such as $data.items[].price.
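For example, given a JSON record like the one sketched below, source field references can pull out both scalar fields and nested arrays. The record shape and column aliases here are hypothetical:

```sql
-- Given a JSON record such as:
--   { "id": 17, "data": { "items": [ { "price": 9.99 }, { "price": 4.50 } ] } }
-- these expressions reference the extracted fields:
SELECT
  INT($id)                            AS id_col,     -- named scalar reference
  DECIMAL(8,2)[]($data.items[].price) AS price_array -- nested array extraction
```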
To learn more about transforming data with source field references, see the related data pipeline documentation.
Data pipeline loading supports numerous SQL functions for transforming data during loading. You can use these transformation functions in the INTO ... SELECT portion of a pipeline. Transformation functions generally match the syntax and behavior of the corresponding functions in Ocient queries. The behavior can differ from query behavior in two ways:
- Some additional functions exist that provide behavior that does not exist in queries.
- Some existing functions have extended behavior during loading that provides important capabilities that are not in queries.
The types of transformation functions supported in loading include:
- Scalar data conversion and casting of data types
- Special loading transformation functions
- Date and time transformations
- Character data transformations
- Binary data transformations
- Other data transformations
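As a sketch of how these categories combine, the following INTO ... SELECT transformation list (with hypothetical field and column names) uses one function from each category:

```sql
SELECT
  BIGINT($order_id)                   AS order_id,      -- scalar cast
  METADATA('filename')                AS source_file,   -- loading-only function
  TO_DATE($order_date, 'YYYY/MM/DD')  AS order_date,    -- date and time
  TRIM($customer_name)                AS customer_name, -- character data
  NULL_IF($coupon_code, '')           AS coupon_code    -- other
```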
Scalar transformation functions include constructors, scalar data conversions, and data-casting functions.
Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
---|---|---|---|---|
T[](X) | Casts a string representation of an array of values or the result of a JSON array extraction expression to the corresponding array with data type T where T is a SQL data type. Multi-dimensional arrays are supported. To indicate an N-dimensional array cast, use N sets of square brackets in the casting function. All transformation functions can only apply to an array as a whole. To apply functions on an element-wise basis, see TRANSFORM in Special Data Pipeline Transformation Functions. | CHAR | ARRAY[] | CSV: DECIMAL(8,4)[][]($21) as decimal_2d_array_col JSON: CHAR[]($a.b.names[]) as names DECIMAL(8,2)[][]($a[].b.c[].prices) as prices CHAR[][][]($names[][][]) as names |
BIGINT(X) | Converts the specified type to a 64-bit signed integer. | | BIGINT | CSV: BIGINT($22) as int64_col JSON: BIGINT($a.b.c) as int64_col |
BINARY(X) | Translates a string representation of bytes and converts them to real bytes. The string representation expects hexadecimal representations that convert to the matching bytes with this function. Hexadecimal data is not case-sensitive. Example source data: a891, F1D890 | CHAR | BINARY or VARBINARY | CSV: BINARY($30) as varbinary_col JSON: BINARY($a.b.c) as varbinary_col |
BOOLEAN(X) | Converts a textual representation of a Boolean to its Boolean equivalent. You can represent true as any of these values: “t”, “true”, “yes”, “1”, “y”, “on”. You can represent false as any of these values: “f”, “false”, “no”, “0”, “n”, “off”. | CHAR | BOOL | CSV: BOOLEAN($23) as bool_col JSON: BOOLEAN($a.b.c) as bool_col |
BYTE(X) | Supports character, number, or byte array representations. Interprets character data as the ASCII numeric representation of a byte. If you specify a byte array, the Ocient System uses the first element of the array. When the system converts text data to a byte, the function throws an error if the value is > 127 or < -128. Example: '101' -> 101 '-23' -> -23 | CHAR | BYTE | CSV: BYTE($24) as byte_col JSON: BYTE($a.b.c) as byte_col |
CHAR(X) | Converts the specified input type to its string representation. | | CHAR or VARCHAR | CSV: CHAR($25) as varchar_col CHAR($26) as char_col JSON: CHAR($a.b.c) as varchar_col CHAR($a.b.c) as char_col CHAR($array_field[]) as char_col |
DECIMAL(X,P,S) | X is input data to convert to a decimal. P is the precision of the decimal (number of significant digits) and S is the scale (number of digits after the decimal point). You can omit P and S or only S. In that case, the Ocient System defaults P to 31 and S to 16. Example: DECIMAL(30.0,4,2) -> 30.00 | CHAR | DECIMAL | CSV: DECIMAL($27, 8, 4) as decimal_col JSON: DECIMAL($a.b.c, 8, 4) as decimal_col |
DOUBLE(X) | Converts the specified type to its double representation. | | DOUBLE | CSV: DOUBLE($28) as double_col JSON: DOUBLE($a.b.c) as double_col |
FLOAT(X) | Converts the specified type to its float representation. | | FLOAT | CSV: FLOAT($29) as float_col JSON: FLOAT($a.b.c) as float_col |
HASH(X,N) | Translates a string representation of bytes and converts them to bytes. The string representation expects hexadecimal representations that convert to the matching bytes with this function. Hexadecimal data is not case-sensitive. X is the source data, such as: a891, F1D890 N is the fixed number of bytes in the binary array. | CHAR | HASH | CSV: HASH($23, 9) as fixed_hash_binary_col JSON: HASH($a.b.c, 9) as fixed_hash_binary_col |
INT(X) | Converts the specified type to a 32-bit signed integer. | | INT | CSV: INT($31) as int32_col JSON: INT($a.b.c) as int32_col |
IP(X) | Converts the specified string to an IPv6 address equivalent. X is the source data, such as: 2001:0db8:1234:5678:abcd:ef01:2345:6789 2001:db8:0:0:0:0:0:2 192.0.2.1 | CHAR | IP | CSV: IP($32) as ip_col JSON: IP($a.b.c) as ip_col |
IPV4(X) | Converts the specified string to an IPv4 address equivalent. X is the source data, such as: 192.0.2.1 | CHAR | IPV4 | CSV: IPV4($32) as ipv4_col JSON: IPV4($a.b.c) as ipv4_col |
SMALLINT(X) | Converts the specified value to its 16-bit signed integer representation. | | SMALLINT | CSV: SMALLINT($33) as int16_col JSON: SMALLINT($a.b.c) as int16_col |
ST_POINT(Long, Lat) | Accepts a longitude and latitude coordinate for a geospatial point. Arguments must be of type double or string values that can convert to double. For WKT representation, see ST_POINTFROMTEXT. | | ST_POINT | CSV: ST_POINT($1, $2) as st_point_col JSON: ST_POINT($a.b.c, $d.e.f) as st_point_col |
ST_POINTFROMTEXT(X) | Accepts the WKT text representation for ST_POINT with X and Y coordinates only. Example WKT representation: 'POINT(-40.5318 11.8532)' | CHAR | ST_POINT | st_pointfromtext($1) st_pointfromtext($a.b.c) |
TINYINT(X) | Converts the specified value to an 8-bit signed integer. | CHAR | TINYINT | CSV: TINYINT($34) as int8_col JSON: TINYINT($a.b.c) as int8_col |
TUPLE <<T,[T]*>>(X,[Y]*) | Creates a tuple from the specified types. TUPLE supports arrays of tuples, but does not support tuples within tuples. The arguments to the TUPLE casting function must be either a single source value or match the number of types defined in the TUPLE. If you specify a single value, the Ocient System infers individual elements of the tuple using the TUPLE format (value1,value2,value3). Example source data: (123,12.75) is valid text data for a TUPLE<<INT,DECIMAL(7,2)>>. | CHAR | TUPLE of type | CSV: TUPLE<<INT,DECIMAL(7,2),TIMESTAMP,BINARY(7)>>($37) as int_decimal_timestamp_binary_tuple_col TUPLE<<INT,DECIMAL(7,2),TIMESTAMP>>($4,$8,$1) as int_decimal_timestamp_tuple_col_multi_input JSON: TUPLE<<INT,DECIMAL(7,2),TIMESTAMP>>($a, $b, $c) as int_decimal_timestamp_tuple_col |
UUID(X) | Creates a unique identifier from the specified value. Example source data: '237e9877-e79b-12d4-a765-321741963000' | CHAR | UUID | CSV: UUID($34) as uuid_col JSON: UUID($a.b.c) as uuid_col |
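As an illustration, a transformation list might combine several of these casts. The CSV field positions and column names below are hypothetical:

```sql
SELECT
  BOOLEAN($1)                    AS is_active,   -- accepts 't', 'yes', '1', ...
  DECIMAL($2, 8, 4)              AS unit_price,
  CHAR[]($3)                     AS tags,        -- one-dimensional array cast
  TUPLE<<INT,DECIMAL(7,2)>>($4)  AS qty_price,   -- parses text such as (123,12.75)
  UUID($5)                       AS record_uuid
```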
Special data pipeline transformation functions include functions that only exist in loading, such as the METADATA function to load special data during pipeline execution.
Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
---|---|---|---|---|
METADATA(key) | Extracts the metadata value for the specified key from available metadata for the pipeline. For supported key values, see Load Metadata Values in Data Pipelines. | CHAR | Varies based on the metadata key. | METADATA('filename') → '/path/to/file/orders_1830374.json' |
EXPLODE_OUTER(array) | Expands the contents of the specified one-dimensional array. By default, the array can contain a maximum of 8,191 elements. For supported options, see Special Data Pipeline Transformation Functions. | ARRAY | Elements of the array in their own row. | EXPLODE_OUTER($"data.response.responses[].own") AS own |
PARSE_DELIMITED_ARRAY(text, rank, array_element_delimiter, bracket_spec) | Converts a string of text data representing an array into a CHAR[]. For example, with data in the form [1;2;3], executing PARSE_DELIMITED_ARRAY('[1;2;3]', 1, ';', '[]') returns a CHAR[] with the value ['1','2','3']. | text is CHAR data that represents the text to convert to an array. rank is an optional argument that represents the numeric rank of the output array. The default value is 1. You can only specify rank as a literal value because you cannot retrieve it dynamically from a source field selector. array_element_delimiter is an optional argument that contains the control character for delimiting array elements. For example, ',' indicates that a comma separates each element in the text character data. The default value definition comes from the ARRAY_ELEMENT_DELIMITER value of the data pipeline. bracket_spec is an optional argument that is the pair of control characters for the open and close array brackets. For example, '[]' indicates that each rank of the array is enclosed in square brackets. The default definition comes from the OPEN_ARRAY and CLOSE_ARRAY characters of the data pipeline. Valid options are: '', '[]', '{}', '()', and '<>'. | CHAR[] | PARSE_DELIMITED_ARRAY($text_array) PARSE_DELIMITED_ARRAY($text_array, 1, ';', '[]') PARSE_DELIMITED_ARRAY($two_dim_text_array, 2, ',', '[]') |
TRANSFORM(T[], (T) -> U) | Transforms an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to transform. The second argument is a lambda expression (or function reference) that defines how to transform each element. TRANSFORM returns an array of elements of the same length as the array argument of the type returned by the lambda function. The second argument must be of the form (x T) -> U . This returns a value of type U for a specified element x. The type T of the element x is specified in the lambda function. The second argument can also be a function reference for a single argument function. For supported options and examples, see Special Data Pipeline Transformation Functions. | ARRAY Lambda Function (x) -> U or Function Reference | Elements of the array transformed to the return value and type of the anonymous function. | TRANSFORM($data.names[], (el CHAR) -> LOWER(el)) as lower_case_name_array TRANSFORM($data.names[], LOWER) as lower_case_name_array |
FILTER(T[], (T) -> Boolean) | Filters elements in an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to filter. The second argument is a lambda expression (or function reference) that returns true if an element should be retained. The second argument must be of the form (x T) -> BOOL. This returns a boolean value for a specified element x. The type T of the element x is specified in the lambda function. Elements in the first argument array are included in the resulting array when the lambda function returns true and excluded otherwise. For supported options and examples, see Special Data Pipeline Transformation Functions. | ARRAY Lambda Function (x T) -> BOOL or Function Reference | The elements of the array that matched the filter criteria function. | FILTER($data.names, (name CHAR) -> length(name) > 5) |
ZIP_WITH(T1[], ..., Tn[], (T1, ..., Tn) -> U) | Merges two or more input arrays, element-wise, into a single array using the trailing combining function. If the arrays are not the same length, the function pads the shorter arrays with NULL values to match the length of the longest array before applying the combining function. The combining function must be of the form (x T, y U) -> V, where x is an element from the first array and y is an element from the second. Types T and U are specified in the lambda function. For supported options and examples, see Special Data Pipeline Transformation Functions. | ARRAY ARRAY Lambda Function (x T, y U) -> V or Function Reference | An array of length max(length(array1), length(array2)) containing elements of type V as returned by the lambda function. | ZIP_WITH($line_items[].product_name, $line_items[].sku, (name CHAR, sku CHAR) -> concat(name, '[', sku, ']')) |
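A sketch of how these loading-only array functions might appear together in one transformation list, using hypothetical JSON field names:

```sql
SELECT
  EXPLODE_OUTER($data.items[].sku)                        AS sku,         -- one output row per array element
  TRANSFORM($data.names[], (el CHAR) -> UPPER(el))        AS upper_names, -- element-wise transform
  FILTER($data.names[], (name CHAR) -> LENGTH(name) > 5)  AS long_names,  -- keep only matching elements
  ZIP_WITH($skus[], $prices[], (s CHAR, p CHAR) -> CONCAT(s, ':', p)) AS sku_price_pairs
```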
Date and time transformation functions provide utilities to parse, convert, or transform DATE, TIME, and TIMESTAMP values in pipelines.
Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
---|---|---|---|---|
DATE(X) | Converts the specified value to a DATE value. This function always creates dates from a timestamp in UTC time. If the argument is a CHAR type, the function assumes that the value has a textual representation in the YYYY-MM-DD format. Supports any date separator character. For example, YYYY:MM:DD is also valid. | | DATE | CSV: DATE($24) as date_col DATE(TIMESTAMP($24)) as date_col JSON: DATE($a.b.c) as date_col DATE(TIMESTAMP($a.b.c)) as date_col |
INT_TO_DATE(arg) | Converts the specified value to a DATE. This function creates a DATE type from an INT type representing the number of days after the epoch (January 1, 1970). | INT | DATE | INT_TO_DATE($1) AS col_date_1 |
MICROS_TO_TIME(arg) | Converts the specified value to the TIME type, treating the argument with microsecond precision. | BIGINT | TIME | MICROS_TO_TIME($1) AS col_time |
MICROS_TO_TIMESTAMP(X) | Converts the specified value to a TIMESTAMP. This function creates a TIMESTAMP from a BIGINT that represents the number of microseconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | MICROS_TO_TIMESTAMP($35) as timestamp_col |
MILLIS_TO_TIME(arg) | Converts the specified value to the TIME type, treating the argument with millisecond precision. Alias for TIME(X) where X is the BIGINT type. | BIGINT | TIME | MILLIS_TO_TIME($1) AS col_time |
MILLIS_TO_TIMESTAMP(X) | Converts the specified value to a TIMESTAMP. This function creates a TIMESTAMP from a BIGINT that represents the number of milliseconds after the Unix epoch (January 1, 1970). Alias for TIMESTAMP(X) where X is the BIGINT type. | BIGINT | TIMESTAMP | MILLIS_TO_TIMESTAMP($35) as timestamp_col |
NANOS_TO_TIME(X) | Converts the specified value to a TIME value. If the argument has the BIGINT type, the function treats the number with nanosecond precision. | BIGINT | TIME | CSV: NANOS_TO_TIME($34) as time_col NANOS_TO_TIME(BIGINT($34)) as time_col JSON: NANOS_TO_TIME($a.b.c) as time_col NANOS_TO_TIME(BIGINT($a.b.c)) as time_col |
NANOS_TO_TIMESTAMP(X) | Converts the specified value to a TIMESTAMP. This function creates a TIMESTAMP from a BIGINT that represents the number of nanoseconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | CSV: NANOS_TO_TIMESTAMP($35) as timestamp_col JSON: NANOS_TO_TIMESTAMP($a.b.c) as timestamp_col NANOS_TO_TIMESTAMP(bigint(1651681788123456789)) -> 2022-05-04T16:29:48.123456789 |
NOW() | Returns the current TIMESTAMP. | none | TIMESTAMP | now() as col_load_timestamp |
SECONDS_TO_TIME(arg) | Converts the specified value to the TIME type, treating the argument with second precision. | BIGINT | TIME | SECONDS_TO_TIME($1) AS col_time |
SECONDS_TO_TIMESTAMP(X) | Converts the specified value to a TIMESTAMP. This function creates a TIMESTAMP from a BIGINT that represents the number of seconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | CSV: SECONDS_TO_TIMESTAMP($35) as timestamp_col JSON: SECONDS_TO_TIMESTAMP($a.b.c) as timestamp_col SECONDS_TO_TIMESTAMP(bigint(1651681788)) -> 2022-05-04T16:29:48 |
TIME(X) | Converts the specified value to a TIME value. If the argument has the BIGINT type, the function treats the number with millisecond precision. If the argument is a CHAR type, the function assumes that the value has a textual representation in the HH:mm:ss format. Supports any separation character. For example, HH/mm/ss is also valid. | | TIME | CSV: TIME($34) as time_col TIME(BIGINT($34)) as time_col JSON: TIME($a.b.c) as time_col TIME(BIGINT($a.b.c)) as time_col |
TIMESTAMP(X) | Converts the specified value to a TIMESTAMP. This function creates a TIMESTAMP from a BIGINT that represents the number of milliseconds after the Unix epoch (January 1, 1970). When you convert from the DATE type, the function defaults the TIMESTAMP value to the start of the day in UTC time. If the argument is a CHAR type, the function assumes that the value has a textual representation in the YYYY-MM-DD HH:mm:ss[.SSSSSSSSS] format. In this case, the nanoseconds are optional. The delimiters in the date-time value cannot be in a different format than the default value. | | TIMESTAMP | CSV: TIMESTAMP($35) as timestamp_col JSON: TIMESTAMP($a.b.c) as timestamp_col TIMESTAMP('2023-03-01 18:22:58.826') -> 2023-03-01T18:22:58.826 TIMESTAMP(bigint(1651681788123)) -> 2022-05-04T16:29:48.123 |
TO_DATE(X,F) | Converts the specified value to a DATE. X — The value to convert. F — (optional) The format string for the interpretation of the textual representation of X. The default format is YYYY-MM-DD. For details, see TO_DATE. | CHAR | DATE | to_date($3, 'YYYY/MM/DD') to_date($a.b.c, 'DD.MM.YYYY') |
TO_TIME(X,F) | Converts the specified value to a TIME value. X — The value to convert. F — (optional) The format string for the interpretation of the textual representation of X. The default format is HH:MI:SS. | CHAR | TIME | to_time($n, 'HH:MI:SS') to_time($n, 'HHMISS') |
TO_TIMESTAMP(X,F,FT) | Converts the specified value to a TIMESTAMP. X — The value to convert. F — (optional) The format string for the interpretation of the textual representation of X. The default format is YYYY-MM-DD HH:MI:SS. FT — (optional) The format type. There are two format types: "ocient" and "java". If you use the "ocient" type, the function uses the Ocient-based formatting function to parse the timestamp. If you use the "java" type, the function uses the Java DateTimeFormatter to parse the timestamp. For details, see DateTimeFormatter (Java Platform SE 8). The default is the "ocient" type. If you specify "java", you must specify a valid DateTimeFormatter format string, as there is no default. For details, see TO_TIMESTAMP. | CHAR | TIMESTAMP | to_timestamp($n, 'YYYY-MM-DD HH:MI:SS.US') to_timestamp($n, 'DD.MM.YYYY HH:MI:SS.NS') to_timestamp($n, 'uuuu-MM-dd HH:mm:ss.SSSSSS', 'java') |
For the TO_TIMESTAMP function, the Ocient formatter currently does not support variable-length sections of timestamps (that is, data where the number of fractional-second digits differs across records).
For example, assume this source data.
2024-01-01 12:00:00.123456
2024-01-01 12:00:00.1234
In this case, you must specify the format_type as "java" and provide a DateTimeFormatter format string that parses the timestamps.
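One way to parse both records above is a "java" format string with optional sections for the fractional digits. The bracketed optional-section syntax comes from DateTimeFormatter; the field name is hypothetical:

```sql
-- Matches both .123456 (6 digits) and .1234 (4 digits)
to_timestamp($event_time, 'uuuu-MM-dd HH:mm:ss[.SSSSSS][.SSSS]', 'java')
```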
Character data transformation functions operate on VARCHAR data to transform a string of data.
Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
---|---|---|---|---|
CONCAT(arg1, arg2[, ...]) | Returns a string that is the result of concatenating the arguments provided. Accepts two or more arguments to concatenate. If one of the arguments to CONCAT is NULL, then CONCAT returns NULL. ℹ️ CONCAT does not support HASH or BINARY arguments in loading. | | CHAR | CONCAT($1, $2) CONCAT(trim($first_name), '_', trim($last_name)) |
SUBSTRING(X, P, L) | Extracts a substring from X beginning at position P with length L. X is the source data of the CHAR or BINARY data type. P is the numeric byte position in the record, starting at index 1, that indicates where to begin reading. L is the length of data to return starting from position P. For VARCHAR data, by default, the Ocient System treats each byte in a source field reference (e.g., $your_field) as an ASCII-encoded character, not multi-byte character data. To handle multi-byte character data such as UTF-8 in a source field, you must cast X as CHAR($your_field) or use another function that returns CHAR data. In that case, P and L represent the character offset and the number of characters to retrieve from the source data. | | Return type matches the type of X | substring($order.postcode, 1, 5) as col_postcode |
TRIM(arg1[, arg2]) BTRIM(arg1[, arg2]) | Trims the leading and trailing whitespace characters (or optionally, characters specified in the arg2 argument) from the arg1 argument. | | CHAR | TRIM('__testTrimFunction__', '_') -> 'testTrimFunction' BTRIM(' testBtrimFunction ') -> 'testBtrimFunction' |
REPLACE(arg1, arg2, arg3) | Replaces occurrences of arg2 in arg1 with arg3. The arg1 argument is the input string for the replace operation. The arg2 argument is the target pattern to replace. The arg3 argument is the replacement string. | | CHAR | REPLACE('testReplaceFunctionXYZ', 'X', 'Z') -> 'testReplaceFunctionZYZ' |
LENGTH(arg1) CHAR_LENGTH(arg1) CHARACTER_LENGTH(arg1) | These functions perform the same operation of returning the character length of the input string. arg1 is the input string. ⚠️ When you use these functions with the SUBSTRING, RIGHT, or LEFT function, you might experience unintended results for multi-byte characters, such as Arabic in UTF-8 if you do not cast the field reference to CHAR. | CHAR | INT | LENGTH('hello world') -> 11 |
UPPER(arg1) UCASE(arg1) | Converts the characters in the input string to all uppercase characters. | CHAR | CHAR | UPPER('Hello there') -> 'HELLO THERE' |
LOWER(arg1) LCASE(arg1) | Converts the characters in the input string to all lowercase characters. | CHAR | CHAR | LOWER('Obi Wan Kenobi') -> 'obi wan kenobi' |
LOCATE(arg1, arg2[, arg3]) POSITION(arg1, arg2[, arg3]) | Returns the index position of the first occurrence of the character value arg1 in the character value arg2. Optionally, you can include arg3 to offset the start of the search by the specified number of characters. | | INT | LOCATE('F', 'testLocateFunction') -> 11 POSITION('on', 'testPositionFunction', 12) -> 19 |
LEFT(arg1, arg2) | Returns the first arg2 characters of the arg1 string. If arg2 is negative, the function returns all characters except the last arg2 characters. This function treats source field references (e.g., $your_field) as single-byte ASCII characters by default, similar to the SUBSTRING function. | | CHAR | LEFT('testLeftFunction', 4) -> 'test' LEFT('testLeftFunction', -4) -> 'testLeftFunc' |
RIGHT(arg1, arg2) | Returns the last arg2 characters of the arg1 string. If arg2 is negative, the function returns all characters except the first arg2 characters. This function treats source field references (e.g., $your_field) as single-byte ASCII characters by default, similar to the SUBSTRING function. | | CHAR | RIGHT('green tea', 3) -> 'tea' RIGHT('coffee beans', -7) -> 'beans' |
When you use the SUBSTRING, RIGHT, and LEFT functions with Delimited or JSON formatted data sources, the Ocient System does not treat a source field reference (e.g., $your_field) as multi-byte character data. Each byte is treated as a single-byte ASCII character. To handle multi-byte character data like UTF-8 in a source field, you must cast the source field using CHAR($your_field) or use another function that returns CHAR data.
When you use the LENGTH, CHAR_LENGTH, and CHARACTER_LENGTH functions with the SUBSTRING, RIGHT, and LEFT functions, you might experience unintended results for multi-byte characters, such as Arabic in UTF-8 if you do not cast the source field reference to CHAR.
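The difference is easiest to see side by side. A sketch with a hypothetical UTF-8 field:

```sql
SELECT
  SUBSTRING(CHAR($city_name), 1, 3) AS city_prefix, -- first 3 characters
  SUBSTRING($city_name, 1, 3)       AS city_bytes   -- first 3 bytes; can split a multi-byte character
```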
Binary data transformation functions operate on bytes provided when you use the BINARY format type in a pipeline. These functions are typically used in processing mainframe data outputs.
Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
---|---|---|---|---|
FLOAT(bytes [, endianness]) | With a byte array in the BINARY format, parses bytes in the standard IEEE 754 binary32 (single-precision) format. | | FLOAT | BINARY: FLOAT($"[5, 4]") as float_col1 FLOAT($"[5, 4]", 'little') as float_col2 |
DOUBLE(bytes [, endianness]) | With a byte array in the BINARY format, parses bytes in the standard IEEE 754 binary64 (double-precision) format. | | DOUBLE | BINARY: DOUBLE($"[5, 8]") as double_col1 DOUBLE($"[5, 8]", 'big') as double_col2 |
TINYINT(bytes [[, signed], endianness]) | Converts a 1-byte array that represents a signed or unsigned number to the TINYINT format. ⚠️ The function throws an error if you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column. | | TINYINT | BINARY: TINYINT($"[9, 1]") as tiny_1 TINYINT($"[9, 1]", 'unsigned') as tiny_2 TINYINT($"[9, 1]", 'signed', 'little') as tiny_3 |
SMALLINT(bytes [[, signed], endianness]) | Converts a 2-byte array that represents a signed or unsigned number to the SMALLINT format. ⚠️ The function throws an error if you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column. | | SMALLINT | BINARY: SMALLINT($"[9, 2]") as small_1 SMALLINT($"[9, 2]", 'unsigned') as small_2 SMALLINT($"[9, 2]", 'signed', 'little') as small_3 |
INT(bytes [[, signed], endianness]) | Converts a 4-byte array that represents a signed or unsigned number to the INT format. ⚠️ The function throws an error if you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column. | | INT | BINARY: INT($"[9, 4]") as int_1 INT($"[9, 4]", 'unsigned') as int_2 INT($"[9, 4]", 'signed', 'little') as int_3 |
BIGINT(bytes [[, signed], endianness]) | Converts an 8-byte array that represents a signed or unsigned number to the BIGINT format. ⚠️ The function throws an error if you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column. | | BIGINT | BINARY: BIGINT($"[9, 8]") as bigint_1 BIGINT($"[9, 8]", 'unsigned') as bigint_2 BIGINT($"[9, 8]", 'signed', 'little') as bigint_3 |
DECIMAL(bytes, format, precision, scale [, sign_format]) | Converts a byte array that represents packed decimal or zoned decimal BINARY data to the DECIMAL format. With this transformation, you can indicate the format of the BINARY data as either 'zoned' or 'packed'. | | DECIMAL | BINARY: DECIMAL($"[1, 10]", 'packed', 4, 2) as revenue DECIMAL($"[1, 10]", 'zoned', 4, 2, 'leading') as revenue |
TO_BINARY(X, F, S) | Translates a string representation of bytes and converts them to real bytes. | | | TO_BINARY($1, 'hex', 3) AS col_binary_1 TO_BINARY($2, 'utf-8') AS col_binary_2 |
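For example, a fixed-layout mainframe record might be decoded with byte-range references of the form $"[offset, length]". The offsets, lengths, and column names below are hypothetical:

```sql
SELECT
  INT($"[0, 4]")                     AS account_id,  -- 4-byte signed integer
  SMALLINT($"[4, 2]", 'unsigned')    AS region_code, -- 2-byte unsigned integer
  DECIMAL($"[6, 5]", 'packed', 9, 2) AS balance      -- packed decimal: 9 digits, scale 2
```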
Other data transformation functions include miscellaneous functions such as NULL_IF, which are commonly used when working with data pipelines.
Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
---|---|---|---|---|
COALESCE(arg1, arg2) | Returns arg1 if it is non-NULL; otherwise, the function returns arg2. arg1 and arg2 can be any data type, but their types must match. | Any type | Matches arg1 and arg2 type | COALESCE($1, $2) COALESCE(int($1), int($2)) COALESCE(bigint($1), 13) COALESCE(substring(char($1), 1, 2), trim($2)) |
NULL_IF(X, Y) | Returns the NULL value if the two input arguments are equal. If the arguments are not equal, the function returns X. | CHAR | CHAR | INT(NULL_IF($32, '')) INT(NULL_IF($a.b.c, '')) |
JSON_EXTRACT_PATH_TEXT(X, D) | Extracts a value from a JSON string. | | CHAR | JSON_EXTRACT_PATH_TEXT(JSON_UNESCAPE($real_field_key), 'b') AS col_char |
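A common pattern combines these functions to clean sparse source data. The field names here are hypothetical:

```sql
SELECT
  COALESCE(TRIM($nickname), TRIM($first_name)) AS display_name, -- first non-NULL value
  INT(NULL_IF($age, ''))                       AS age           -- load empty strings as NULL
```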
You can use the DEFAULT keyword to include the configured default value of a column in transformation functions.
Expression | Description |
---|---|
Lambda Expression (T1,...,Tn) -> U | A lambda function is a temporary function that takes 0 or more named arguments as a parenthesis-enclosed list and returns an expression. The expression can reference objects defined within the closure of the expression, which includes locally named arguments that are locally defined within an enclosing scope, field references, literals, etc. For examples, see Special Data Pipeline Transformation Functions. |
Reference Function UPCASE = UPCASE(x) -> CHAR tuple = tuple(T1,...,Tn) -> tuple<<T1,...,Tn>> | A reference function points to an existing function in the Ocient function list. Specifying the function name allows you to use that function in contexts where you would otherwise need to create a full lambda function. For examples, see Special Data Pipeline Transformation Functions. |
You can create a user-defined data pipeline function using the CREATE PIPELINE FUNCTION SQL statement and Groovy. After you create data pipeline functions, you can execute them in data pipelines like any other function to transform records.
For more details, see CREATE PIPELINE FUNCTION and Data Types for User-Defined Data Pipeline Functions.