Load Data

Transform Data in Data Pipelines

Data pipelines support data transformation during loading so that you can efficiently convert and clean source data to fit your chosen table schema. You can use two broad classes of data transformation: source field references and SQL transformation functions.

Source Field References

When you load data using a data pipeline, you reference extracted data using a special syntax called a source field reference. These expressions are either a named reference, such as $my_field, or a numeric field reference, such as $1.

In structured data formats like JSON, more complex source field references allow you to transform the structured data into the required form. For example, you can extract an array of values from an array of nested objects with an expression such as $data.items[].price.
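For illustration, here is a minimal sketch of the INTO ... SELECT portion of a pipeline for a JSON source. The table name and field paths are hypothetical, and the surrounding pipeline clauses are omitted:

```sql
-- Hypothetical INTO ... SELECT fragment for a JSON source.
INTO orders SELECT
    BIGINT($order.id)                    AS order_id,      -- named reference into nested JSON
    CHAR($order.customer.name)           AS customer_name,
    DECIMAL(8,2)[]($order.items[].price) AS item_prices    -- array extracted from nested objects
```

For a delimited (CSV) source, the same fragment would use numeric references such as $1 and $2 instead of named paths.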

To learn more about transforming data with source field references, see the sections that follow.

Supported Transformation Functions in Data Pipelines

Data pipeline loading supports numerous SQL functions for transforming data during loading. You can use these transformation functions in the INTO ... SELECT portion of a pipeline. Transformation functions generally match the syntax and behavior of the same functions in queries. The behavior might differ from query behavior in two ways:

  • Some additional functions exist that provide behavior that does not exist in queries.
  • Some existing functions have extended behavior during loading that provides important capabilities that are not in queries.

The types of transformation functions supported in loading include:

  • Scalar data conversion and casting of data types
  • Special loading transformation functions
  • Date and time transformations
  • Character data transformations
  • Binary data transformations
  • Other data transformations
  • Custom user-defined pipeline functions using Groovy

Scalar Transformation Functions and Casting

Scalar transformation functions include constructors, scalar data conversions, and data-casting functions.

Name

Description

Supported Input Type

Return Type

Example SELECT Expression

T[](X)

Casts a string representation of an array of values or the result of a JSON array extraction expression to the corresponding array with data type T where T is a SQL data type.

Multi-dimensional arrays are supported. To indicate an N-dimensional array cast, use N sets of square brackets in the casting function.

Transformation functions can apply only to an array as a whole. To apply functions on an element-wise basis, see TRANSFORM in Special Data Pipeline Transformation Functions.

CHAR

ARRAY[]

CSV:

DECIMAL(8,4)[][]($21) as decimal_2d_array_col

JSON:

CHAR[]($a.b.names[]) as names

DECIMAL(8,2)[][]($a[].b.c[].prices) as prices

CHAR[][][]($names[][][]) as names

 

BIGINT(X)

Converts the specified type to a 64-bit signed integer.

  • BIGINT
  • CHAR

BIGINT

CSV:

BIGINT($22) as int64_col

JSON:

BIGINT($a.b.c) as int64_col

BINARY(X)

Converts a string representation of bytes to the corresponding real bytes. The function expects hexadecimal input. Hexadecimal data is not case-sensitive.

Example source data:

a891, F1D890

CHAR

BINARY or VARBINARY

CSV:

BINARY($30) as varbinary_col

JSON:

BINARY($a.b.c) as varbinary_col

BOOLEAN(X)

Converts a textual representation of a Boolean to its Boolean equivalent.

You can represent true as any of these values:

't', 'true', 'yes', '1', 'y', 'on'

You can represent false as any of these values:

'f', 'false', 'no', '0', 'n', 'off'

CHAR

BOOL

CSV:

BOOLEAN($23) as bool_col

JSON:

BOOLEAN($a.b.c) as bool_col

BYTE(X)

Supports character data or byte array representations.

Interprets character data as the ASCII text representation of a byte value.

If you specify a byte array, the Ocient System uses the first element of the array.

When the system converts text data to a byte, the function throws an error if the value is greater than 127 or less than -128.

Example:

'101' -> 101

'-23' -> -23

CHAR

BYTE

CSV:

BYTE($24) as byte_col

JSON:

BYTE($a.b.c) as byte_col

CHAR(X)

Converts the specified input type to its string representation.

  • CHAR
  • BINARY
  • FLOAT
  • DOUBLE
  • INT
  • BYTE
  • SMALLINT
  • DATE
  • BOOLEAN
  • UUID
  • ARRAY

CHAR or VARCHAR

CSV:

CHAR($25) as varchar_col

CHAR($26) as char_col

JSON:

CHAR($a.b.c) as varchar_col

CHAR($a.b.c) as char_col

CHAR($array_field[]) as char_col

DECIMAL(X,P,S)

X is input data to convert to a decimal.

P is the precision of the decimal (number of significant digits) and S is the scale (number of digits after the decimal point).

You can omit P and S or only S. In that case, the Ocient System defaults P to 31 and S to 16.

Example: DECIMAL(30.0,4,2) -> 30.00

CHAR

DECIMAL

CSV:

DECIMAL($27, 8, 4) as decimal_col

JSON:

DECIMAL($a.b.c, 8, 4) as decimal_col

DOUBLE(X)

Converts the specified type to its double representation.

  • BINARY
  • CHAR

DOUBLE

CSV:

DOUBLE($28) as double_col

JSON:

DOUBLE($a.b.c) as double_col

FLOAT(X)

Converts the specified type to its float representation.

  • BINARY
  • CHAR

FLOAT

CSV:

FLOAT($29) as float_col

JSON:

FLOAT($a.b.c) as float_col

HASH(X,N)

Converts a string representation of bytes to the corresponding real bytes. The function expects hexadecimal input. Hexadecimal data is not case-sensitive.

X is the source data, such as:

a891, F1D890

N is the fixed number of bytes in the binary array.

CHAR

HASH

CSV:

HASH($23, 9) as fixed_hash_binary_col

JSON:

HASH($a.b.c, 9) as fixed_hash_binary_col

INT(X)

Converts the specified type to a 32-bit signed integer.

  • DECIMAL
  • CHAR
  • SMALLINT

INT

CSV:

INT($31) as int32_col

JSON:

INT($a.b.c) as int32_col

IP(X)

Converts the specified string to an IPv6 address equivalent.

X is the source data, such as:

2001:0db8:1234:5678:abcd:ef01:2345:6789

2001:db8:0:0:0:0:0:2

192.0.2.1

CHAR

IP

CSV:

IP($32) as ip_col

JSON:

IP($a.b.c) as ip_col

IPV4(X)

Converts the specified string to an IPv4 address equivalent.

X is the source data, such as:

192.0.2.1

CHAR

IPV4

CSV:

IPV4($32) as ipv4_col

JSON:

IPV4($a.b.c) as ipv4_col

SMALLINT(X)

Converts the specified value to its 16-bit signed integer representation.

  • DECIMAL
  • CHAR
  • BYTE

SMALLINT

CSV:

SMALLINT($33) as int16_col

JSON:

SMALLINT($a.b.c) as int16_col

ST_POINT(Long, Lat)

Accepts a longitude and latitude coordinate for a geospatial point. Arguments must be of type double or string values that can convert to double.

For WKT representation, see ST_POINTFROMTEXT.

 

  • Both DOUBLE
  • Both CHAR

ST_POINT

CSV:

ST_POINT($1, $2) as st_point_col

JSON:

ST_POINT($a.b.c, $d.e.f) as st_point_col

 

ST_POINTFROMTEXT(X)

Accepts the WKT text representation for ST_POINT with X and Y coordinates only.

Example WKT representation: 'POINT(-40.5318 11.8532)'

CHAR

ST_POINT

CSV:

st_pointfromtext($1)

JSON:

st_pointfromtext($a.b.c)

TINYINT(X)

Converts the specified value to an 8-bit signed integer.

CHAR

TINYINT

CSV:

TINYINT($34) as int8_col

JSON:

TINYINT($a.b.c) as int8_col

TUPLE <<T,[T]*>>(X,[Y]*)

Creates a tuple from the specified types.

TUPLE supports arrays of tuples, but does not support tuples within tuples.

The arguments to the TUPLE casting function must be either a single source value or match the number of types defined in the TUPLE. If you specify a single value, the Ocient System infers the individual elements of the tuple using the TUPLE format (value1,value2,value3).

Example source data:

(123,12.75) is valid text data for a TUPLE<<INT,DECIMAL(7,2)>>.

CHAR

TUPLE of type

CSV:

TUPLE<<INT,DECIMAL(7,2),TIMESTAMP,BINARY(7)>>($37) as int_decimal_timestamp_binary_tuple_col

TUPLE<<INT,DECIMAL(7,2),TIMESTAMP,BINARY(7)>>($4,$8,$1) as int_decimal_timestamp_binary_tuple_col_multi_input

JSON:

TUPLE<<INT,DECIMAL(7,2),TIMESTAMP,BINARY(7)>>($a, $b, $c) as int_decimal_timestamp_binary_tuple_col

UUID(X)

Creates a unique identifier from the specified value.

Example source data: '237e9877-e79b-12d4-a765-321741963000'

CHAR

UUID

CSV:

UUID($34) as uuid_col

JSON:

UUID($a.b.c) as uuid_col

Special Data Pipeline Transformation Functions

Special data pipeline transformation functions include functions that only exist in loading, such as the METADATA function to load special data during pipeline execution.

Name

Description

Supported Input Type

Return Type

Example SELECT Expression

METADATA(key)

Extracts the metadata value for the specified key from available metadata for the pipeline.

For supported key values, see Load Metadata Values in Data Pipelines.

CHAR

Varies based on the metadata key.

METADATA('filename') → '/path/to/file/orders_1830374.json'

EXPLODE_OUTER(array)

Expands the contents of the specified one-dimensional array.

By default, the array can contain a maximum of 8,191 elements.

For supported options, see Special Data Pipeline Transformation Functions.

ARRAY

Elements of the array in their own row.

EXPLODE_OUTER($"data.response.responses[].own") AS own

PARSE_DELIMITED_ARRAY(text, rank, array_element_delimiter, bracket_spec)

Converts a string of text data representing an array into a CHAR[].

For example, with data in the form [1;2;3], executing PARSE_DELIMITED_ARRAY('[1;2;3]', 1, ';', '[]') returns a CHAR[] with the value ['1','2','3'].

text is CHAR data that represents the text to convert to an array.

rank is an optional argument that represents the numeric rank of the output array. The default value is 1. You can only specify rank as a literal value because you cannot retrieve it dynamically from a source field selector.

array_element_delimiter is an optional argument that contains the control character for delimiting array elements. For example, ',' indicates that a comma separates each element in the text character data. The default value definition comes from the ARRAY_ELEMENT_DELIMITER value of the data pipeline.

bracket_spec is an optional argument that is the pair of control characters for the open and close array brackets. For example, '[]' indicates that each rank of the array is enclosed in square brackets. The default definition comes from the OPEN_ARRAY and CLOSE_ARRAY characters of the data pipeline. Valid options are: '', '[]', '{}', '()', and '<>'.

CHAR[]

PARSE_DELIMITED_ARRAY($text_array)

PARSE_DELIMITED_ARRAY($text_array, 1, ';', '[]')

PARSE_DELIMITED_ARRAY($two_dim_text_array, 2, ',', '[]')

TRANSFORM(T[], (T) -> U)

Transforms an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to transform. The second argument is a lambda expression (or function reference) that defines how to transform each element.

TRANSFORM returns an array of elements of the same length as the array argument of the type returned by the lambda function.

The second argument must be of the form (x T) -> U. It returns a value of type U for a specified element x. The type T of the element x is specified in the lambda function.

The second argument can also be a function reference for a single argument function.

For supported options and examples, see Special Data Pipeline Transformation Functions.

ARRAY

Lambda Function (x) -> U or Function Reference

Elements of the array transformed to the return value and type of the anonymous function.

TRANSFORM($data.names[], (el CHAR) -> LOWER(el)) as lower_case_name_array

TRANSFORM($data.names[], LOWER) as lower_case_name_array

FILTER(T[], (T) -> Boolean)

Filters elements in an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to filter. The second argument is a lambda expression (or function reference) that returns true if an element should be retained.

The second argument must be of the form (x T) -> BOOL. This returns a boolean value for a specified element x. The type T of the element x is specified in the lambda function.

Elements in the first argument array are included in the resulting array when the lambda function returns true and excluded otherwise.

For supported options and examples, see Special Data Pipeline Transformation Functions.

ARRAY

Lambda Function (x T) -> BOOL or Function Reference

The elements of the array that matched the filter criteria function.

FILTER($data.names, (name CHAR) -> length(name) > 5)

ZIP_WITH(T1[], ..., Tn[], (T1, ..., Tn) -> U)

Merges two or more input arrays, element-wise, into a single array by applying the trailing combining function func. If the arrays are not all the same length, the function pads the shorter arrays with NULL values to match the length of the longest array before applying func.

Returns an array whose elements result from applying func to the corresponding elements of each input array.

func must be of the form (x T, y U) -> V, where x is the element from array1 and y is the element from array2. Types T and U are specified in the lambda function.

For supported options and examples, see Special Data Pipeline Transformation Functions.

ARRAY

ARRAY

Lambda Function (x T, y U) -> V or Function Reference

An array of length max(length(array1), length(array2)) containing elements of type V as returned by the lambda function.

ZIP_WITH($line_items[].product_name, $line_items[].sku, (name CHAR, sku CHAR) -> concat(name, '[', sku, ']'))

Date and Time Transformation Functions

Date and time transformation functions provide utilities to parse, convert, or transform DATE, TIME, and TIMESTAMP values in pipelines.

Name

Description

Supported Input Type

Return Type

Example SELECT Expression

DATE(X)

Converts the specified value to a DATE value.

This function always creates dates from a timestamp in UTC time.

If the argument is a CHAR type, the function assumes that the value has a textual representation in the YYYY-MM-DD format.

Supports any date separator character. For example, YYYY:MM:DD is also valid.

  • CHAR
  • TIMESTAMP

DATE

CSV:

DATE($24) as date_col

DATE(TIMESTAMP($24)) as date_col

JSON:

DATE($a.b.c) as date_col

DATE(TIMESTAMP($a.b.c)) as date_col

INT_TO_DATE(arg)

Converts the specified value to a DATE.

This function creates a DATE type from an INT type representing the number of days after the epoch (January 1, 1970).

INT

DATE

INT_TO_DATE($1) AS col_date_1

MICROS_TO_TIME(arg)

Converts the specified value to the TIME type, treating the argument with microsecond precision.

BIGINT

TIME

MICROS_TO_TIME($1) AS col_time

MICROS_TO_TIMESTAMP(X)

Converts the specified value to a TIMESTAMP.

This function creates a TIMESTAMP from a BIGINT that represents the number of microseconds after the Unix epoch (January 1, 1970).

BIGINT

TIMESTAMP

MICROS_TO_TIMESTAMP($35) as timestamp_col

MILLIS_TO_TIME(arg)

Converts the specified value to the TIME type, treating the argument with millisecond precision.

Alias for TIME(X) where X is the BIGINT type.

BIGINT

TIME

MILLIS_TO_TIME($1) AS col_time

MILLIS_TO_TIMESTAMP(X)

Converts the specified value to a TIMESTAMP.

This function creates a TIMESTAMP from a BIGINT that represents the number of milliseconds after the Unix epoch (January 1, 1970).

Alias for TIMESTAMP(X) where X is the BIGINT type.

BIGINT

TIMESTAMP

MILLIS_TO_TIMESTAMP($35) as timestamp_col

NANOS_TO_TIME(X)

Converts the specified value to a TIME value.

If the argument has the BIGINT type, the function treats the number with nanosecond precision.

BIGINT

TIME

CSV:

NANOS_TO_TIME($34) as time_col

NANOS_TO_TIME(BIGINT($34)) as time_col

JSON:

NANOS_TO_TIME($a.b.c) as time_col

NANOS_TO_TIME(BIGINT($a.b.c)) as time_col

NANOS_TO_TIMESTAMP(X)

Converts the specified value to a TIMESTAMP.

This function creates a TIMESTAMP from a BIGINT that represents the number of nanoseconds after the Unix epoch (January 1, 1970).

BIGINT

TIMESTAMP

CSV:

NANOS_TO_TIMESTAMP($35) as timestamp_col

JSON:

NANOS_TO_TIMESTAMP($a.b.c) as timestamp_col

NANOS_TO_TIMESTAMP(bigint(1651681788123456789)) -> 2022-05-04T16:29:48.123456789

NOW()

Returns the current TIMESTAMP.

none

TIMESTAMP

now() as col_load_timestamp

SECONDS_TO_TIME(arg)

Converts the specified value to the TIME type, treating the argument with second precision.

BIGINT

TIME

SECONDS_TO_TIME($1) AS col_time

SECONDS_TO_TIMESTAMP(X)

Converts the specified value to a TIMESTAMP.

This function creates a TIMESTAMP from a BIGINT that represents the number of seconds after the Unix epoch (January 1, 1970).

BIGINT

TIMESTAMP

CSV:

SECONDS_TO_TIMESTAMP($35) as timestamp_col

JSON:

SECONDS_TO_TIMESTAMP($a.b.c) as timestamp_col

SECONDS_TO_TIMESTAMP(bigint(1651681788)) -> 2022-05-04T16:29:48

TIME(X)

Converts the specified value to a TIME value.

If the argument has the BIGINT type, the function treats the number with millisecond precision.

If the argument is a CHAR type, the function assumes that the value has a textual representation in the HH:mm:ss format.

Supports any separation character. For example, HH/mm/ss is also valid.

  • CHAR
  • BIGINT
  • TIMESTAMP

TIME

CSV:

TIME($34) as time_col

TIME(BIGINT($34)) as time_col

JSON:

TIME($a.b.c) as time_col

TIME(BIGINT($a.b.c)) as time_col

TIMESTAMP(X)

Converts the specified value to a TIMESTAMP.

This function creates a TIMESTAMP from a BIGINT that represents the number of milliseconds after the Unix epoch (January 1, 1970).

When you convert from the DATE type, the function defaults the TIMESTAMP value to the start of the day in UTC time.

If the argument is a CHAR type, the function assumes that the value has a textual representation in the YYYY-MM-DD HH:mm:ss[.SSSSSSSSS] format. In this case, the nanoseconds are optional.

The delimiters in the date-time value must match those in the default format.

  • CHAR
  • BIGINT
  • DATE

TIMESTAMP

CSV:

TIMESTAMP($35) as timestamp_col

JSON:

TIMESTAMP($a.b.c) as timestamp_col

TIMESTAMP('2023-03-01 18:22:58.826') -> 2023-03-01T18:22:58.826

TIMESTAMP(bigint(1651681788123)) -> 2022-05-04T16:29:48.123

TO_DATE(X,F)

Converts the specified value to a DATE.

X — The value to convert.

F — (optional) The format string for the interpretation of the textual representation of X. The default format is YYYY-MM-DD.

For details, see TO_DATE.

CHAR

DATE

to_date($3, 'YYYY/MM/DD')

to_date($a.b.c, 'DD.MM.YYYY')

TO_TIME(X,F)

Converts the specified value to a TIME value.

X — The value to convert.

F — (optional) The format string for the interpretation of the textual representation of X. The default format is HH:MI:SS.

CHAR

TIME

to_time($n, 'HH:MI:SS')

to_time($n, 'HHMISS')

TO_TIMESTAMP(X,F,FT)

Converts the specified value to a TIMESTAMP.

X — The value to convert.

F — (optional) The format string for the interpretation of the textual representation of X. The default format is YYYY-MM-DD HH:MI:SS.

FT — (optional) The format type. There are two format types: "ocient" and "java". If you use the "ocient" type, the function uses the Ocient-based formatting function to parse the timestamp. If you use the "java" type, the function uses the Java DateTimeFormatter to parse the timestamp. For details, see DateTimeFormatter (Java Platform SE 8). The default is the "ocient" type. If you specify "java", you must specify a valid DateTimeFormatter format string, as there is no default.

For details, see TO_TIMESTAMP.

CHAR

TIMESTAMP

to_timestamp($n, 'YYYY-MM-DD HH:MI:SS.US')

to_timestamp($n, 'DD.MM.YYYY HH:MI:SS.NS')

to_timestamp($n, 'uuuu-MM-dd HH:mm:ss.SSSSSS', 'java')

For the TO_TIMESTAMP function, the Ocient formatter currently does not support variable-length sections of timestamps (that is, data where the number of fractional-second digits differs across records).

For example, assume this source data.

2024-01-01 12:00:00.123456

2024-01-01 12:00:00.1234

Then, you must specify the format type FT as "java" and enter a DateTimeFormatter format string that can parse both timestamps.
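As a hedged illustration (the source field $event_time is hypothetical, and the pattern relies on DateTimeFormatter optional sections, marked with square brackets), a "java" format string can accept both fraction lengths shown above:

```sql
-- Hypothetical: accepts both 6-digit and 4-digit fractional seconds.
to_timestamp($event_time, 'uuuu-MM-dd HH:mm:ss[.SSSSSS][.SSSS]', 'java') AS ts_col
```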

Character Data Transformation Functions

Character data transformation functions operate on VARCHAR data to transform a string of data.

Name

Description

Supported Input Type

Return Type

Example SELECT Expression

CONCAT(arg1, arg2[, ...])

Returns a string that is the result of concatenating the arguments provided. Accepts two or more arguments to concatenate.

If one of the arguments to CONCAT is NULL, then CONCAT returns NULL.

ℹ️ CONCAT does not support HASH or BINARY arguments in loading.

  • arg1: CHAR
  • arg2: CHAR
  • arg3 ... argN (optional): CHAR

CHAR

CONCAT($1, $2)

CONCAT(trim($first_name), '_', trim($last_name))

SUBSTRING(X, P, L)

Extracts a substring from X beginning at the position offset P with a length L.

X is the source data in CHAR or BINARY data type.

P is the numeric byte position in the record that starts at index 1, which indicates the position to begin reading.

L is the length of data to return starting from position P.

For VARCHAR data, by default, the Ocient System treats each byte in a source field reference (e.g., $your_field) as an ASCII-encoded character, not multi-byte character data.

To handle multi-byte character data such as UTF-8 in a source field, you must cast X as CHAR($your_field) or use another function that returns CHAR data.

P and L represent the character offset and the number of characters to retrieve from the source data.

  • X: CHAR or BINARY
  • P: INT
  • L: INT

Return type matches the type of X:

  • CHAR
  • BINARY

substring($order.postcode, 1, 5) as col_postcode



TRIM(arg1[, arg2])

BTRIM(arg1[, arg2])

Trims the leading and trailing whitespace characters (or, optionally, occurrences of the character specified in the arg2 argument) from the arg1 argument.

  • arg1: CHAR
  • arg2 (optional): CHAR of length 1

CHAR

TRIM('__testTrimFunction__', '_') -> 'testTrimFunction'

BTRIM(' testBtrimFunction ') -> 'testBtrimFunction'

REPLACE(arg1, arg2, arg3)

The arg1 argument is the input string for the replace operation.

The arg2 argument is the pattern to replace or the target string.

The arg3 argument is the pattern for the replacement or the replacement string.

This function replaces occurrences of arg2 in arg1 with arg3.

  • arg1: CHAR
  • arg2: CHAR
  • arg3: CHAR

CHAR

REPLACE('testReplaceFunctionXYZ', 'X', 'Z') -> 'testReplaceFunctionZYZ'

LENGTH(arg1)

CHAR_LENGTH(arg1)

CHARACTER_LENGTH(arg1)

These functions perform the same operation of returning the character length of the input string.

arg1 is the input string.

⚠️ When you use these functions with the SUBSTRING, RIGHT, or LEFT function, you might experience unintended results for multi-byte characters, such as Arabic in UTF-8, if you do not cast the field reference to CHAR.

CHAR

INT

LENGTH('hello world') -> 11

UPPER(arg1)

UCASE(arg1)

Converts the characters in the input string to all uppercase characters.

CHAR

CHAR

UPPER('Hello there') -> 'HELLO THERE'

LOWER(arg1)

LCASE(arg1)

Converts the characters in the input string to all lowercase characters.

CHAR

CHAR

LOWER('Obi Wan Kenobi') -> 'obi wan kenobi'

LOCATE(arg1, arg2[, arg3])

POSITION(arg1, arg2[, arg3])

Returns the index position of the first occurrence of the character value arg1 in character value arg2.

Optionally, you can also include arg3 to offset the search by the specified number of spaces.

  • arg1: CHAR
  • arg2: CHAR
  • arg3: LONG

INT

LOCATE('F', 'testLocateFunction') -> 11

POSITION('on', 'testPositionFunction', 12) -> 19

LEFT(arg1, arg2)

Returns the leading characters of the arg1 string, up to the number specified by the arg2 integer value. If the integer is negative, the function returns all characters except the trailing number of characters specified by the arg2 value.

This function treats source field references (e.g., $your_field) as single-byte ASCII characters by default, similar to the SUBSTRING function.

  • arg1: CHAR
  • arg2: LONG

CHAR

LEFT('testLeftFunction', 4) -> 'test'

LEFT('testLeftFunction', -4) -> 'testLeftFunc'

RIGHT(arg1, arg2)

Returns the trailing characters of the arg1 string, up to the number specified by the arg2 integer value. If the integer is negative, the function returns all characters except the leading number of characters specified by the arg2 value.

This function treats source field references (e.g., $your_field) as single-byte ASCII characters by default, similar to the SUBSTRING function.

  • arg1: CHAR
  • arg2: LONG

CHAR

RIGHT('green tea', 3) -> 'tea'

RIGHT('coffee beans', -7) -> 'beans'

When you use the SUBSTRING, RIGHT, and LEFT functions with Delimited or JSON formatted data sources, the Ocient System does not treat a source field reference (e.g., $your_field) as multi-byte character data. Each byte is treated as a single-byte ASCII character. To handle multi-byte character data like UTF-8 in a source field, you must cast the source field using CHAR($your_field) or use another function that returns CHAR data.

When you use the LENGTH, CHAR_LENGTH, and CHARACTER_LENGTH functions with the SUBSTRING, RIGHT, and LEFT functions, you might experience unintended results for multi-byte characters, such as Arabic in UTF-8, if you do not cast the source field reference to CHAR.
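For example, here is a sketch assuming a hypothetical UTF-8 source field $product_name:

```sql
-- Without the CHAR cast, SUBSTRING and LEFT count bytes and can split a
-- multi-byte UTF-8 character; with the cast, P and L count characters.
SUBSTRING(CHAR($product_name), 1, 5) AS short_name,
LEFT(CHAR($product_name), 3)         AS name_prefix
```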

Binary Data Transformation Functions

Binary data transformation functions operate on bytes provided when you use the BINARY format type in a pipeline. These functions are typically used in processing mainframe data outputs.

Name

Description

Supported Input Type

Return Type

Example SELECT Expression

FLOAT(bytes [, endianness])

With a byte array in the BINARY format, parses the bytes in the standard IEEE 754 binary32 (single-precision) format.

  • bytes is a 4-byte array extracted from BINARY data.
  • endianness (optional) must be a literal string. Valid values include 'big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition.

FLOAT

BINARY:

FLOAT($"[5, 4]") as float_col1

FLOAT($"[5, 4]", 'little') as float_col2

DOUBLE(bytes [, endianness])

With a byte array in the BINARY format, parses the bytes in the standard IEEE 754 binary64 (double-precision) format.

 

  • bytes is an 8-byte array extracted from BINARY data.
  • endianness (optional) must be a literal string. Valid values include 'big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition.

DOUBLE

BINARY:

DOUBLE($"[5, 8]") as double_col1

DOUBLE($"[5, 8]", 'big') as double_col2

TINYINT(bytes [[, signed], endianness])

Converts a 1-byte array that represents a signed or unsigned number to the TINYINT format.

⚠️ If you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column, the function throws an error.

  • bytes is a single-byte array extracted from BINARY data.
  • signed is a string literal that indicates if the bytes represent a signed or unsigned BINARY format. Valid values include 'signed' and 'unsigned'. The default value is 'signed'.
  • endianness is a string literal that indicates whether the bytes are in the big-endian or little-endian format. Valid values include 'big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition.

TINYINT

BINARY:

TINYINT($"[9, 1]") as tiny_1

TINYINT($"[9, 1]", 'unsigned') as tiny_2

TINYINT($"[9, 1]", 'signed', 'little') as tiny_3

SMALLINT(bytes [[, signed], endianness])

Converts a 2-byte array that represents a signed or unsigned number to the SMALLINT format.

⚠️ The function throws an error if you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column.

  • bytes is a 2-byte array extracted from BINARY data.
  • signed is a string literal that indicates if the bytes represent a signed or unsigned BINARY format. Valid values include 'signed' and 'unsigned'. The default value is 'signed'.
  • endianness is a string literal that indicates whether the bytes are in the big-endian or little-endian format. Valid values include 'big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition.

SMALLINT

BINARY:

SMALLINT($"[9, 2]") as small_1

SMALLINT($"[9, 2]", 'unsigned') as small_2

SMALLINT($"[9, 2]", 'signed', 'little') as small_3

INT(bytes [[, signed], endianness])

Converts a 4-byte array that represents a signed or unsigned number to the INT format.

⚠️ The function throws an error if you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column.

  • bytes is a 4-byte array extracted from BINARY data.
  • signed is a string literal that indicates if the bytes represent a signed or unsigned BINARY format. Valid values include 'signed' and 'unsigned'. The default value is 'signed'.
  • endianness is a string literal that indicates whether the bytes are in the big-endian or little-endian format. Valid values include 'big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition.

INT

BINARY:

INT($"[9, 4]") as int_1

INT($"[9, 4]", 'unsigned') as int_2

INT($"[9, 4]", 'signed', 'little') as int_3

BIGINT(bytes [[, signed], endianness])

Converts an 8-byte array that represents a signed or unsigned number to the BIGINT format.

⚠️ The function throws an error if you do not specify the correct number of bytes or if an unsigned transform overflows a signed target column.

  • bytes is an 8-byte array extracted from BINARY data.
  • signed is a string literal that indicates if the bytes represent a signed or unsigned binary format. Valid values include 'signed' and 'unsigned'. The default value is 'signed'.
  • endianness is a string literal that indicates whether the bytes are in the big-endian or little-endian format. Valid values include 'big' and 'little'. The default value is the ENDIANNESS option in the pipeline definition.

BIGINT

BINARY:

BIGINT($"[9, 8]") as bigint_1

BIGINT($"[9, 8]", 'unsigned') as bigint_2

BIGINT($"[9, 8]", 'signed', 'little') as bigint_3

DECIMAL(bytes, format, precision, scale [, sign_format])

Converts a byte array that represents packed decimal or zoned decimal BINARY data to the DECIMAL format.

With this transformation, you can indicate the format of the BINARY data as either 'zoned' or 'packed'.



  • bytes is an array of bytes extracted from BINARY data.
  • format is a string literal that indicates the data format. Valid values include 'packed' (meaning the array contains decimal data in each nibble for two digits per byte) and 'zoned' (meaning the array contains decimal data in each nibble for one digit per byte).
  • precision is the total number of significant digits.
  • scale is the number of digits after the decimal point.
  • sign_format is a string literal that indicates how the transformation extracts sign information from the bytes. The default value is 'trailing' for both data formats. Valid values:
      • 'leading' means that the system stores the sign in the upper nibble of the first byte for both the 'zoned' and 'packed' data formats.
      • 'leading_separate' means that the system stores the sign as a separate first byte before the digit data.
      • 'trailing' means that the system stores the sign in the upper nibble of the last byte when the data format is set to 'zoned' and in the lower nibble of the last byte when set to 'packed'.
      • 'trailing_separate' means that the system stores the sign as a separate last byte after the digit data.
    For 'leading' and 'trailing' sign information, accepted sign values are 0x7, 0xB, and 0xD for negative signs and 0xA, 0xC, 0xE, and 0xF for positive signs. For 'leading_separate' and 'trailing_separate' sign information, accepted sign values are 0x2D (-) for negative signs and 0x2B (+) for positive signs.

DECIMAL

BINARY:

DECIMAL($"[1, 10]", 'packed', 4, 2) as revenue

DECIMAL($"[1, 10]", 'zoned', 4, 2, 'leading') as revenue

 

TO_BINARY(X, F, S)

Translates a string representation of bytes and converts them to real bytes.

  • X is the string representation of bytes in BINARY or HASH format.
  • F is either 'hex' for hexadecimal representations or the character set name (such as 'utf-8') for character representations.
  • S (optional) is the number of bytes that should be present in the translated value. The default 0 means that the number of bytes can vary.
  • BINARY
  • HASH



TO_BINARY($1, 'hex', 3) AS col_binary_1, TO_BINARY($2, 'utf-8') AS col_binary_2

Other Data Transformation Functions

Other data transformation functions include miscellaneous functions such as NULL_IF, which are commonly used when working with data pipelines.

Name

Description

Supported Input Type

Return Type

Example SELECT Expression

COALESCE(arg1, arg2)

Returns arg1 if it is non-NULL; otherwise, the function returns arg2. arg1 and arg2 can be any data type, but their types must match.

Any type

Matches arg1 and arg2 type

COALESCE($1, $2)

COALESCE(int($1), int($2))

COALESCE(bigint($1), 13)

COALESCE(substring(char($1), 1, 2), trim($2))

NULL_IF(X, Y)

Returns the NULL value if the two input arguments are equal.

If the arguments are not equal, the function returns X.

CHAR

CHAR

INT(NULL_IF($32, ''))

INT(NULL_IF($a.b.c, ''))

 

JSON_EXTRACT_PATH_TEXT(X, D)

Extracts a value from a JSON string.

  • X is the JSON string, of CHAR type, from which to extract a value.
  • D is the key, of CHAR type, of the value to extract.

CHAR

JSON_EXTRACT_PATH_TEXT(JSON_UNESCAPE($real_field_key), 'b') AS col_char

DEFAULT Keyword

You can use the DEFAULT keyword to include the configured default value of a column in transformation functions.


Lambda Expressions

Expression

Description

Lambda Expression

(T1,...,Tn) -> U

A lambda function is a temporary function that takes 0 or more named arguments as a parenthesis-enclosed list and returns an expression. The expression can reference objects defined within the closure of the expression, which includes locally named arguments that are locally defined within an enclosing scope, field references, literals, etc.

Reference Function

UPCASE = UPCASE(x) -> CHAR

tuple = tuple(T1,...,Tn) -> tuple<<T1,...,Tn>>

A reference function points to an existing function in the Ocient function list. Specifying the function name allows you to use that function in contexts where you would otherwise need to create a full lambda function.

User-Defined Data Pipeline Functions

You can create a user-defined data pipeline function using the CREATE PIPELINE FUNCTION SQL statement and Groovy. After you create data pipeline functions, you can execute them in data pipelines like any other function to transform records.

Related Links