# Transform Data in Data Pipelines
Data pipeline functionality supports data transformation during loading to efficiently convert and clean source data for your chosen table schema. You can use two broad classes of data transformation: source field references or SQL transformation functions.

## Source Field References

When you load data using a data pipeline, you reference extracted data using a special syntax named a source field reference. These expressions are either a named reference such as `$my_field` or a numeric field reference such as `$1`. In structured data formats like JSON, more complex source field references allow you to transform the structured data into the required form. For example, you can extract an array of values from an array of nested objects with an expression such as `$data.items[].price`.

To learn more about transforming data with source field references, see Data Formats for Data Pipelines (docid: 1 jhzrhblgnqucl6skiaq) and Data Types for Data Pipelines (docid: 7s5nztl8lolpwt2pcnjyz).

## Supported Transformation Functions in Data Pipelines

Data pipeline loading supports numerous SQL functions for use in transforming data during loading. You can use these transformation functions in the `INTO ... SELECT` portion of a pipeline. Transformation functions generally match the syntax and behavior of the function you use when you query in Ocient. The behavior might differ from querying behavior in two ways:

- Some additional functions exist that provide behavior that does not exist in queries.
- Some existing functions have extended behavior during loading that provides important capabilities that are not in queries.

The types of transformation functions supported in loading include:

- Scalar data conversion and casting of data types
- Special loading transformation functions
- Date and time transformations
- Character data transformations
- Binary data transformations
- Array data transformations
- Logical operations transformations
- Network type functions
- Other data transformations
- Custom user-defined functions written in Groovy (see Data Pipelines (docid: l8tdfpfzzvzeyabc2h7bq))
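For orientation, the following fragment sketches how source field references and transformation functions might appear together in the `INTO ... SELECT` portion of a pipeline. It is illustrative only: the table name and field references are hypothetical, and the surrounding `CREATE PIPELINE`, source, and format clauses are omitted.

```sql
-- Hypothetical INTO ... SELECT fragment of a CREATE PIPELINE statement.
-- Table and field names are invented; source and format clauses are omitted.
INTO orders SELECT
    $1                   AS order_id,     -- numeric field reference (delimited data)
    $customer_name       AS customer,     -- named field reference
    $address.city        AS city,         -- nested JSON object reference
    $data.items[].price  AS item_prices   -- array extracted from nested JSON objects
```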
### Scalar Transformation Functions and Casting

Scalar transformation functions include constructors, scalar data conversions, and data casting functions.

| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
| --- | --- | --- | --- | --- |
| `t[](x)` | Casts a string representation of an array of values, or the result of a JSON array extraction expression, to the corresponding array with data type `t`, where `t` is a SQL data type. Multi-dimensional arrays are supported; to indicate an n-dimensional array cast, use n sets of square brackets in the casting function. All transformation functions can only apply to an array as a whole. To apply functions on an element-wise basis, see `transform` in Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu). | CHAR | `t[]` | CSV: `decimal(8,4)[][]($21) AS decimal_2d_array_col`<br>JSON: `char[]($a.b.names[]) AS names`<br>`decimal(8,2)[][]($a[].b.c[].prices) AS prices`<br>`char[][][]($names[][][]) AS names` |
| `bigint(x)` | Converts the specified type to a 64-bit signed integer. | CHAR | BIGINT | CSV: `bigint($22) AS int64_col`<br>JSON: `bigint($a.b.c) AS int64_col` |
| `binary(x)` | Translates a string representation of bytes and converts them to real bytes. The string representation expects hexadecimal representations that convert to the matching bytes with this function. Hexadecimal data is not case-sensitive. Example source data: `'A891'`, `'F1D890'`. | CHAR | BINARY or VARBINARY | CSV: `binary($30) AS varbinary_col`<br>JSON: `binary($a.b.c) AS varbinary_col` |
| `boolean(x)` | Handles textual representations of Booleans and converts them to the Boolean equivalent. You can represent true as any of these values: 't', 'true', 'yes', '1', 'y', 'on'. You can represent false as any of these values: 'f', 'false', 'no', '0', 'n', 'off'. | CHAR | BOOL | CSV: `boolean($23) AS bool_col`<br>JSON: `boolean($a.b.c) AS bool_col` |
| `byte(x)` | Supports characters and number or byte array representations. Interprets character data as the ASCII numeric representation of a byte. If you specify a byte array, the Ocient system uses the first element of the array. When the system converts text data to a byte, the function throws an error if the value is > 127 or < -128. Examples: `'101'` → 101, `'-23'` → -23. | CHAR | BYTE | CSV: `byte($24) AS byte_col`<br>JSON: `byte($a.b.c) AS byte_col` |
| `char(x)` | Converts the specified input type to its string representation. | CHAR, BINARY, FLOAT, DOUBLE, INT, BYTE, SMALLINT, DATE, BOOLEAN, UUID, ARRAY | CHAR or VARCHAR | CSV: `char($25) AS varchar_col`<br>`char($26) AS char_col`<br>JSON: `char($a.b.c) AS varchar_col`<br>`char($a.b.c) AS char_col`<br>`char($array_field[]) AS char_col` |
| `decimal(x,p,s)` | `x` is input data to convert to a decimal, `p` is the precision of the decimal (number of significant digits), and `s` is the scale (number of digits after the decimal point). You can omit `p` and `s`, or only `s`; in that case, the Ocient system defaults `p` to 31 and `s` to 16. Example: `decimal(30.0, 4, 2)` → 30.00. | CHAR | DECIMAL | CSV: `decimal($27, 8, 4) AS decimal_col`<br>JSON: `decimal($a.b.c, 8, 4) AS decimal_col` |
| `double(x)` | Converts the specified type to its double representation. | BINARY, CHAR | DOUBLE | CSV: `double($28) AS double_col`<br>JSON: `double($a.b.c) AS double_col` |
| `float(x)` | Converts the specified type to its float representation. | BINARY, CHAR | FLOAT | CSV: `float($29) AS float_col`<br>JSON: `float($a.b.c) AS float_col` |
| `hash(x,n)` | Translates a string representation of bytes and converts them to bytes. The string representation expects hexadecimal representations that convert to the matching bytes with this function. Hexadecimal data is not case-sensitive. `x` is the source data, such as `'A891'` or `'F1D890'`. `n` is the fixed number of bytes in the binary array. | CHAR | HASH | CSV: `hash($23, 9) AS fixed_hash_binary_col`<br>JSON: `hash($a.b.c, 9) AS fixed_hash_binary_col` |
| `int(x)` | Converts the specified type to a 32-bit signed integer. | DECIMAL, CHAR, SMALLINT | INT | CSV: `int($31) AS int32_col`<br>JSON: `int($a.b.c) AS int32_col` |
| `ip(x)` | Converts the specified string to an IPv6 address equivalent. `x` is the source data, such as `2001:0db8:1234:5678:abcd:ef01:2345:6789`, `2001:db8:0:0:0:0:0:2`, or `192.0.2.1`. | CHAR | IP | CSV: `ip($32) AS ip_col`<br>JSON: `ip($a.b.c) AS ip_col` |
| `ipv4(x)` | Converts the specified string to an IPv4 address equivalent. `x` is the source data, such as `192.0.2.1`. | CHAR | IPV4 | CSV: `ipv4($32) AS ipv4_col`<br>JSON: `ipv4($a.b.c) AS ipv4_col` |
| `smallint(x)` | Converts the specified value to its 16-bit signed integer representation. | DECIMAL, CHAR, BYTE | SMALLINT | CSV: `smallint($33) AS int16_col`<br>JSON: `smallint($a.b.c) AS int16_col` |
| `st_point(long, lat)` | Accepts a longitude and latitude coordinate for a geospatial point. Arguments must be of type DOUBLE or string values that can convert to DOUBLE. For WKT representation, see Point Constructors (docid: c aaugo19qr znvjihgsy). | both DOUBLE or both CHAR | ST_POINT | CSV: `st_point($1, $2) AS st_point_col`<br>JSON: `st_point($a.b.c, $d.e.f) AS st_point_col` |
| `st_pointfromtext(x)` | Accepts the WKT text representation for ST_POINT with X and Y coordinates only. Example WKT representation: `'POINT(-40.5318 11.8532)'`. | CHAR | ST_POINT | `st_pointfromtext($1)`<br>`st_pointfromtext($a.b.c)` |
| `tinyint(x)` | Converts the specified value to an 8-bit signed integer. | CHAR | TINYINT | CSV: `tinyint($34) AS int8_col`<br>JSON: `tinyint($a.b.c) AS int8_col` |
| `tuple<<t[,t...]>>(x[,y...])` | Creates a tuple from the specified types. Tuple supports arrays of tuples, but does not support tuples within tuples. The arguments to the tuple casting function can be either a single source value or must match the number of types defined in the tuple. If you specify a single value, the Ocient system infers individual elements of the tuple using the tuple format `(value1,value2,value3)`. Example: source data `(123,12.75)` is valid text data for a `tuple<<int,decimal(7,2)>>`. | CHAR | TUPLE of the specified types | CSV: `tuple<<int,decimal(7,2),timestamp,binary(7)>>($37) AS int_decimal_timestamp_binary_tuple_col`<br>Multi-input: `tuple<<int,decimal(7,2),timestamp>>($4, $8, $1) AS int_decimal_timestamp_tuple_col`<br>JSON: `tuple<<int,decimal(7,2),timestamp>>($a, $b, $c) AS int_decimal_timestamp_tuple_col` |
| `uuid(x)` | Creates a unique identifier from the specified value. Example source data: `'237e9877-e79b-12d4-a765-321741963000'`. | CHAR | UUID | CSV: `uuid($34) AS uuid_col`<br>JSON: `uuid($a.b.c) AS uuid_col` |
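A pipeline commonly combines several of these casts in one SELECT list. The following fragment is a minimal sketch: the table name, field positions, and column names are invented, and the surrounding pipeline clauses are omitted.

```sql
-- Hypothetical SELECT list combining scalar casts from the table above.
INTO readings SELECT
    int($1)              AS reading_id,
    decimal($2, 8, 4)    AS reading_value,  -- precision 8, scale 4
    boolean($3)          AS is_valid,       -- accepts 't', 'yes', '1', 'y', 'on', ...
    binary($4)           AS payload,        -- hex text such as 'A891' -> bytes
    char[]($names[])     AS name_array,     -- cast a JSON array to CHAR[]
    st_point($5, $6)     AS location        -- longitude, latitude
```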
### Special Data Pipeline Transformation Functions

Special data pipeline transformation functions include functions that only exist in loading, such as the `metadata` function to load special data during pipeline execution.

| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
| --- | --- | --- | --- | --- |
| `explode_outer(array)` | Expands the contents of the specified one-dimensional array. By default, the array can contain a maximum of 8,191 elements. For supported options, see Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu). | ARRAY | Elements of the array, each in its own row | `explode_outer($data.response.responses[].own) AS own` |
| `filter(t[], (t) -> boolean)` | Filters elements in an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to filter. The second argument is a lambda expression (or function reference) that returns true if an element should be retained. The second argument must be of the form `(x T) -> BOOL`; this returns a Boolean value for a specified element `x`. The type `T` of the element `x` is specified in the lambda function. Elements in the first argument array are included in the resulting array when the lambda function returns true and excluded otherwise. For supported options and examples, see Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu). | ARRAY; lambda function `(x T) -> BOOL` or function reference | The elements of the array that matched the filter criteria | `filter($data.names, (name CHAR) -> length(name) > 5)` |
| `metadata(key)` | Extracts the metadata value for the specified key from available metadata for the pipeline. For supported key values, see Load Metadata Values in Data Pipelines (docid: a3n4wkcawrpo1gtefetmm). | CHAR | Varies based on the metadata key | `metadata('filename')` → `'/path/to/file/orders_1830374.json'` |
| `parse_delimited_array(text, rank, array_element_delimiter, bracket_spec)` | Converts a string of text data representing an array into a CHAR[]. For example, with data in the form `[1;2;3]`, executing `parse_delimited_array('[1;2;3]', 1, ';', '[]')` returns a CHAR[] with the value `['1','2','3']`. `text` is CHAR data that represents the text to convert to an array. `rank` is an optional argument that represents the numeric rank of the output array; the default value is 1. You can only specify `rank` as a literal value because you cannot retrieve it dynamically from a source field selector. `array_element_delimiter` is an optional argument that contains the control character for delimiting array elements; for example, ',' indicates that a comma separates each element in the text character data. The default value comes from the ARRAY_ELEMENT_DELIMITER value of the data pipeline. `bracket_spec` is an optional argument that is the pair of control characters for the open and close array brackets; for example, '[]' indicates that each rank of the array is enclosed in square brackets. The default comes from the OPEN_ARRAY and CLOSE_ARRAY characters of the data pipeline. Valid options are `''`, `'[]'`, `'{}'`, `'()'`, and `'<>'`. | CHAR | CHAR[] | `parse_delimited_array($text_array)`<br>`parse_delimited_array($text_array, 1, ';', '[]')`<br>`parse_delimited_array($two_dim_text_array, 2, ',', '[]')` |
| `transform(t[], (t) -> u)` | Transforms an array based on the logic in a lambda expression. The first argument is an expression that evaluates to an array that contains the elements to transform. The second argument is a lambda expression (or function reference) that defines how to transform each element. `transform` returns an array of elements of the same length as the array argument, of the type returned by the lambda function. The second argument must be of the form `(x T) -> U`; this returns a value of type `U` for a specified element `x`. The type `T` of the element `x` is specified in the lambda function. The second argument can also be a function reference for a single-argument function. For supported options and examples, see Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu). | ARRAY; lambda function `(x) -> U` or function reference | Elements of the array transformed to the return value and type of the anonymous function | `transform($data.names[], (el CHAR) -> lower(el)) AS lower_case_name_array`<br>`transform($data.names[], lower) AS lower_case_name_array` |
| `width_bucket(x,min,max,n)` | Creates `n` equal-width buckets in the range [min,max) as a histogram. If `x` is less than `min`, the function returns 0. If `x` is greater than or equal to the value of `max`, the function returns n+1. If `x` is less than the value of `max` and greater than or equal to the value of `min`, the function returns the number of the bucket corresponding to where the input value belongs in the histogram. The values are inclusive on the lower bound and exclusive on the upper bound, so all buckets are the same width. Requesting 100 buckets retrieves 102 possible buckets, bucket 0 through 101, where 0 captures all values below the minimum and 101 captures all values at the maximum and beyond. `x` is the value for bucket determination, `min` is the minimum value for the histogram, `max` is the maximum value for the histogram, and `n` is the number of buckets in the histogram. | Numeric value | INTEGER | `width_bucket($2, 1, 5, 2) AS output` |
| `zip_with(array1, array2, ..., func)` | Merges two or more input arrays, element-wise, into a single array using the trailing combining function. If one array is shorter than the others, the system appends NULL values at the end to match the length of the longest array before applying the function. Returns an array with elements that result from applying the function `func` to each pair of elements from `array1` and `array2`. `func` must be of the form `(x T, y U) -> V`, where `x` is the element from `array1` and `y` is the element from `array2`. Types `T` and `U` are specified in the lambda function. For supported options and examples, see Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu). | ARRAY, ARRAY; lambda function `(x T, y U) -> V` or function reference | An array of length max(length(array1), length(array2)) containing elements of type `V` as returned by the lambda function | `zip_with($line_items[].product_name, $line_items[].sku, (name CHAR, sku CHAR) -> concat(name, '[', sku, ']'))` |
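The lambda-based functions compose naturally in one SELECT list. The sketch below combines the examples from the table above; the source fields are hypothetical, and each lambda follows the `(x T) -> expression` form described there.

```sql
-- Hypothetical SELECT list using lambda-based array functions.
INTO products SELECT
    filter($data.names[], (name CHAR) -> length(name) > 5)  AS long_names,
    transform($data.names[], (el CHAR) -> lower(el))         AS lower_names,
    zip_with($line_items[].product_name, $line_items[].sku,
             (name CHAR, sku CHAR) -> concat(name, '[', sku, ']')) AS labeled_items
```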
### Date and Time Transformation Functions

Date and time transformation functions provide utilities to parse, convert, or transform DATE, TIME, and TIMESTAMP values in pipelines.

| Name | Description | Supported Input Type | Return Type | Example SELECT Expression |
| --- | --- | --- | --- | --- |
| `date(x)` | Converts the specified value to a DATE value. This function always creates dates from a timestamp in UTC time. If the argument is a CHAR type, the function assumes that the value has a textual representation in the `YYYY-MM-DD` format. Supports any date separator character; for example, `YYYY:MM:DD` is also valid. | CHAR, TIMESTAMP | DATE | CSV: `date($24) AS date_col`<br>`date(timestamp($24)) AS date_col`<br>JSON: `date($a.b.c) AS date_col`<br>`date(timestamp($a.b.c)) AS date_col` |
| `int_to_date(arg)` | Converts the specified value to a DATE. This function creates a DATE type from an INT type representing the number of days after the Unix epoch (January 1, 1970). | INT | DATE | `int_to_date($1) AS col_date_1` |
| `micros_to_time(arg)` | Converts the specified value to the TIME type, treating the argument with microsecond precision. | BIGINT | TIME | `micros_to_time($1) AS col_time` |
| `micros_to_timestamp(x)` | Converts the specified value to a TIMESTAMP. This function creates a timestamp from a BIGINT that represents the number of microseconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | `micros_to_timestamp($35) AS timestamp_col` |
| `millis_to_time(arg)` | Converts the specified value to the TIME type, treating the argument with millisecond precision. Alias for `time(x)` where `x` is the BIGINT type. | BIGINT | TIME | `millis_to_time($1) AS col_time` |
| `millis_to_timestamp(x)` | Converts the specified value to a TIMESTAMP. This function creates a timestamp from a BIGINT that represents the number of milliseconds after the Unix epoch (January 1, 1970). Alias for `timestamp(x)` where `x` is the BIGINT type. | BIGINT | TIMESTAMP | `millis_to_timestamp($35) AS timestamp_col` |
| `nanos_to_time(x)` | Converts the specified value to a TIME value. If the argument has the BIGINT type, the function treats the number with nanosecond precision. | BIGINT | TIME | CSV: `nanos_to_time($34) AS time_col`<br>`nanos_to_time(bigint($34)) AS time_col`<br>JSON: `nanos_to_time($a.b.c) AS time_col`<br>`nanos_to_time(bigint($a.b.c)) AS time_col` |
| `nanos_to_timestamp(x)` | Converts the specified value to a TIMESTAMP. This function creates a timestamp from a BIGINT that represents the number of nanoseconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | CSV: `nanos_to_timestamp($35) AS timestamp_col`<br>JSON: `nanos_to_timestamp($a.b.c) AS timestamp_col`<br>`nanos_to_timestamp(bigint(1651681788123456789))` → `2022-05-04T16:29:48.123456789` |
| `now()` | Returns the current timestamp. | None | TIMESTAMP | `now() AS col_load_timestamp` |
| `seconds_to_time(arg)` | Converts the specified value to the TIME type, treating the argument with second precision. | BIGINT | TIME | `seconds_to_time($1) AS col_time` |
| `seconds_to_timestamp(x)` | Converts the specified value to a TIMESTAMP. This function creates a timestamp from a BIGINT that represents the number of seconds after the Unix epoch (January 1, 1970). | BIGINT | TIMESTAMP | CSV: `seconds_to_timestamp($35) AS timestamp_col`<br>JSON: `seconds_to_timestamp($a.b.c) AS timestamp_col`<br>`seconds_to_timestamp(bigint(1651681788))` → `2022-05-04T16:29:48` |
| `time(x)` | Converts the specified value to a TIME value. If the argument has the BIGINT type, the function treats the number with millisecond precision. If the argument is a CHAR type, the function assumes that the value has a textual representation in the `HH:MM:SS` format. Supports any separation character; for example, `HH/MM/SS` is also valid. | CHAR, BIGINT, TIMESTAMP | TIME | CSV: `time($34) AS time_col`<br>`time(bigint($34)) AS time_col`<br>JSON: `time($a.b.c) AS time_col`<br>`time(bigint($a.b.c)) AS time_col` |
| `timestamp(x)` | Converts the specified value to a TIMESTAMP. This function creates a timestamp from a BIGINT that represents the number of milliseconds after the Unix epoch (January 1, 1970). When you convert from the DATE type, the function defaults the timestamp value to the start of the day in UTC time. If the argument is a CHAR type, the function assumes that the value has a textual representation in the `YYYY-MM-DD HH:MM:SS[.SSSSSSSSS]` format; in this case, the nanoseconds are optional. The delimiters in the date-time value cannot be in a different format than the default value. | CHAR, BIGINT, DATE | TIMESTAMP | CSV: `timestamp($35) AS timestamp_col`<br>JSON: `timestamp($a.b.c) AS timestamp_col`<br>`timestamp('2023-03-01 18:22:58.826')` → `2023-03-01T18:22:58.826`<br>`timestamp(bigint(1651681788123))` → `2022-05-04T16:29:48.123` |
| `to_date(x,f)` | Converts the specified value to a DATE. `x` is the value to convert. `f` (optional) is the format string for the interpretation of the textual representation of `x`; the default format is `YYYY-MM-DD`. For details, see Formatting Functions (docid: u47hjysi1oojk6eg2kjut). | CHAR | DATE | `to_date($3, 'YYYY/MM/DD')`<br>`to_date($a.b.c, 'DD-MM-YYYY')` |
| `to_time(x,f)` | Converts the specified value to a TIME value. `x` is the value to convert. `f` (optional) is the format string for the interpretation of the textual representation of `x`; the default format is `HH:MI:SS`. | CHAR | TIME | `to_time($n, 'HH:MI:SS')`<br>`to_time($n, 'HHMISS')` |
| `to_timestamp(x,f,ft)` | Converts the specified value to a TIMESTAMP. `x` is the value to convert. `f` (optional) is the format string for the interpretation of the textual representation of `x`; the default format is `YYYY-MM-DD HH:MI:SS`. `ft` (optional) is the format type. There are two format types: "ocient" and "java". If you use the "ocient" type, the function uses the Ocient-based formatting function to parse the timestamp. If you use the "java" type, the function uses the DateTimeFormatter in Java to parse the timestamp; for details, see [DateTimeFormatter (Java Platform SE 8)](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html). The default is the "ocient" type. If you specify "java", you must specify a valid DateTimeFormatter format string, as there is no default. For details, see Formatting Functions (docid: u47hjysi1oojk6eg2kjut). | CHAR | TIMESTAMP | `to_timestamp($n, 'YYYY-MM-DD HH:MI:SS.US')`<br>`to_timestamp($n, 'DD-MM-YYYY HH:MI:SS.NS')`<br>`to_timestamp($n, 'uuuu-MM-dd HH:mm:ss.SSSSSS', 'java')` |

For the `to_timestamp` function, the Ocient formatter currently does not support variable-length sections of timestamps (i.e., the data might have a different number of fractions of seconds across records). For example, assume this source data: `2024-01-01 12:00:00.123456` and `2024-01-01 12:00:00.1234`. Then, you must specify the format type as "java" and enter the DateTimeFormatter format string to parse the timestamps.
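As a quick illustration, a hypothetical SELECT list might mix several of these conversions; the field names below are invented, and the format strings mirror those shown in the table.

```sql
-- Hypothetical SELECT list mixing date and time conversions.
INTO events SELECT
    to_timestamp($event_time, 'YYYY-MM-DD HH:MI:SS.US') AS event_ts,    -- microsecond text
    millis_to_timestamp(bigint($epoch_ms))              AS created_ts,  -- epoch milliseconds
    date(timestamp($created_at))                        AS created_date,
    now()                                               AS load_ts      -- load-time timestamp
```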
### Character Data Transformation Functions

Character data transformation functions operate on VARCHAR data to transform a string of data.
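As a minimal sketch, the fragment below applies string functions named in the note that follows (`substring`, `length`) to source fields; the argument conventions, table name, and field names are illustrative. Casting with `char()` first ensures multi-byte text is handled correctly, as the note explains.

```sql
-- Minimal sketch: cast to CHAR before applying string functions
-- so multi-byte (e.g., UTF-8) input is handled correctly.
INTO customers SELECT
    substring(char($description), 1, 100) AS short_description,
    length(char($full_name))              AS name_length
```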
When you use the `substring`, `right`, and `left` functions with delimited or JSON-formatted data sources, the Ocient system does not treat a source field reference (e.g., `$your_field`) as multi-byte character data; each byte is treated as a single-byte ASCII character. To handle multi-byte character data like UTF-8 in a source field, you must cast the source field using `char($your_field)` or use another function that returns CHAR data.

When you use the `length`, `char_length`, and `character_length` functions with the `substring`, `right`, and `left` functions, you might experience unintended results for multi-byte characters, such as Arabic in UTF-8, if you do not cast the source field reference to CHAR.

### Binary Data Transformation Functions

Binary data transformation functions operate on bytes provided when you use the binary format type in a pipeline. These functions are typically used in processing mainframe data outputs.
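As a minimal, hypothetical sketch of byte-oriented shaping, the fragment below uses the `binary()` and `hash()` casts from the scalar table on binary-oriented fields; all table and field names are invented.

```sql
-- Hypothetical sketch using scalar byte casts on binary-oriented fields.
INTO raw_records SELECT
    binary($payload_hex)  AS payload,    -- hex text such as 'F1D890' -> bytes
    hash($key_hex, 9)     AS record_key  -- fixed 9-byte binary array
```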
### Array Data Transformation Functions

### Logical Operations Transformation Functions

These transformation functions execute logical operations on the data in the data pipeline.

### Network Type Functions

Network type functions are also supported in loading. For example, `subnet(ip($9), 13) AS ipv6_subnet_form1` in the CREATE PIPELINE SELECT SQL statement converts the input data in the column to an IPv6-mapped IPv4 address and then computes the prefix of size 13 for that address. For the supported network type functions, see Network Type Functions (docid: bjcnqtthbmpj thiwmdlw).

### Other Data Transformation Functions

Other data transformation functions include the extraction of values from the JSON format and miscellaneous functions such as `coalesce`, which are commonly used when working with data pipelines.
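Tying the last two groups together, this hypothetical fragment reuses the `subnet` example from the network discussion above alongside `coalesce`; the field references and fallback value are invented.

```sql
-- Hypothetical fragment combining a network function with coalesce.
INTO network_events SELECT
    subnet(ip($9), 13)                 AS ipv6_subnet_form1,  -- prefix of size 13
    coalesce($optional_tag, 'unknown') AS tag                 -- fall back when source is missing
```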
## DEFAULT Keyword

You can use the DEFAULT keyword to include the configured default value of a column in transformation functions. Example: `coalesce($my_field, DEFAULT) AS my_column`.

## Lambda Expressions

Lambda expressions of the form `(x T) -> expression` appear as arguments to array functions such as `filter`, `transform`, and `zip_with`. For details, see Special Data Pipeline Transformation Functions (docid: 7h6mczxhldiameojdksvu).

## User-Defined Data Pipeline Functions

You can create a user-defined data pipeline function using the CREATE PIPELINE FUNCTION SQL statement and Groovy. After you create data pipeline functions, you can execute them in data pipelines like any other function to transform records. For more details, see Data Pipelines (docid: l8tdfpfzzvzeyabc2h7bq) and Data Types for User-Defined Data Pipeline Functions (docid: vk8kyybwfton5ax4e q1i).

## Related Links

- Formatting Functions (docid: u47hjysi1oojk6eg2kjut)
- Data Types for Data Pipelines (docid: 7s5nztl8lolpwt2pcnjyz)
- Data Formats for Data Pipelines (docid: 1 jhzrhblgnqucl6skiaq)
- Load Metadata Values in Data Pipelines (docid: a3n4wkcawrpo1gtefetmm)
- Data Pipeline Load of JSON Data from Kafka (docid: 2ua4kqafcqplu gi6oorz)
- Data Pipeline Load of Parquet Data from S3 (docid: yqk wibdyxiq8dxewhxhf)
- Data Pipeline Load of CSV Data from S3 (docid: 5xxtimfhjnlyxs48wxsxs)