Load Data

Data Formats for Data Pipelines

Loading differs in subtle ways depending on the data format of the source. Loading runs with strict interpretation of the source data so that pipelines can achieve maximum performance. For text-based formats like JSON and DELIMITED, the Ocient System performs no preemptive casting on the data when you use a source field selector.

For example, with the JSON string { "my_field": 1234 }, the selector $my_field returns the string "1234" not the integer 1234.

When you use transformation functions, keep in mind that the Ocient System treats all data in the JSON and DELIMITED formats as text data.

When the Ocient System sends data you select to a final target column, the system automatically casts the data in the final step to ensure that the data is compatible with the target column type. See Data Types for Data Pipelines for supported automatic conversion rules.

Format-specific differences also appear in pipelines, as the following sections describe.

Load Delimited and CSV Data

When you load data from delimited or CSV files, the Ocient System tokenizes the data during loading. The system detects records and fields in the input data during pipeline execution. You can reference fields and use them in combination with transformation functions before the system stores values in the column of a target table.

For these formats, you reference fields of the source data by using a field index. The index is a number that follows the dollar sign $. To maintain consistency with SQL array semantics, field indexes start at 1.

Reference the first field of tokenized source records for the DELIMITED and CSV formats as $1.

For the BINARY format, $0 represents the entire record. In this case, you must specify $0 in combination with the SUBSTRING function to extract specific bytes from the source data.
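For example, this sketch extracts the first 20 bytes of each binary record (the column name is illustrative):

SQL

SELECT SUBSTRING($0, 1, 20) AS first_name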

For a complete list of supported options for the DELIMITED and CSV data formats, see Delimited and CSV Extract Options.

Delimited Loading Example

Use this example delimited data, which matches the field references in the following table. The formatting of the array in field 6 assumes the default array settings for delimited data.

Text
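iphone|60607|viewed|502|293.99|['shopping','news']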


For this example row, this table shows the field references for each value in the row.

Field Reference    Value
$1                 iphone
$2                 60607
$3                 viewed
$4                 502
$5                 293.99
$6[]               ['shopping','news']

To load this data in a pipeline with the DELIMITED data format, this CREATE PIPELINE statement specifies the | character as the field delimiter. The statement loads data from S3. The SELECT statement uses fields 1, 2, 3, 5, and 6 of the source data and omits field 4 from the target table. Field 6 is an array that matches the default array settings for delimited data. Indicate this with array brackets, as in $6[], to load the field into a CHAR[] typed column.

The outer casting functions in this example are optional and shown for completeness. If they are omitted, the pipeline automatically casts the source fields to the target column type.

SQL
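-- A sketch of the statement. The pipeline, bucket, table, and column names
-- are illustrative, and option spellings such as FIELD_DELIMITER follow the
-- Delimited and CSV Extract Options reference.
CREATE PIPELINE events_pipeline
SOURCE S3
    BUCKET 'example-bucket'
    FILTER 'events/*.txt'
EXTRACT
    FORMAT DELIMITED
    FIELD_DELIMITER '|'
INTO public.events
SELECT
    CHAR($1) AS product,        -- outer casts are optional
    CHAR($2) AS zip_code,
    CHAR($3) AS event_type,     -- field $4 is intentionally not loaded
    DOUBLE($5) AS price,
    $6[] AS tags;               -- loads into a CHAR[] column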


Load JSON Data

The data pipeline syntax enables the loading of JSON data, including nested scalars, arrays, and points (ST_POINT).

Strict Loading and Transformations

When you use transformation functions, remember that the Ocient System treats all data in JSON and DELIMITED format as text data, not the logical data type.

For example, if you specify the JSON string { "my_timestamp": 1709208000000 }, the selector $my_timestamp returns the string "1709208000000" and not the integer 1709208000000.

As a result, if you cast this data into a timestamp column, such as TIMESTAMP($my_timestamp) AS created_at, the Ocient System returns an error. The conversion fails because the cast resolves to TIMESTAMP(VARCHAR), which expects a format like YYYY-MM-DD HH:MM:SS[.SSSSSSSSS].

To correct this issue, cast the value explicitly to use the TIMESTAMP(BIGINT) function, which treats the argument as milliseconds after the epoch, as in TIMESTAMP(BIGINT($my_timestamp)) AS created_at.

Supported JSON Selectors

JSON selectors consist of $ followed by a dot-separated list of JSON keys. If a key refers to an array, it is followed by a set of brackets [] to correspond to its dimensionality. If the square brackets contain an index, like [1], then the selector refers to an array element.

The Ocient System treats JSON selectors as lowercase. To use case-sensitive selectors, you must enclose the selector in double quotation marks. For example, $"testSelector". For case-sensitive selectors with multiple JSON keys, each key needs its own double quotation marks. For example, $"testData"."Responses"."SuccessResponse".

For special characters (any identifier that starts with any character other than a letter or contains any character that is not a letter, number, or an underscore) or reserved SQL keywords (such as SELECT), you must enclose such selectors in double quotation marks. For example, if you have a JSON document { "test-field": 123 }, then the selector for the query should be $"test-field".

The Ocient System does not support identifiers with a backslash as the last character in the key name.

Selector

Description

Examples

$a

Scalar — Refers to key a at the top of a JSON document, where the value of a is a scalar.

{ "first_name": "Chris" }

$first_name -> "Chris"

$a[]

Array — Refers to key a at the top of a JSON document, where the value of a is a one-dimensional array.

{ "names": ["Chris", "Joe"] }

$names[] -> ["Chris","Joe"]

$a[1]

Array Element Selection — Selects a single element from the array a at the top of a JSON document, where the value of a is a one-dimensional array. Also works on multi-dimensional arrays. Indexes start at 1.

{ "names": ["Chris", "Joe"] }

$names[2] -> "Joe"

$a[][]

Multi-dimensional Array — Refers to key a at the top of a JSON document, where the value of a is a two-dimensional array.

{ "coordinates": [ [87.31, 18.25],[65.22, 19.41]] }

$coordinates[][] -> [[87.31,18.25],[65.22,19.41]]

$a.b.c

Nested Selector — Drills into the value at key a, then value at key b, and refers to the value at key c, where none of the values are arrays.

{ "user": { "name": "Chris" } }

$user.name -> "Chris"

$a[].b

Array Projection — A projection applies a JSON selector on all elements in an array, returning an array.

Array brackets in the notation indicate which items are the arrays.

Here, a[] is an array, and the b selector applies to each element of the array a.

{ "users": [ { "name": "Chris", "orders": [ { "order_id": 1, "subtotal": 19.29 }, { "order_id": 2, "subtotal": 16.71 } ] }, { "name": "Joe", "orders": [ { "order_id": 3, "subtotal": 17.29 }, { "order_id": 4, "subtotal": 22.49 } ] } ] }

$users[].name → ["Chris", "Joe"]

$a.b[].c.d[]

Multi-level Array Projection — Applies selectors to each element in the specified arrays and allows multiple levels of objects and arrays for the selection of an N-dimensional array of values.

Here, the values of b and d are arrays, so the result is a two-dimensional array. The selector c.d[] is projected onto each element of the array b.

{ "users": [ { "name": "Chris", "orders": [ { "order_id": 1, "subtotal": 19.29 }, { "order_id": 2, "subtotal": 16.71 } ] }, { "name": "Joe", "orders": [ { "order_id": 3, "subtotal": 17.29 }, { "order_id": 4, "subtotal": 22.49 } ] } ] }

$users[].orders[].order_id -> [[1,2],[3,4]]

$a.b[1].c.d[]

Array Slice — Selects a single element from the array b and drills into c and then the array d. As with other JSON array element selectors, indexes start at 1.

{ "users": [ { "name": "Chris", "orders": [ { "order_id": 1, "subtotal": 19.29 }, { "order_id": 2, "subtotal": 16.71 } ] }, { "name": "Joe", "orders": [ { "order_id": 3, "subtotal": 17.29 }, { "order_id": 4, "subtotal": 22.49 } ] } ] }

$users[2].orders[].subtotal -> [17.29,22.49]

$a.{b,c}

Tuple Selector — Selects the values from an object as a tuple. You must cast the values to an appropriate TUPLE data type.

{ "user": { "first_name": "John", "last_name": "Doe" }}

$user.{first_name,last_name} -> <<"John","Doe">>

$a[].{b,c}

Array of Tuples Selector — Selects the values from an array of objects as an array of tuples. You must cast the values to an appropriate TUPLE[] data type.

{ "users": [ { "first_name": "John", "last_name": "Doe" }, { "first_name": "Steve", "last_name": "Smith" } ]}

$users[].{first_name,last_name} -> [<<"John","Doe">>,<<"Steve","Smith">>]

For more examples of using JSON selectors in data pipelines, see JSON Selectors Examples in Data Pipelines.

NULL and Empty Handling for JSON Scalars

The Ocient System handles all JSON NULL, empty, and missing values in the same way. The system loads these values as NULL. These values fail to load into non-nullable columns.

Provide an explicit default in the pipeline using IF_NULL or COALESCE or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL values.
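For example, this sketch substitutes an empty string when the source value is NULL, empty, or missing (the key and column names are illustrative):

SQL

SELECT IF_NULL($middle_name, '') AS middle_name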

NULL and Empty Handling for JSON Arrays

The Ocient System handles NULL, empty, and missing values the same way for arrays as for scalars. The system converts a value that is NULL, empty, or missing to NULL and loads it as NULL.

Provide an explicit default in the pipeline or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL.

NULL and Empty Handling for JSON Tuples

All the rules for handling NULL, empty, and missing elements that apply to scalars and arrays also apply to tuples. If any part of the selector is NULL, empty, or missing, data pipeline loading converts that value to NULL.

Additionally, because you can apply functions to tuple elements (and not array elements), you can use the NULL_IF function to convert a tuple element to NULL. For example, tuple<<char,varchar>>($a.name, NULL_IF($a.hometown, 'N/A')) indicates to the pipeline that the string 'N/A' signifies NULL for the hometown element but not for the name element.

Load Binary Data

The Ocient System loads the binary data format using a fixed record length to split a binary stream into chunks that represent records. Each record is available in the SELECT portion of a pipeline definition using a special binary extract syntax $"[5,8]". This operates similarly to a substring function, beginning at byte 5 and taking 8 bytes from that location. The starting index is a 1-based offset, consistent with other SQL arrays and offsets. You can use this syntax to select specific bytes within a record to parse together as a unit.

Binary Selector

Selector

Description

Examples

$"[<start_index>,<length_in_bytes>]"

Extract Bytes — Extracts <length_in_bytes> bytes from the record, beginning at <start_index>.

<start_index> is 1-based.



abc123.45xyz

$"[4,6]" -> "123.45"

This example illustrates the operation using ASCII characters, but the Binary Selector returns BINARY data (byte arrays) that the fixed-width processing functions described in Binary Data Transformations can use in special ways.

Example

The Binary Selector takes 8 bytes starting at offset 11 on the fixed-width binary record. Consistent with SQL functions in the Ocient System, the first argument value 11 is the 1-based offset into the byte array.

SQL
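-- A sketch of the SELECT portion of a pipeline; the column name is illustrative.
SELECT $"[11,8]" AS raw_code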


The Binary Selector returns BINARY data, not VARCHAR.

Special BINARY transformation functions can operate on this BINARY data. However, if you cast data to the VARCHAR type by using CHAR(), then functions like INT operate on this data as VARCHAR data, not binary data.

When you load binary data into VARCHAR columns, the Ocient System automatically converts from binary to character data using the configured CHARSET_NAME before final loading.

The Ocient System supports special transformation functions that operate uniquely on binary data. With these functions, you can convert binary representations from mainframe systems such as packed decimals, zoned decimals, big and little endian integers (signed and unsigned), and floating point values. For more details, see Binary Data Transformation Functions.

For a complete list of supported options for the BINARY data format, see Binary Extract Options.

Binary Loading Example

If each record in your fixed-width binary schema includes these fields, you can use the Binary Selector and the transformations shown in this example.

Column         Start Index    Length (Bytes)    Source Data Type                        Ocient Target Data Type
first_name     1              20                Character                               VARCHAR
last_name      21             20                Character                               VARCHAR
age            41             4                 Big Endian signed 4-byte integer        INT
total_spent    45             10                Packed Decimal                          DECIMAL(10,2)
user_id        55             8                 Little Endian unsigned 8-byte long      BIGINT

Each record includes 62 bytes, so the record length RECORD_LENGTH is 62. The encoding of this file is CP500 instead of the default IBM1047 code page. The CREATE PIPELINE SQL statement specifies this encoding.

SQL
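-- A sketch of the statement. The pipeline, bucket, and table names are
-- illustrative, and the argument forms for the packed decimal and little
-- endian unsigned conversions are hypothetical. See Binary Data
-- Transformation Functions for the supported syntax.
CREATE PIPELINE users_binary_pipeline
SOURCE S3
    BUCKET 'example-bucket'
    FILTER 'users/*.bin'
EXTRACT
    FORMAT BINARY
    RECORD_LENGTH 62
    CHARSET_NAME 'CP500'
INTO users
SELECT
    $"[1,20]" AS first_name,                              -- decoded automatically using CP500
    $"[21,20]" AS last_name,
    INT($"[41,4]") AS age,                                -- defaults: big endian, signed
    DECIMAL($"[45,10]", 10, 2, 'packed') AS total_spent,  -- hypothetical packed decimal argument
    BIGINT($"[55,8]", 'little', false) AS user_id;        -- hypothetical endianness and sign arguments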


This SQL statement:

  • Uses the Binary Selector to extract names and load them into the respective columns. The Ocient System automatically decodes the values using CP500 and loads them into a VARCHAR column. An explicit cast such as CHAR($"[1,20]") AS first_name works equivalently.
  • Extracts the four bytes that represent age from bytes 41-44 and casts these bytes as an integer with INT. This function uses the default endianness (big) and treats the bytes as signed. Unsigned values can overflow target columns because integral types are all signed.
  • Extracts the 10 bytes for total_spent using the Binary Selector and converts the values using the packed decimal option of the DECIMAL cast. The cast requires specifying the number of decimal places in the source data. In this case, there are two decimal places, which matches the target column.
  • Extracts the 8 bytes that represent user_id using the Binary Selector and casts these bytes to a BIGINT while interpreting the bytes as unsigned with the little endian representation.

Load Parquet Data

The data pipeline functionality enables loading files with this configuration.



To specify the data to load, use selectors as you do when loading JSON data.



When you use the FORMAT PARQUET option with an AWS S3 source, the ENDPOINT option is required in the CREATE PIPELINE SQL statement.

Auto-casting in Parquet does not support the automatic conversion to VARCHAR columns. You must explicitly cast data to the CHAR data type when you convert Parquet data that is not string data to a VARCHAR column or VARCHAR function argument.

Schema Evolution

The Ocient System supports schema evolution when you load a set of Parquet files. Specifically, if the pipeline selects a set of Parquet files where an individual file might have more or fewer columns than another, the system attempts to merge those schemas together to support loading without requiring you to create the pipeline again.

For example, the test_table table has three columns.

SQL
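-- A hypothetical three-column table for this example; index and storage
-- clauses are omitted.
CREATE TABLE test_table (
    id BIGINT NOT NULL,
    col_a INT,
    col_b VARCHAR(255)
);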


You have two Parquet files with these schemas, shown here with illustrative column names:

Text
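file_1.parquet: id (INT64), col_a (INT32)
file_2.parquet: id (INT64), col_a (INT32), col_b (BYTE_ARRAY / STRING)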


However, you must specify how to handle schema evolution within the EXTRACT portion of the CREATE PIPELINE statement. You can choose to sample only the first file for its schema or sample the entire data set to merge the schemas together. This DDL statement samples one file.

SQL
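-- A sketch; the schema sampling option name shown here is hypothetical.
-- See the Parquet extract options for the exact syntax.
CREATE PIPELINE parquet_evolution_pipeline
SOURCE S3
    BUCKET 'example-bucket'
    FILTER 'data/*.parquet'
EXTRACT
    FORMAT PARQUET
    SCHEMA_SAMPLING 'FIRST_FILE'
INTO test_table
SELECT
    $id AS id,
    $col_a AS col_a,
    $col_b AS col_b;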


The disadvantage of sampling all files is that it can take a long time (scaling with the number of files in the data set) when you execute the CREATE PIPELINE and START PIPELINE SQL statements. If you know that all of the Parquet files have the same schema, use the preceding syntax that samples only one file.

The Ocient System does not support the case where a column within the schema changes type, for example, if col_a is an INT type in one file and a VARCHAR type in another.

The default behavior of schema evolution is inferring the schema from all files. Use this syntax to infer from all files.

SQL
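-- A sketch; as above, the schema sampling option name is hypothetical.
EXTRACT
    FORMAT PARQUET
    SCHEMA_SAMPLING 'ALL_FILES'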


Supported Data Types in Parquet

Parquet data types are separated into primitive and logical types. The Ocient System converts these types to Ocient SQL types. See these tables for the respective conversions.

Parquet Primitive Type    Ocient SQL Type
BOOLEAN                   BOOLEAN
INT32                     INT
INT64                     BIGINT
INT96                     TIMESTAMP
FLOAT                     FLOAT
DOUBLE                    DOUBLE
BYTE_ARRAY                VARCHAR
FIXED_LEN_BYTE_ARRAY      VARCHAR

Parquet Logical Type      Ocient SQL Type
STRING                    VARCHAR
UTF8                      VARCHAR
ENUM                      VARCHAR
UUID                      UUID
INT8                      TINYINT
INT16                     SMALLINT
INT32                     INT
INT64                     BIGINT
UINT8                     SMALLINT
UINT16                    INT
UINT32                    BIGINT
UINT64                    BIGINT
DECIMAL                   DECIMAL
DATE                      DATE
TIME                      TIME
TIME_MILLIS               TIME
TIME_MICROS               TIME
TIMESTAMP                 TIMESTAMP
TIMESTAMP_MILLIS          TIMESTAMP
TIMESTAMP_MICROS          TIMESTAMP
DURATION                  BIGINT
JSON                      VARCHAR
BSON                      VARCHAR

  • The INTERVAL data type is not supported.
  • The UINT64 data type can overflow the BIGINT conversion.
  • The DURATION data type conversion to BIGINT preserves the underlying units. For example, the number of microseconds stays as microseconds in the BIGINT data type.

Further, Parquet contains nested types that the Ocient System also converts to SQL types, as shown in this table.

Parquet Nested Type       Ocient SQL Type
LIST                      TYPE[]
TUPLE                     TUPLE

Parquet Partition Keys

With Parquet, you can load partition keys from a partitioned Parquet data set by using the file path structure. Use the filename metadata selector along with transformation functions.

Assume files with these file paths.

Shell
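# Hypothetical partitioned file paths; the date partition key is encoded
# in each directory name.
s3://example-bucket/events/date=2024-01-01/part-0000.parquet
s3://example-bucket/events/date=2024-01-02/part-0000.parquet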


Construct the selector using this code to create a DATE field.

SQL
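-- A sketch; the filename metadata selector is written here as
-- METADATA('filename'), which is illustrative. The expression extracts the
-- 10 characters after 'date=' in the file path and casts them to DATE.
SELECT DATE(SUBSTRING(METADATA('filename'),
    POSITION('date=' IN METADATA('filename')) + 5, 10)) AS event_date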


For details about transformation functions, see Transform Data in Data Pipelines.

Parquet Example

Create a data pipeline that loads Parquet files using an AWS S3 bucket. Specify the bucket, endpoint, access key identifier, secret access key, and filter options to find all Parquet files in the specified file path. Use the parquet_base_table table to store the loaded data. Retrieve integer, text, floating point, double, integer, JSON, and BSON fields.

SQL
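-- A sketch of the statement; the credentials, bucket, endpoint, and
-- selector names are illustrative.
CREATE PIPELINE parquet_example_pipeline
SOURCE S3
    ENDPOINT 'https://s3.us-east-1.amazonaws.com'
    BUCKET 'example-bucket'
    ACCESS_KEY_ID 'EXAMPLE_KEY_ID'
    SECRET_ACCESS_KEY 'EXAMPLE_SECRET'
    FILTER 'data/*.parquet'
EXTRACT
    FORMAT PARQUET
INTO parquet_base_table
SELECT
    $int_col AS int_col,
    $string_col AS string_col,
    $float_col AS float_col,
    $double_col AS double_col,
    $bigint_col AS bigint_col,
    $json_col AS json_col,
    $bson_col AS bson_col;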


Related Links