Skip to main content
Loading in differs in subtle ways that depend on the data format of the source. Loading runs with a strict interpretation of source data to allow pipelines to achieve maximum performance. For text-based formats like JSON and DELIMITED, the Ocient System performs no preemptive casting on the data when using a source field selector. For example, with the JSON string { "my_field": 1234 }, the selector $my_field returns the string "1234" not the integer 1234. When you use transformation functions, keep in mind that the Ocient System treats all data in the JSON and DELIMITED formats as text data. While the Ocient System sends data you select to a final target column, the system automatically casts the data in the final step to ensure that the data is compatible with the target column type. See Data Types for Data Pipelines for supported automatic conversion rules. Format-specific differences also appear in the pipelines.

Load ASN.1 Data

You can load data in ASN.1 (Abstract Syntax Notation One) format from binary-encoded ASN.1 files using DER-encoded or BER-encoded files. ASN.1 provides a flexible, schema-driven format commonly used in telecommunications, security, and standardized protocols. This format allows you to extract structured records and map them to relational tables using SQL. The Ocient System requires .ber and .der files to contain one or more concatenated DER-encoded or BER-encoded values with the specified record type. The system decodes each record independently and maps it into a record.

ASN.1 Type Mapping

The system automatically converts all decoded ASN.1 fields to their JSON-equivalent representations.
The ASN.1 schema must consistently use implicit or explicit tagging. You must specify clear tagging so the system can resolve field names during extraction.
ASN.1 TypeJSON MappingDescription
INTEGERNumberThe number, specified as an integer.

Example: 5
BOOLEANBooleanThe logical value, specified as a Boolean.

Example: true
VisibleStringStringThe ASN.1 type assumes UTF-8 encoding.

Example: "Data Analyst"
IA5StringStringThe International Alphabet number 5 (IA5) includes most of the ASCII alphabet but can contain other characters.

Example: "ABCD"
UTF8StringStringThe ASN.1 type assumes UTF-8 encoding.

Example: "abcd"
OCTET STRINGHexadecimal stringThe hexadecimal string has the prefix 0x.

Example: "0x4fa2"
BIT STRINGHexadecimal stringThe hexadecimal string with preserved bit alignment has the prefix 0x.

Example: "011110"
OBJECT IDENTIFIERStringThe unique identifier for an object using dotted representation.

Example: "1.2.840.113549"
ENUMERATEDNumber or symbolic stringThe enumerated string, if the data resolves into a list of named items.

Example: "red"
ANYHexadecimal stringThe hexadecimal string with raw encoded bytes of the embedded value.

Example: "0x616263"
SEQUENCE, SETJSON objectObject with fields that are accessible by their component names.

Example:
WeatherRecord ::= SEQUENCE { temperature INTEGER, humidity INTEGER, cloudy BOOLEAN }
SEQUENCE OF, SET OFJSON arrayArray of objects where the elements are fully decoded.

Example:
WeatherEvents ::= SEQUENCE OF WeatherRecord
CHOICEJSON objectAn object with fields of any type. Identify the active alternative by the key.

Example:
WeatherReport ::= CHOICE { sunny SEQUENCE { uv\_index INTEGER }, rainy SEQUENCE { precipitation INTEGER } }
DATEStringDate in the format "yyyyMMdd". You must explicitly declare the date, or the system must infer it using tagging.

Example: "20240518"
TIMEStringTime in the format "HHmmss".

Example: "113056"
DATE-TIMEStringDate and time in the format "yyMMddHHmmss'Z'". This format supports only UTC. You must include 'Z' as a literal.

Example: "250920101508'Z'"
To access fields, use dot notation for access: $sequence.fieldName. For arrays (such as SEQUENCE OF), use bracket notation: $sequenceOf[].fieldElement. Default values and optional fields follow standard ASN.1 rules. If you omit a field, the system evaluates it as NULL.

ASN.1 Loading Example

Assume you have the personnel_records.asn file in the ASN.1 data format. The file contains the definition of a personnel record Example.PersonnelRecord. The ASN.1 file contains data for the personnel record, child, personnel name, employee number of the personnel, and the date.
Text
Example DEFINITIONS IMPLICIT TAGS ::=
BEGIN
    PersonnelRecord ::= [APPLICATION 0] SET {
		name         [0] Name,
		title        [1] VisibleString,
		number       [2] EmployeeNumber,
		dateOfHire   [3] Date,
		nameOfSpouse [4] Name,
		children     [5] SEQUENCE OF ChildInformation DEFAULT {}
    }
    ChildInformation ::= SET {
		name        [0] Name,
		dateOfBirth [1] Date
    }
    Name ::= [APPLICATION 1] SEQUENCE {
		givenName  [0] VisibleString,
		initial    [1] VisibleString,
		familyName [2] VisibleString
    }
    EmployeeNumber ::= [APPLICATION 2] INTEGER
    Date ::= [APPLICATION 3] VisibleString
END
Create a table to contain the personnel record. The table contains a subset of the data:
  • first_name — First name
  • initial — Initial of the middle name
  • family_name — Last name
  • title — Job title
  • number — Employee number
  • date_of_hire — Hire date
SQL
CREATE TABLE personnel_records(
    first_name VARCHAR NOT NULL,
    initial VARCHAR NOT NULL,
    family_name VARCHAR NOT NULL,
    title VARCHAR NOT NULL,
    number BIGINT NOT NULL,
    date_of_hire DATE NOT NULL
);
Create the data pipeline personnel_pipeline to load the personnel record into the personnel_records table using an S3 bucket. Specify the bucket, endpoint, access key identifier, secret access key, and filter options to find the ASN.1 DER-encoded file personnel_records.der in the specified file path. Use the URL file path http://cos/filepath/asn1/personnel_records.asn and record type Example.PersonnelRecord. Access the name sequence using dot notation for the first name, middle initial, and last name fields. The pipeline definition transforms the hire date to the 'yyyyMMdd' format.
SQL
CREATE PIPELINE personnel_pipeline
    SOURCE S3
        BUCKET 'misc'
        ENDPOINT 'http://cos'
        ACCESS_KEY_ID ''
        SECRET_ACCESS_KEY ''
        FILTER_GLOB 'user/asn1/personnel_records.der'
    EXTRACT
        FORMAT 'asn.1'
        SCHEMA {
            URL 'http://cos/filepath/asn1/personnel_records.asn'
            RECORD_TYPE 'Example.PersonnelRecord'
        }
INTO personnel_records
SELECT
    $name.givenName AS first_name,
    $name.initial as initial,
    $name.familyName as family_name,
    $title as title,
    $number as number,
    TO_DATE($dateOfHire, 'yyyyMMdd') as date_of_hire;

Load Avro Data

The Ocient System enables you to load data in the format. You can use a streaming source with a file-based source only. Load an inline schema definition or use a schema configuration. Use a schema inference from files with embedded schemas. Field selectors in Avro follow the same format as selectors in JSON and formats.
The Ocient System treats Avro selectors as lowercase. To use case-sensitive selectors, you must enclose the selector in double quotation marks. For example, $"testSelector".

Inline Schema

Specify a JSON string in the Avro schema format in the INLINE option of the schema definition of the CREATE PIPELINE SQL statement. Inline schemas assume that all records follow the defined schema exactly. These schemas do not support schema evolution.

Schema Inference from a File

The system can infer the schema from a file that has embedded schemas. The file is a named object container file. Use the INFER_FROM option in the CREATE PIPELINE SQL statement to specify sampling one file.

Schema Evolution

When you create a data pipeline, the pipeline has a fixed target schema (specified by the SELECT clause). Individual files might have different schemas. The target schema must be backward-transitive compatible with the other schemas. Individual files might have different schemas. If the other schemas change, the system automatically attempts to fit data into the target schema. In this case, the other schemas must be forward compatible with the target schema. The system ignores changes to any unused fields from the target schema.
Multiple schemas impact the performance of the data pipeline execution. For best performance, use a single schema for all data.

Avro Type Mapping

The Ocient System converts these Avro data types to Ocient SQL types. This table shows the respective conversions.
Avro TypeOcient SQL Type
nullNULL
booleanBOOLEAN
intINT
longBIGINT
floatFLOAT
doubleDOUBLE
bytesVARBINARY
stringVARCHAR
recordSelect subfields of the record using the standard pipeline selector syntax, e.g., $myrecord.subfield.
enumVARCHAR
arrayARRAY
mapSelect entries of the map using the key with the standard pipeline selector syntax, e.g., $mymap.key.
fixedHASH
decimalDECIMAL
big-decimalDECIMAL
uuidUUID
dateDATE
time-*TIME
timestamp-*TIMESTAMP
local-timestamp-*TIMESTAMP
durationThe system converts duration to a record with integer subfields for months, days, and milliseconds.
unionThe system converts a union between NULL and another type to a nullable version of that type. In contrast, the system converts a union containing multiple non-NULL types to a string.

Avro Loading Examples

Create the users table with these columns:
  • id — Universally Unique IDentifier (UUID) of the user
  • firstname — First name of the user
  • lastname — Last name of the user
  • birthyear — Year of birth
  • groups — List of groups where the user belongs
SQL
CREATE TABLE users(
    id         UUID NOT NULL,
    firstname  VARCHAR(255) NOT NULL,
    lastname   VARCHAR(255) NOT NULL,
    birthyear  INT,
    groups     VARCHAR(255)[] NOT NULL DEFAULT 'char[]'
);
These examples use this table as the target table for the load. Load Avro Data from Files Assume you have user data in Avro format in multiple files in the /data/users directory. Create the users_pipeline data pipeline for the Avro files *.avro containing user data. The schema configuration instructs the system to infer from one file using the INFER_FROM option. Access the array of strings for the groups column.
SQL
CREATE PIPELINE users_pipeline
SOURCE filesystem
    FILTER '/data/users/*.avro'
EXTRACT
    FORMAT avro
    SCHEMA {
        INFER_FROM 'sample_file'
    }
INTO users
SELECT
    $id         AS id,
    $firstname  AS firstname,
    $lastname   AS lastname,
    $birthyear  AS birthyear,
    $groups[]   AS groups;
Load Avro Data from an Inline Schema Definition Assume you have user data in Avro format in multiple files in the /data/users directory. Create the users_pipeline data pipeline for the Avro files *.avro containing user data. The schema configuration instructs the system to use an inline schema definition with the INLINE option. Access the array of strings for the groups column.
SQL
CREATE PIPELINE users_pipeline
SOURCE filesystem
    FILTER '/data/users/*.avro'
EXTRACT
    FORMAT avro
    SCHEMA {
        INLINE '{
            "type": "record",
            "name": "User",
            "namespace": "test.users",
            "fields": [
                { "name": "id",        "type": { "type": "string", "logicalType": "uuid" } },
                { "name": "firstname", "type": "string" },
                { "name": "lastname",  "type": "string" },
                { "name": "birthyear", "type": ["null","int"], "default": null },
                { "name": "groups",    "type": { "type": "array", "items": "string" } }
            ]
        }'
    }
INTO users
SELECT
    $id         AS id,
    $firstname  AS firstname,
    $lastname   AS lastname,
    $birthyear  AS birthyear,
    $groups[]   AS groups;

Load Binary Data

The Ocient System loads the binary data format using a fixed record length to split a binary stream into chunks that represent records. Each record is available in the SELECT portion of a pipeline definition using a special binary extract syntax $"[5,8]". This operates similarly to a substring function, beginning at byte 5 and taking 8 bytes from that location. The starting index is a 1-based offset, consistent with other SQL arrays and offsets. You can use this syntax to select specific bytes within a record to parse together as a unit.

Binary Selector

SelectorDescriptionExamples
$"[<start_index>,<length_in_bytes>]"Extract Bytes — Extracts the bytes beginning at the <start_index> and returning <length_in_bytes> total bytes to extract from the start index.
<start_index> is 1-based.
abc123.45xyz
$"[4,6]" -> "123.45"
This example illustrates the operation using ascii characters, but the Binary Selector returns BINARY data, byte arrays, that may be used in special ways by fixed width processing functions described in Binary Data Transformations.
Example The Binary Selector takes 8 bytes starting at offset 11 on the fixed-width binary record. Consistent with SQL functions in the Ocient System, the first argument value 11 is the 1-based offset into the byte array.
$"[11, 8]"
The Binary Selector returns BINARY data, not VARCHAR. Special BINARY transformation functions can operate on this BINARY data. However, if you cast data to the VARCHAR type by using CHAR(), then functions like INT operate on this data as VARCHAR data, not binary data. When you load binary data into VARCHAR columns, the Ocient System automatically converts from binary to character data using the configured CHARSET_NAME before final loading. The Ocient System supports special transformation functions that operate uniquely on binary data. With these functions, you can convert binary representations from mainframe systems such as packed decimals, zoned decimals, big and little endian integers (signed and unsigned), and floating point values. For more details, see Binary Data Transformation Functions. For a complete list of supported options for DELIMITED and CSV data formats, see Binary Extract Options.

Binary Loading Example

If each record in your fixed-width binary schema includes these fields, you can use the SUBSTRING function and the transforms shown in this example.
ColumnStart IndexRecord LengthSource Data TypeOcient Target Data Type
first_name120CharacterVARCHAR
last_name2120CharacterVARCHAR
age414Big Endian signed 4-byte integerINT
total_spent4510Packed DecimalDECIMAL(10,2)
user_id558Little Endian unsigned 8-byte longBIGINT
Each record includes 62 bytes, so the record length RECORD_LENGTH is 62. The encoding of this file is CP500 instead of the default IBM1047 code page. The CREATE PIPELINE SQL statement specifies this encoding.
SQL
CREATE PIPELINE binary_users_pipeline
SOURCE
	S3 ...
FORMAT
	BINARY
	RECORD_LENGTH 62
	CHARSET_NAME 'cp500'
INTO public.users
SELECT
	$"[1, 20]" as first_name,
	$"[21, 20]" as last_name,
	INT($"[41, 4]") as age,
	DECIMAL($"[45, 10]", 'packed', 2) as total_spent,
	BIGINT($"[55, 8]", 'unsigned', 'little') as user_id;

This SQL statement:
  • Uses the BINARY SELECTOR to extract names and load them into the respective columns. The Ocient System automatically decodes the values using cp500 and loads them into a VARCHAR column. An explicit cast such as CHAR($"[1, 20]") as first_name works equivalently.
  • Indicates the extraction of four bytes that represent age from bytes 41-44. The statement instructs the casting of these bytes as an integer INT. This function uses the default endianness (big) and treats the bytes as signed. Unsigned values can overflow target columns because integral types are all signed.
  • Extracts the 10 bytes for total_spent using the BINARY SELECTOR, and converts the values using the packed decimal option for the DECIMAL cast. The casting requires specifying the number of decimal points in the source data. In this case, there are 2 decimal points, which match the number in the target column.
  • Extracts the 8 bytes that represent user_id using the BINARY SELECTOR and casts these bytes to a BIGINT while interpreting the bytes as unsigned with the little endian representation.

Load Delimited and CSV Data

When you load data from delimited or CSV files, the Ocient System tokenizes the data during loading. The system detects records and fields in the input data during pipeline execution. You can reference fields and use them in combination with transformation functions before the system stores values in the column of a target table. Files must be located in these allowed directories:
  • /tmp directory
  • The temporary directory you configure using the streamloader.extractorEngineParameters.tempDir configuration option (default path is /var/opt/ocient/tmp).
  • The directory list specified by the streamloader.extractorEngineParameters.configurationOption.filesystem.access.directories configuration option (default is an empty list).
Files cannot be located in these blocked directories:
  • /etc
  • /bin
  • /sbin
  • /lib
  • /lib64
  • /usr
  • /boot
  • /proc
  • /sys
  • /run
  • /root
  • /var/lib
  • /var/log
  • /var/run
  • /var/cache
Referencing fields of the source data for the formats happens by using a field index. The index is a number that follows the dollar sign $. To maintain consistency with SQL array semantics, the field indexes start at 1. Reference the first field of tokenized source records for the DELIMITED and CSV formats as $1.
For the BINARY format, $0 represents the entire record. In this case, you must specify $0 in combination with the SUBSTRING function to extract specific bytes from the source data.
For a complete list of supported options for DELIMITED and CSV data formats, see Delimited and CSV Extract Options.

Delimited Loading Example

Use this example delimited data.
Text
iphone|60607|viewed|502|293.99|[shopping,news]
For this example row, this table shows the field references for each value in the row.
Field ReferenceValue
$1iphone
$260607
$3viewed
$4502
$5293.99
$6\[]\[‘shopping’,‘news’]
To load this data in a pipeline with the DELIMITED data format, this CREATE PIPELINE statement specifies the | character for the field delimiter. This statement loads data into AWS S3. The SELECT statement uses fields 1, 2, 3, 5, and 6 of the source data. The statement specifies that the system should not load field 4 to the target table. Field 6 is an array of data matching the default array settings for delimited data. You can indicate this with the array brackets like $6[] to load into a CHAR[] typed column. The outer casting functions in this example are optional and shown for completeness. If they are omitted, the pipeline automatically casts the source fields to the target column type.
SQL
CREATE PIPELINE delimited_pipeline
SOURCE s3
    ...
FORMAT delimited
    FIELD_DELIMITERS ['|']
    ...
SELECT
    CHAR($1) as device_model,
    INT($2) as zip,
    INT($3) as amount,
    DOUBLE($5) as price,
    CHAR[]($6[]) as categories;

Load JSON Data

The data pipeline syntax enables the load of JSON data, including nested scalars, arrays, and points ST_POINT.
Strict Loading and TransformationsWhen you use transformation functions, remember that the Ocient System treats all data in JSON and DELIMITED format as text data, not the logical data type.For example, if you specify the JSON string { "my_timestamp": 1709208000000 }, the selector $my_timestamp returns the string "1709208000000" and not the integer 1709208000000.As a result, if you cast this data into a timestamp column, such as TIMESTAMP($my_timestamp) as created_at, the Ocient System returns an error. The conversion fails because the cast function assumes you are specifying TIMESTAMP(VARCHAR), which assumes a format like YYYY-MM-DD HH🇲🇲ss[.SSSSSSSSS].To correct this issue, cast the value explicitly to make use of the TIMESTAMP(BIGINT) function that treats the argument as milliseconds after the epoch as in TIMESTAMP(BIGINT($my_timestamp)) as created_at.

Supported JSON Selectors

JSON selectors consist of $ followed by a dot-separated list of JSON keys. If a key refers to an array, it is followed by a set of brackets [] to correspond to its dimensionality. If the square brackets contain an index, like [1], then the selector refers to an array element.
The Ocient System treats JSON selectors as lowercase. To use case-sensitive selectors, you must enclose the selector in double quotation marks. For example, $"testSelector". With case-sensitive selectors having multiple JSON keys, each key needs double quotation marks. For example, $"testData"."Responses"."SuccessResponse".
For special characters (any identifier that starts with any character other than a letter or contains any character that is not a letter, number, or an underscore) or reserved SQL keywords (such as SELECT), you must enclose such selectors in double quotation marks. For example, if you have a JSON document { "test-field": 123 }, then the selector for the query should be $"test-field". The Ocient System does not support identifiers with a backslash as the last character in the key name. This table shows the selector and provides its description. The cells in the last column of the table show an example for each selector. First, the cell shows example data in JSON format. Then, the cell shows the use of the selector and its output after the arrow.
SelectorDescriptionExamples
$aScalar — Refers to key a at the top of a JSON document, where the value of a is a scalar.{ "first_name": "Chris" }

$first_name -> "Chris"
$a[]Array — Refers to key a at the top of a JSON document, where the value of a is a one-dimensional array.{ "names": ["Chris", "Joe"] }
$names[] -> ["Chris","Joe"]
$a[1]Array Element Selection — Selects a single element from the array a at the top of a JSON document, where the value of a is a one-dimensional array. Also works on multi-dimensional arrays. Indexes start at 1.{ "names": ["Chris", "Joe"] }
$names[2] -> "Joe"
$a[][]Multi-dimensional Array — Refers to key a at the top of a JSON document, where the value of a is a two-dimensional array.{ "coordinates": [ [87.31, 18.25],[65.22, 19.41]] }
$coordinates[][] -> [[87.31,18.25],[65.22,19.41]]
$a.b.cNested Selector — Drills into the value at key a, then value at key b, and refers to the value at key c, where none of the values are arrays.{ "user": { "name": "Chris" } }
$user.name -> "Chris"
$a[].bArray Projection — A projection applies a JSON selector on all elements in an array, returning an array.
Array brackets in the notation indicate which items are the arrays.
The a[] is an array. The b selector is applied to each element in the array a.
{
"users": [
{
"name": "Chris",
"orders": [
{ "order_id": 1,
"subtotal": 19.29 },
{ "order_id": 2,
"subtotal": 16.71 }
]
},
{
"name": "Joe",
"orders": [
{ "order_id": 3,
"subtotal": 17.29 },
{ "order_id": 4,
"subtotal": 22.49 }
]
}
]
}
$users[].name -> ["Chris", "Joe"]
$a.b[].c.d[]Multi-level Array Projection — Applies selectors to each element in the specified arrays and allows multiple levels of objects and arrays for the selection of an N-dimensional array of values.
The values of b and d are arrays, resulting in two-dimensional arrays. The selector c.d[] is projected into the elements in the array b.
{
"users": [
{
"name": "Chris",
"orders": [
{ "order_id": 1,
"subtotal": 19.29 },
{ "order_id": 2,
"subtotal": 16.71 }
]
},
{
"name": "Joe",
"orders": [
{ "order_id": 3,
"subtotal": 17.29 },
{ "order_id": 4,
"subtotal": 22.49 }
]
}
]
}
$users[].orders[].order_id -> [[1,2],[3,4]]
$a.b[1].c.d[]Array Slice — Selects the first element from the array b and drills into c and then the array d. JSON array selectors have an index that starts at 0.{
"users": [
{
"name": "Chris",
"orders": [
{ "order_id": 1,
"subtotal": 19.29 },
{ "order_id": 2,
"subtotal": 16.71 }
]
},
{
"name": "Joe",
"orders": [
{ "order_id": 3,
"subtotal": 17.29 },
{ "order_id": 4,
"subtotal": 22.49 }
]
}
]
}
$users[1].orders[].subtotal -> [17.29,22.49]
$a.{b,c}Tuple Selector — Selects the values from an object as a tuple. You must cast the values to an appropriate TUPLE data type.{ "user": { "first_name": "John", "last_name": "Doe" }}
$user.{first_name,last_name} -> <<"John","Doe">>
$a[].{b,c}Array of Tuples Selector — Selects the values from an array of objects as an array of tuples. You must cast the values to an appropriate TUPLE[] data type.{ "users": [ { "first_name": "John", "last_name": "Doe" }, { "first_name": "Steve", "last_name": "Smith" } ]}
$users[].{first_name,last_name} -> [<<"John","Doe">>,<<"Steve","Smith">>]
$a[_]Flatten Array Selector — Selects the values of an array and flattens the dimension indicated by the underscore by one level.{
a: {
b: [
{c: [1,2,3] },
{c: [4,5,6] }
]
}
}
$a.b[_].c -> [1,2,3,4,5,6]
$a[!]Compact Array Selector — Selects the values of an array and removes any NULL values from the array at this level in the array.{
a: {
b: {
c: [1,null,2,null,3]
}
}
}
$a.b.c[!] -> [1,2,3]
For more examples of using JSON selectors in data pipelines, see JSON Selectors Examples in Data Pipelines.

NULL and Empty Handling for JSON Scalars

The Ocient System handles all JSON NULL, empty, and missing values in the same way. The system loads these values as NULL. These values fail to load into non-nullable columns. Provide an explicit default in the pipeline using IF_NULL or COALESCE or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL values.

NULL and Empty Handling for JSON Arrays

The Ocient System handles NULL, empty, and missing values the same way for arrays as for scalars. The system converts a value that is NULL, empty, or missing to NULL and loads it as NULL. Provide an explicit default in the pipeline or use the COLUMN_DEFAULT_IF_NULL option to accept the configured column default instead of attempting to load NULL.

NULL and Empty Handling for JSON Tuples

All the rules for handling NULL, empty, and missing elements that apply to scalars and arrays also apply to tuples. If any part of the selector is NULL, empty, or missing, data pipeline loading converts that value to NULL. Additionally, because you can apply functions to tuple elements (and not array elements), you can use the NULL_IF function to convert a tuple element to NULL. For example, tuple<<char,varchar>>($a.name, NULL_IF($a.hometown, 'N/A') ) indicates to the pipeline that the string 'N/A' signifies NULL for the hometown element but not for the name element.

Load Parquet Data

The data pipeline functionality enables loading Parquet files with this configuration. File Configuration
  • Files should have row groups of less than 128 MB. Larger row groups can impact memory usage during loading, and row groups of 512 MB can cause loading failures on 1 TB or more data sets.
  • Encoding fields in a Parquet file reduces the space of the file on disk but can impact memory usage during loading. Enable encoding on fields that you expect to have less than 256 unique values and for fields that contain short strings. You do not have to encode other fields.
Multiple Files
  • You can load row groups of multiple Parquet files in parallel. For large data sets, load the data set as multiple files.
  • Loading files with differing schemas is not supported.
Use selectors as you do when loading JSON data to specify data to load.
You must select a leaf element, an array, or a tuple with your selector. This is stricter than using JSON selectors, which can directly select array fields and JSON object fields.Example: {"a": [1,2,3], "b": {"c": 1}}You can extract with any of the selectors in JSON: $a, $a[], $b, $b.cHowever, Parquet only allows for the selectors: $a[], $b.cThis example assumes this schema:
// List<String> (list non-null, elements nullable)
required group my_list (LIST) {
  repeated group list {
    optional binary element (UTF8);
  }
}
The selector must be $my_list[], which includes the array syntax.
For details, see Parquet Selectors Examples in Data Pipelines.
When you use the FORMAT PARQUET option with an AWS S3 source, the ENDPOINT option is required in the CREATE PIPELINE SQL statement.
Auto-casting in Parquet does not support the automatic conversion to VARCHAR columns. You must explicitly cast data to the CHAR data type when you convert Parquet data that is not string data to a VARCHAR column or VARCHAR function argument.
The Ocient System treats Parquet selectors as lowercase. To use case-sensitive selectors, you must enclose the selector in double quotation marks. For example, $"testSelector".

Schema Evolution

The Ocient System supports schema evolution when you load a set of Parquet files. Specifically, if the pipeline selects a set of Parquet files where an individual file might have more or fewer columns than another, the system attempts to merge those schemas together to support loading without requiring you to create the pipeline again. For example, the test_table table has three columns.
SQL
CREATE TABLE test_table (
  col_a INT NULL,
  col_b VARCHAR NULL,
  col_c VARCHAR NULL
);
You have two Parquet files with these schemas:
Text
message file1_schema {
  OPTIONAL INT32 col_a;
  OPTIONAL BYTE_ARRAY col_b (UTF8);
}

message file2_schema {
  OPTIONAL BYTE_ARRAY col_b (UTF8);
  OPTIONAL BYTE_ARRAY col_c (UTF8);
}
However, you must specify how to handle the schema evolution within the EXTRACT SQL statement. You can choose to sample the first file only for its schema or sample the entire data set to merge the schemas together. This DDL statement samples on one file.
SQL
EXTRACT
  FORMAT parquet
  SCHEMA (INFER_FROM sample_file)
The disadvantage is that sampling multiple files can potentially take a long time (scaling with the number of files in the data set) when you execute the CREATE PIPELINE and START PIPELINE SQL statements. If you know that all of the Parquet files have the same schema, use this syntax. The Ocient System does not support the case where a column within the schema changes type. For example, if col_a is an INT type in one file and a VARCHAR type in another. The default behavior of schema evolution infers the schema from one file. Use this syntax to infer from one file.
SQL
EXTRACT
  FORMAT parquet
  SCHEMA (INFER_FROM sample_file)

Parquet Type Mapping

Parquet data types are separated into primitive and logical types. The Ocient System converts these types to Ocient SQL types. See these tables for the respective conversions.
Parquet Primitive TypeOcient SQL Type
BOOLEANBOOLEAN
INT32INT
INT64BIGINT
INT96TIMESTAMP
FLOATFLOAT
DOUBLEDOUBLE
BYTE_ARRAYVARCHAR
FIXED_LEN_BYTE_ARRAYVARCHAR
Parquet Logical TypesOcient SQL Type
STRINGVARCHAR
UTF8VARCHAR
ENUMVARCHAR
UUIDUUID
INT8TINYINT
INT16SMALLINT
INT32INT
INT64BIGINT
UINT8SMALLINT
UINT16INT
UINT32BIGINT
UINT64BIGINT
DECIMALDECIMAL
DATEDATE
TIMETIME
TIME_MILLISTIME
TIME_MICROSTIME
TIMESTAMPTIMESTAMP
TIMESTAMP_MILLISTIMESTAMP
TIMESTAMP_MICROSTIMESTAMP
DURATIONBIGINT
JSONVARCHAR
BSONVARCHAR
  • The INTERVAL data type is not supported.
  • The UINT64 data type can overflow the BIGINT conversion.
  • The DURATION data type conversion to BIGINT preserves the underlying units. For example, the number of microseconds stays as microseconds in the BIGINT data type.
Further, Parquet contains nested types that the Ocient System also converts to SQL types, as shown in this table.
Parquet Nested TypesOcient SQL Type
LISTTYPE[]
TUPLETUPLE

Parquet Loading Example

Create a data pipeline that loads Parquet files using an AWS S3 bucket. Specify the bucket, endpoint, access key identifier, secret access key, and filter options to find all Parquet files in the specified file path. Use the parquet_base_table table to store the loaded data. Retrieve the integer, text, floating point, double, integer, JSON, and BSON fields.
SQL
CREATE PIPELINE testpipeline
    SOURCE S3
        BUCKET 'testbucket'
        ENDPOINT 'https://endpoint'
        ACCESS_KEY_ID ''
        SECRET_ACCESS_KEY ''
        FILTER_GLOB '/data/*/2024/*/*.parquet'
        PREFIX '/data/orders/2024/11/'
    EXTRACT
        FORMAT PARQUET
INTO parquet_base_table
SELECT $int32_field AS int32_field,
    $utf8_field as utf8_field,
    $float_field as float_field,
    $double_field as double_field,
    $int64_field as int64_field,
    $json_field as json_field,
    $bson_field as bson_field;

Parquet File-Partitioned Data

With Parquet, you can load file-partitioned data from Parquet files using the file path structure. Use the filter set in the file path using the naming standards. Assume files with these file paths.
Shell
s3://data/orders/2024/11/dt=2024-11-24/file.parquet
.../dt=2024-11-25/file.parquet
.../dt=2024-11-26/file.parquet
...
Load the data values in the Parquet file partitions using the METADATA function with the Hive partition syntax and the specified partition key dt from the file paths.
SQL
SELECT int32_field,
    utf8_field,
    float_field,
    double_field,
    int64_field,
    json_field,
    bson_field,
    METADATA('hive_partition','dt') AS file_date;
For details about this syntax, see Load Metadata and File-Based Partitioned Data in Data Pipelines.

Load XML Data

You can load data in XML format into the Ocient System. The system supports XML tags, basic elements, nested elements, and CDATA, but it does not support XML arrays and attributes.

Supported XML Selectors

Like JSON selectors, you can select XML data using the $ symbol followed by a list of dot-separated JSON keys.
The system treats JSON selectors as lowercase. To use case-sensitive selectors, you must enclose the selector in double quotation marks, such as $"testSelector". With case-sensitive selectors having multiple JSON keys, each key needs double quotation marks, such as $"testData"."Responses"."SuccessResponse".
The system does not support JSON array and tuple selectors.

XML Loading Example

Assume an XML file with this data.
XML
<root>
  <person>
    <name>Barbara Smith</name>
    <address>
      <city>New York</city>
      <zip>12345</zip>
    </address>
    <note><![CDATA[Personal IP <b>127.0.0.1</b>]]></note>
  </person>
</root>
Create a table to contain the IP address record:
  • name — Name
  • city — Name of the city
  • zip — Zip code
  • personal_ip — IP address
SQL
CREATE TABLE example_xml(
    name VARCHAR NOT NULL,
    city VARCHAR NOT NULL,
    zip VARCHAR NOT NULL,
    personal_ip IPV4 NOT NULL
);
Create a data pipeline xml_pipeline that loads the XML file test.xml using an AWS S3 bucket. Specify the bucket, endpoint, access key identifier, secret access key, and filter options to find the XML file in the specified path. Use the example_xml table to store the loaded data. Use JSON selectors to parse the file and data in each XML tag. The system parses the CDATA section in the note tag as the literal string Personal IP <b>127.0.0.1</b>. Use the SUBSTRING function to extract the IP address and then transform it into an IPV4 type with the IPV4 function. For details, see the SUBSTRING and IPV4 functions.
SQL
CREATE PIPELINE xml_pipeline
    SOURCE S3
        BUCKET 'testbucket'
        ENDPOINT 'https://endpoint'
        ACCESS_KEY_ID ''
        SECRET_ACCESS_KEY ''
        FILTER_GLOB '/data/text.xml'
    EXTRACT
        FORMAT XML
INTO example_xml
SELECT
    $"root"."person"."name" as name,
    $"root"."address"."city" as city,
    $"root"."address"."zip" as zip,
    IPV4(SUBSTRING($"root"."person"."note", 16, 9)) as personal_ip;
JSON Selectors Examples in Data Pipelines Data Pipelines Reference Load Metadata and File-Based Partitioned Data in Data Pipelines
Last modified on May 27, 2026