Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.ocient.com/llms.txt

Use this file to discover all available pages before exploring further.

The data pipeline functionality enables you to load data in the data format. Use these examples to see how you can access the structure in different ways. For an overview of loading Parquet data, see Data Pipeline Load of Parquet Data from S3.

Work with Parquet Selectors

The SELECT statement for a Parquet file uses the same syntax as the existing JSON selectors. This selection applies to lists, structures, and maps. The System retrieves its list of requested columns from the SELECT SQL statement. The columns correspond to the field names of the schema for the Parquet files for the load. The load retrieves required columns from the Parquet source files and leaves the other columns alone. This example snippet shows a list of ["id", "name", "array[]", "a[].project", "my_map.key_name"] as the fields to select from the data. The statement only retrieves the columns for ["id", "name", "array", "a", "my_map"] in the row groups.
SQL
SELECT
    $id                         as id,
    $name                       as name,
    $array[]                    as array,
    $a[].project                as projection,
    $my_map.key_name            as key_col
...
You must select a leaf element, an array, or a tuple with your selector. This is stricter than using JSON selectors, which can directly select array fields and JSON object fields.Example: {"a": [1,2,3], "b": {"c": 1}}You can extract with any of the selectors in JSON: $a, $a[], $b, $b.cHowever, Parquet only allows for the selectors: $a[], $b.cThis example assumes this schema:
// List<String> (list non-null, elements nullable)
required group my_list (LIST) {
  repeated group list {
    optional binary element (UTF8);
  }
}
The selector must be $my_list[], which includes the array syntax.

Work with Complex Parquet Selectors

These examples show how to load a Parquet list and convert it to an array and how to load a map and convert it to a tuple.

Load List as an Array Example

Create the tips table to store the tip inquiries a city receives. The columns are:
  • id — Identifier as an integer
  • categories — Categories as a list of strings
  • mayor — Mayor as a string
  • mentions — Mentions as a list of strings
  • service_name — Service name as a non-nullable string
  • tags — Tags as a list of strings
  • url — URL as a string
  • username — Username as a string
  • text — Text description as a non-nullable string
  • venue_name — Venue name as a non-nullable string
SQL
CREATE TABLE tips(
    id INT NOT NULL,
    categories VARCHAR(255)[],
    mayor VARCHAR(255),
    mentions VARCHAR(255)[],
    service_name VARCHAR(255) NOT NULL,
    tags VARCHAR(255)[],
    url VARCHAR(255),
    username VARCHAR(255),
    text VARCHAR(255) NOT NULL,
    venue_name VARCHAR(255) NOT NULL);
Create the tips_pipeline data pipeline for loading the tip data. Specify the S3 source with bucket ocient-docs, object key metabase_samples/parquet/tips.parquet, and endpoint https://s3.us-east-1.amazonaws.com. The source format is Parquet.
SQL
CREATE PIPELINE tips_pipeline
    SOURCE S3
        BUCKET 'ocient-docs'
        OBJECT_KEY 'metabase_samples/parquet/tips.parquet'
        ENDPOINT 'https://s3.us-east-1.amazonaws.com'
    EXTRACT
        FORMAT parquet
INTO tips
SELECT
    $id AS id,
    $source.categories[] AS categories,
    $source.mayor AS mayor,
    $source.mentions[] AS mentions,
    $source."service" AS service_name,
    $source.tags[] AS tags,
    $source.url AS url,
    $source.username AS username,
    $text AS text,
    $venue.name AS venue_name;
Start the data pipeline. The data pipeline loads the categories, mentions, and tags columns as arrays.
SQL
START PIPELINE tips_pipeline;

Load Map as a Tuple Example

This example uses the file /tmp/to-load/locations.parquet that contains a MAP<STRING, DOUBLE> field named coordinates. Each value of the map field contains x and y entries for the coordinates. Create the locations table to store the map values. The cartesian_coords column is a tuple with two elements for the x and y coordinates.
SQL
CREATE TABLE locations(
    cartesian_coords TUPLE<<DOUBLE, DOUBLE>> NOT NULL);
Create the locations_pipeline data pipeline for loading the maps data. Specify the filter for the Parquet file /tmp/to-load/locations.parquet. The source format is Parquet. Select the x and y coordinates from the map.
SQL
CREATE PIPELINE locations_pipeline
    SOURCE filesystem
        FILTER '/tmp/to-load/locations.parquet'
    EXTRACT
        FORMAT parquet
INTO locations
SELECT
    TUPLE($coordinates.x, $coordinates.y) AS cartesian_coords;
Start the data pipeline. The data pipeline loads the cartesian_coords column as a tuple.
SQL
START PIPELINE locations_pipeline;
Data Types for Data Pipelines Data Formats for Data Pipelines Data Pipeline Load of JSON Data from Kafka JSON Selectors Examples in Data Pipelines Data Pipeline Load of Parquet Data from S3 Monitor Data Pipelines
Last modified on May 27, 2026