Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.ocient.com/llms.txt

Use this file to discover all available pages before exploring further.

A common setup for loading files in a batch into the System is to load from a bucket on S3 with time-partitioned data. Often, you must perform a batch load repeatedly to load new files. The Ocient System uses data pipelines to transform each document into rows in one or more different tables. The loading and transformation capabilities use a simple SQL-like syntax for transforming data. This tutorial guides you through a simple example load using a small data set in format. The data in this example comes from a test set for the Business Intelligence tool.

Parquet Loading Recommendations

Follow this set of recommendations for an optimal loading experience of Parquet files. File Configuration
  • Files should have row groups of less than 128 MB. Larger row groups can impact memory usage during loading, and row groups of 512 MB can cause loading failures on 1 TB or more data sets.
  • Encoding fields in a Parquet file reduces the space of the file on disk but can impact memory usage during loading. Enable encoding on fields that you expect to have less than 256 unique values and for fields that contain short strings. You do not have to encode other fields.
Multiple Files
  • You can load row groups of multiple Parquet files in parallel. For large data sets, load the data set as multiple files.
  • Loading files with differing schemas is not supported.

Prerequisites

This tutorial assumes that:
  1. The Ocient System has network access to S3 from the Loader Nodes.
  2. An Ocient System is installed and configured with an active Storage Cluster. For details, see Ocient Application Configuration.

Parquet Loading Example

Follow these steps to load Parquet data into the Ocient System.

Step 1: Create a New Database

Connect to a SQL Node using the Commands Supported by the Ocient JDBC CLI Program. Then, execute the CREATE DATABASE SQL statement to create the metabase database.
SQL
CREATE DATABASE metabase;

Step 2: Create a New Table in the Database

Create the orders table in the new database. First, connect to that database (e.g., connect to jdbc:ocient://sql-node:4050/metabase), and then execute this CREATE TABLE SQL statement that specifies to create a table with these columns and a clustering index based on the user_id and product_id columns:
  • created_at as a timestamp that is not nullable.
  • id, user_id, and product_id as integers that are not nullable.
  • subtotal, tax, total, and discount as floating point numbers.
  • quantity as an integer.
SQL
CREATE TABLE public.orders (
    created_at TIMESTAMP TIME KEY BUCKET(30, DAY) NOT NULL,
    id INT NOT NULL,
    user_id INT NOT NULL,
    product_id INT NOT NULL,
    subtotal DOUBLE,
    tax DOUBLE,
    total DOUBLE,
    discount DOUBLE,
    quantity INT,
    CLUSTERING INDEX idx01 (user_id, product_id)
);
The database creates the orders table, and you can begin loading data.

Step 3: Preview and Create a Data Pipeline

Create data pipelines using the CREATE PIPELINE SQL statement. To load data, you first create a pipeline with the definition of the source, data format, and transformation rules using a SQL-like declarative syntax. Then, you execute the START PIPELINE SQL statement to start the load. You can observe progress and status using system catalog tables and views. Each Ocient pipeline defines a single data source and the target table or tables into which data loads. A data source includes the location of the source and filters on the source to define the specific data set to load. This tutorial loads data from two data sources, where each source is located in a directory on the same S3 bucket. First, inspect the data that you plan to load. Each document has a JSON format similar to this example.
JSON
{"id": 1, "user_id": 1, "product_id": 14, "subtotal": 37.65, "tax": 2.07, "total": 39.72, "discount": null, "created_at": "2019-02-11T21:40:27.892Z", "quantity": 2}
{"id": 2, "user_id": 1, "product_id": 123, "subtotal": 110.93, "tax": 6.1, "total": 117.03, "discount": null, "created_at": "2018-05-15T08:04:04.580Z", "quantity": 3}

Inspecting the file with Pandas, you can see this schema with details.
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   id          18760 non-null  int64
 1   user_id     18760 non-null  int64
 2   product_id  18760 non-null  int64
 3   subtotal    18760 non-null  float64
 4   tax         18760 non-null  float64
 5   total       18760 non-null  float64
 6   discount    1915 non-null   float64
 7   created_at  18760 non-null  datetime64[ns, UTC]
 8   quantity    18760 non-null  int64
In this case, Ocient automatically transforms the data to the target columns using some sensible conventions. In other cases, loads require some transformation. Most transformations are identical to functions that already exist in the SQL syntax of the Ocient System. Prior to creating a pipeline, you can use the PREVIEW PIPELINE SQL statement to create your pipeline iteratively. This statement returns a result set that shows the final values that would be loaded but does not load the data into the target table. Preview the orders_pipeline pipeline for the orders data set from your database connection prompt. Use the S3 data source with endpoint https://s3.us-east-1.amazonaws.com, bucket ocient-docs, and filter metabase_samples/parquet/orders.parquet. Specify the parquet format. Load the data into the public.orders table. The SELECT part of the SQL statement maps the fields in the Parquet file to the target columns in the created table. In this example, limit the result set to the first five records in the data source.
SQL
PREVIEW PIPELINE orders_pipeline
SOURCE
    S3
        ENDPOINT 'https://s3.us-east-1.amazonaws.com'
        BUCKET 'ocient-docs'
        FILTER 'metabase_samples/parquet/orders.parquet'
    LIMIT 5
EXTRACT
    FORMAT parquet
INTO public.orders
SELECT
    $id as id,
    $user_id as user_id,
    $product_id as product_id,
    $subtotal as subtotal,
    $tax as tax,
    $total as total,
    $discount as discount,
    $created_at as created_at,
    $quantity as quantity;
Output
id         user_id    product_id subtotal              tax                   total                 discount              created_at                   quantity
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
1          1          14         37.65                 2.07                  39.72                 NULL                  2019-02-11 21:40:27.892000000 2
8          1          123        110.93                6.1                   117.03                NULL                  2018-05-15 08:04:04.580000000 3
8          1          105        52.72                 2.9                   49.2                  6.42                  2019-12-06 22:22:48.544000000 2
8          1          94         109.22                6.01                  115.23                NULL                  2019-08-22 16:30:42.392000000 6
8          1          132        127.88                7.03                  134.91                NULL                  2018-10-10 03:34:47.309000000 5

Fetched 5 rows
With this preview, you can confirm that the results of the pipeline match your requirements. If there is an issue, you can update the statement and run the PREVIEW PIPELINE statement again until it meets your needs. Next, create the pipeline named orders_pipeline for the orders data set from your database connection prompt. The pipeline has three main sections:
  • SOURCE — Loads data from S3. Set the S3 endpoint, bucket name, and filter for the Parquet files.
  • EXTRACT — Sets the format to Parquet.
  • INTO ... SELECT — Targets the public.orders table and selects the chosen fields from the Parquet records. In this case, all data is available at the top level of the Parquet records, so the example references the fields by the attribute name (e.g., $id, $user_id, etc.). For nested data, reference the nested fields using dot notation (e.g., $order.user.first_name). Each field maps to a target column using the as syntax.
CREATE PIPELINE orders_pipeline
SOURCE
    S3
        ENDPOINT 'https://s3.us-east-1.amazonaws.com'
        BUCKET 'ocient-docs'
        FILTER 'metabase_samples/parquet/orders.parquet'
EXTRACT
    FORMAT parquet
INTO public.orders
SELECT
    $id as id,
    $user_id as user_id,
    $product_id as product_id,
    $subtotal as subtotal,
    $tax as tax,
    $total as total,
    $discount as discount,
    $created_at as created_at,
    $quantity as quantity;
After you successfully create this pipeline, execute the START PIPELINE SQL statement to start the load.
START PIPELINE orders_pipeline;

Step 4: Observe the Load Progress

With your pipeline running, data begins to load immediately from the S3 files that you defined. If there are many files in each file group, the load process first sorts the files into batches, partitions them for parallel processing, and assigns them to Loader Nodes. You can check the pipeline status and progress by querying the information_schema.pipeline_status system catalog table or executing the SHOW PIPELINE_STATUS SQL statement.
SHOW PIPELINE_STATUS;
Output
database_name         pipeline_name       table_names          status      status_message                               percent_complete duration_seconds files_processed     files_failed        files_remaining     records_processed   records_loaded      records_failed
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
metabase              orders_pipeline     ["public.orders"]    RUNNING     Started processing pipeline orders_pipeline  0.0              2.025824         0                   0                   1                   0                   0                   0
After the status of the pipeline changes to COMPLETED, all data is available in the target table. After a few seconds, the data is available for query in the public.orders table.
SELECT COUNT(*) FROM public.orders;
Output
count(*)
--------------------
18760
You can drop the pipeline with the DROP PIPELINE orders_pipeline; SQL statement. Execution of this statement leaves the data in your target table, but removes metadata about the pipeline execution from the system. Data Pipelines Reference Load JSON Data Data Types for Data Pipelines Transform Data in Data Pipelines Manage Errors in Data Pipelines
Last modified on May 27, 2026