Skip to main content
The data pipeline functionality enables the loading of data from . You can load various file types stored in HDFS into the System. The Ocient System uses data pipelines to transform each document into rows in one or more different tables. The loading and transformation capabilities use a simple SQL-like syntax for transforming data. This tutorial guides you through a simple example of loading data in the JSON format.

HDFS Advanced Source Options

Besides the specified options in the HDFS Source Options section, you can also specify these options that are available for file system sources:
  • COMPRESSION_METHOD
  • START_FILENAME
  • END_FILENAME
  • START_CREATED_TIMESTAMP
  • END_CREATED_TIMESTAMP
  • START_MODIFIED_TIMESTAMP
  • END_MODIFIED_TIMESTAMP
  • SORT_BY
  • SORT_DIRECTION
  • SORT_REWRITE
For the CONFIG source option, the most useful properties are the ones with the dfs.client prefix. For the full property reference, see HDFS Default and Core Default. Only properties that affect client connections and read operations apply to data pipeline loading. This table describes some common properties.
Property NameProperty Description
dfs.client.use.datanode.hostnameSet this property to true when you have datanodes and namenodes in the HDFS cluster configured to connect to each other using local IP addresses, but loading occurs externally and requires externally valid hostnames instead. The default value is false.
dfs.client.retry.max.attemptsConfigure to change the number of times the loading process retries network requests to namenodes before throwing an error. You can increase this value when network connections are unstable. The default value is 10.
dfs.client.socket-timeoutConfigure to change the timeout (in milliseconds) for all sockets used internally in HDFS loading. You can increase this value when network connections are unstable. The default value is 60000 (60 seconds).

HDFS Loading Example

Follow these steps to load JSON data from an HDFS source into the Ocient System.

Step 1: Create a New Database

Connect to a SQL Node using the Commands Supported by the Ocient JDBC CLI Program. Then, execute the CREATE DATABASE SQL statement to create the geo database.
SQL
CREATE DATABASE geo;

Step 2: Create a New Table in the Database

Create the locations table in the public schema to store location data with a name, zip code, and a point with latitude and longitude:
  • name — Location name as a not nullable string
  • zipcode — Zip code as an integer
  • location — Location latitude and longitude as a point
SQL
CREATE TABLE public.locations (
    name VARCHAR(255) NOT NULL,
    zipcode INT NOT NULL,
    location POINT
);

Step 3: Preview and Create a Data Pipeline

Preview the locations data pipeline to load JSON data from HDFS. This data pipeline uses the HDFS endpoint hdfs-namenode:9000, which consists of the namenode and port number, and the /locations/2026/**/*.json filter to load data from JSON files. The pipeline selects the name, zip code, and point data. For the point data construction, see ST_POINT.
SQL
PREVIEW PIPELINE locations
  SOURCE hdfs
    ENDPOINT 'hdfs-namenode:9000'
    FILTER '/locations/2026/**/*.json'
  EXTRACT
    FORMAT json
  INTO locations
  SELECT
    $name AS name,
    $zipcode AS zipcode,
    ST_POINT($longitude, $latitude) AS location;
Create the locations data pipeline to load JSON data from HDFS.
SQL
CREATE PIPELINE locations
  SOURCE hdfs
    ENDPOINT 'hdfs-namenode:9000'
    FILTER '/locations/2026/**/*.json'
  EXTRACT
    FORMAT json
  INTO locations
  SELECT
    $name AS name,
    $zipcode AS zipcode,
    ST_POINT($longitude, $latitude) AS location;
After you successfully create the locations pipeline, execute the START PIPELINE SQL statement to start the data pipeline.
START PIPELINE orders_pipeline;

Step 4: Observe the Load Progress

With your pipeline running, data begins to load immediately from the JSON files. If there are many files in each file group, the load process first sorts the files into batches, partitions them for parallel processing, and assigns them to Loader Nodes. You can check the pipeline status and progress by querying the information_schema.pipeline_status system catalog table or executing the SHOW PIPELINE_STATUS SQL statement.
SHOW PIPELINE_STATUS;
Output
database_name         pipeline_name       table_names             status      status_message                               percent_complete duration_seconds files_processed     files_failed        files_remaining     records_processed   records_loaded      records_failed
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
geo                   locations           ["public.locations"]    RUNNING     Started processing pipeline locations        0.0              2.025824         0                   0                   1                   0                   0                   0
After the status of the pipeline changes to COMPLETED, all data is available in the target table. After a few seconds, the data is available for query in the public.locationstable.
SELECT COUNT(*) FROM public.locations;
Output
count(*)
--------------------
25605
You can drop the pipeline with the DROP PIPELINE locations; SQL statement. Execution of this statement leaves the data in your target table, but removes metadata about the pipeline execution from the system. Data Pipelines Reference Load JSON Data Data Types for Data Pipelines Transform Data in Data Pipelines Manage Errors in Data Pipelines
Last modified on May 27, 2026