> ## Documentation Index
> Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Data Pipeline Load of JSON Data from HDFS

export const Ocient = "Ocient®";

export const HDFS = "Apache® Hadoop® Distributed File System (HDFS)";

The data pipeline functionality enables the loading of data from {HDFS}. You can load various file types stored in HDFS into the {Ocient} System.

The Ocient System uses data pipelines to transform each document into rows in one or more different tables. The loading and transformation capabilities use a simple SQL-like syntax for transforming data. This tutorial guides you through a simple example of loading data in the JSON format.

## HDFS Advanced Source Options

Besides the specified options in the [HDFS Source Options](/data-pipelines#hdfs-source-options) section, you can also specify these options that are available for file system sources:

* COMPRESSION\_METHOD
* START\_FILENAME
* END\_FILENAME
* START\_CREATED\_TIMESTAMP
* END\_CREATED\_TIMESTAMP
* START\_MODIFIED\_TIMESTAMP
* END\_MODIFIED\_TIMESTAMP
* SORT\_BY
* SORT\_DIRECTION
* SORT\_REWRITE

For the `CONFIG` source option, the most useful properties are the ones with the `dfs.client` prefix. For the full property reference, see [HDFS Default](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml) and [Core Default](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.html). Only properties that affect client connections and read operations apply to data pipeline loading. This table describes some common properties.

| **Property Name**                  | **Property Description**                                                                                                                                                                                                                                         |
| ---------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `dfs.client.use.datanode.hostname` | Set this property to `true` when you have datanodes and namenodes in the HDFS cluster configured to connect to each other using local IP addresses, but loading occurs externally and requires externally valid hostnames instead. The default value is `false`. |
| `dfs.client.retry.max.attempts`    | Configure to change the number of times the loading process retries network requests to namenodes before throwing an error. You can increase this value when network connections are unstable. The default value is 10.                                          |
| `dfs.client.socket-timeout`        | Configure to change the timeout (in milliseconds) for all sockets used internally in HDFS loading. You can increase this value when network connections are unstable. The default value is 60000 (60 seconds).                                                   |

## HDFS Loading Example

Follow these steps to load JSON data from an HDFS source into the Ocient System.

### Step 1: Create a New Database

Connect to a SQL Node using the [Commands Supported by the Ocient JDBC CLI Program](/commands-supported-by-the-ocient-jdbc-cli-program). Then, execute the `CREATE DATABASE` SQL statement to create the `geo` database.

```sql SQL theme={null}
CREATE DATABASE geo;
```

### Step 2: Create a New Table in the Database

Create the `locations` table in the `public` schema to store location data with a name, zip code, and a point with latitude and longitude:

* `name` — Location name as a not nullable string
* `zipcode` — Zip code as an integer
* `location` — Location latitude and longitude as a point

```sql SQL theme={null}
CREATE TABLE public.locations (
    name VARCHAR(255) NOT NULL,
    zipcode INT NOT NULL,
    location POINT
);
```

### Step 3: Preview and Create a Data Pipeline

Preview the `locations` data pipeline to load JSON data from HDFS. This data pipeline uses the HDFS endpoint `hdfs-namenode:9000`, which consists of the namenode and port number, and the `/locations/2026/**/*.json` filter to load data from JSON files. The pipeline selects the name, zip code, and point data. For the point data construction, see [ST\_POINT](/point-constructors#st_point).

```sql SQL theme={null}
PREVIEW PIPELINE locations
  SOURCE hdfs
    ENDPOINT 'hdfs-namenode:9000'
    FILTER '/locations/2026/**/*.json'
  EXTRACT
    FORMAT json
  INTO locations
  SELECT
    $name AS name,
    $zipcode AS zipcode,
    ST_POINT($longitude, $latitude) AS location;
```

Create the `locations` data pipeline to load JSON data from HDFS.

```sql SQL theme={null}
CREATE PIPELINE locations
  SOURCE hdfs
    ENDPOINT 'hdfs-namenode:9000'
    FILTER '/locations/2026/**/*.json'
  EXTRACT
    FORMAT json
  INTO locations
  SELECT
    $name AS name,
    $zipcode AS zipcode,
    ST_POINT($longitude, $latitude) AS location;
```

After you successfully create the `locations` pipeline, execute the `START PIPELINE` SQL statement to start the data pipeline.

<CodeGroup>
  ```sql SQL theme={null}
  START PIPELINE orders_pipeline;
  ```
</CodeGroup>

### Step 4: Observe the Load Progress

With your pipeline running, data begins to load immediately from the JSON files. If there are many files in each file group, the load process first sorts the files into batches, partitions them for parallel processing, and assigns them to Loader Nodes.

You can check the pipeline status and progress by querying the `information_schema.pipeline_status` system catalog table or executing the `SHOW PIPELINE_STATUS` SQL statement.

<CodeGroup>
  ```sql SQL theme={null}
  SHOW PIPELINE_STATUS;
  ```
</CodeGroup>

Output

<CodeGroup>
  ```sql SQL theme={null}
  database_name         pipeline_name       table_names             status      status_message                               percent_complete duration_seconds files_processed     files_failed        files_remaining     records_processed   records_loaded      records_failed
  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  geo                   locations           ["public.locations"]    RUNNING     Started processing pipeline locations        0.0              2.025824         0                   0                   1                   0                   0                   0
  ```
</CodeGroup>

After the status of the pipeline changes to `COMPLETED`, all data is available in the target table.

After a few seconds, the data is available for query in the `public.locations`table.

<CodeGroup>
  ```sql SQL theme={null}
  SELECT COUNT(*) FROM public.locations;
  ```
</CodeGroup>

*Output*

<CodeGroup>
  ```sql SQL theme={null}
  count(*)
  --------------------
  25605
  ```
</CodeGroup>

You can drop the pipeline with the `DROP PIPELINE locations;` SQL statement. Execution of this statement leaves the data in your target table, but removes metadata about the pipeline execution from the system.

## Related Links

[Data Pipelines Reference](/data-pipelines)

[Load JSON Data](/data-formats-for-data-pipelines#load-json-data)

[Data Types for Data Pipelines](/data-types-for-data-pipelines)

[Transform Data in Data Pipelines](/transform-data-in-data-pipelines)

[Manage Errors in Data Pipelines](/manage-errors-in-data-pipelines)
