HDFS Advanced Source Options
Besides the specified options in the HDFS Source Options section, you can also specify these options that are available for file system sources:- COMPRESSION_METHOD
- START_FILENAME
- END_FILENAME
- START_CREATED_TIMESTAMP
- END_CREATED_TIMESTAMP
- START_MODIFIED_TIMESTAMP
- END_MODIFIED_TIMESTAMP
- SORT_BY
- SORT_DIRECTION
- SORT_REWRITE
CONFIG source option, the most useful properties are the ones with the dfs.client prefix. For the full property reference, see HDFS Default and Core Default. Only properties that affect client connections and read operations apply to data pipeline loading. This table describes some common properties.
| Property Name | Property Description |
|---|---|
dfs.client.use.datanode.hostname | Set this property to true when you have datanodes and namenodes in the HDFS cluster configured to connect to each other using local IP addresses, but loading occurs externally and requires externally valid hostnames instead. The default value is false. |
dfs.client.retry.max.attempts | Configure to change the number of times the loading process retries network requests to namenodes before throwing an error. You can increase this value when network connections are unstable. The default value is 10. |
dfs.client.socket-timeout | Configure to change the timeout (in milliseconds) for all sockets used internally in HDFS loading. You can increase this value when network connections are unstable. The default value is 60000 (60 seconds). |
HDFS Loading Example
Follow these steps to load JSON data from an HDFS source into the Ocient System.Step 1: Create a New Database
Connect to a SQL Node using the Commands Supported by the Ocient JDBC CLI Program. Then, execute theCREATE DATABASE SQL statement to create the geo database.
SQL
Step 2: Create a New Table in the Database
Create thelocations table in the public schema to store location data with a name, zip code, and a point with latitude and longitude:
name— Location name as a not nullable stringzipcode— Zip code as an integerlocation— Location latitude and longitude as a point
SQL
Step 3: Preview and Create a Data Pipeline
Preview thelocations data pipeline to load JSON data from HDFS. This data pipeline uses the HDFS endpoint hdfs-namenode:9000, which consists of the namenode and port number, and the /locations/2026/**/*.json filter to load data from JSON files. The pipeline selects the name, zip code, and point data. For the point data construction, see ST_POINT.
SQL
locations data pipeline to load JSON data from HDFS.
SQL
locations pipeline, execute the START PIPELINE SQL statement to start the data pipeline.
Step 4: Observe the Load Progress
With your pipeline running, data begins to load immediately from the JSON files. If there are many files in each file group, the load process first sorts the files into batches, partitions them for parallel processing, and assigns them to Loader Nodes. You can check the pipeline status and progress by querying theinformation_schema.pipeline_status system catalog table or executing the SHOW PIPELINE_STATUS SQL statement.
COMPLETED, all data is available in the target table.
After a few seconds, the data is available for query in the public.locationstable.
DROP PIPELINE locations; SQL statement. Execution of this statement leaves the data in your target table, but removes metadata about the pipeline execution from the system.

