# Data Pipeline Load of JSON Data from HDFS
The data pipeline functionality enables loading data from HDFS. You can load various file types stored in HDFS into the Ocient System. The Ocient System uses data pipelines to transform each document into rows in one or more tables. The loading and transformation capabilities use a simple SQL-like syntax for transforming data. This tutorial guides you through a simple example of loading data in the JSON format.

## HDFS Advanced Source Options

Besides the options specified in the docid\ pbyszqvu5wonpgoso qto section, you can also specify these options that are available for file system sources:

- COMPRESSION METHOD
- START FILENAME
- END FILENAME
- START CREATED TIMESTAMP
- END CREATED TIMESTAMP
- START MODIFIED TIMESTAMP
- END MODIFIED TIMESTAMP
- SORT BY
- SORT DIRECTION
- SORT REWRITE

For the CONFIG source option, the most useful properties are the ones with the `dfs.client` prefix. For the full property reference, see https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml and https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml. Only properties that affect client connections and read operations apply to data pipeline loading.

This table describes some common properties.

| Property Name | Property Description |
| --- | --- |
| `dfs.client.use.datanode.hostname` | Set this property to `true` when the DataNodes and NameNodes in the HDFS cluster are configured to connect to each other using local IP addresses, but loading occurs externally and requires externally valid hostnames instead. The default value is `false`. |
| `dfs.client.retry.max.attempts` | Configure to change the number of times the loading process retries network requests to NameNodes before throwing an error. You can increase this value when network connections are unstable. The default value is `10`. |
| `dfs.client.socket-timeout` | Configure to change the timeout (in milliseconds) for all sockets used internally in HDFS loading. You can increase this value when network connections are unstable. The default value is `60000` (60 seconds). |

## HDFS Loading Example

Follow these steps to load JSON data from an HDFS source into the Ocient System.

### Step 1: Create a New Database

Connect to a SQL node using the docid\ zyojxip3zkuvr9skpiyxo. Then, execute the CREATE DATABASE SQL statement to create the geo database.

```sql
CREATE DATABASE geo;
```

### Step 2: Create a New Table in the Database

Create the locations table in the public schema to store location data with a name, zip code, and a point with latitude and longitude:

- `name`: Location name as a non-nullable string.
- `zipcode`: Zip code as an integer.
- `location`: Location latitude and longitude as a point.

```sql
CREATE TABLE public.locations (
    name VARCHAR(255) NOT NULL,
    zipcode INT NOT NULL,
    location POINT
);
```

### Step 3: Preview and Create a Data Pipeline

Preview the locations data pipeline to load JSON data from HDFS. This data pipeline uses the HDFS endpoint `hdfs://namenode:9000`, which consists of the NameNode and port number, and the `/locations/2026/*/*.json` filter to load data from JSON files. The pipeline selects the name, zip code, and point data. For the point data construction, see docid 9nnprzt3hqeet7csvwwxe.

```sql
PREVIEW PIPELINE locations
SOURCE HDFS
  ENDPOINT 'hdfs://namenode:9000'
  FILTER '/locations/2026/*/*.json'
EXTRACT FORMAT JSON
INTO locations SELECT
  $name AS name,
  $zipcode AS zipcode,
  ST_POINT($longitude, $latitude) AS location;
```

Create the locations data pipeline to load JSON data from HDFS.

```sql
CREATE PIPELINE locations
SOURCE HDFS
  ENDPOINT 'hdfs://namenode:9000'
  FILTER '/locations/2026/*/*.json'
EXTRACT FORMAT JSON
INTO locations SELECT
  $name AS name,
  $zipcode AS zipcode,
  ST_POINT($longitude, $latitude) AS location;
```

After you successfully create the locations pipeline, execute the START PIPELINE SQL statement to start the data pipeline.

```sql
START PIPELINE locations;
```

### Step 4: Observe the Load Progress

With your pipeline running, data begins to load immediately from the JSON files. If there are many files in each file group, the load process
first sorts the files into batches, partitions them for parallel processing, and assigns them to Loader Nodes.

You can check the pipeline status and progress by querying the information_schema.pipeline_status system catalog table or executing the SHOW PIPELINE STATUS SQL statement.

```sql
SHOW PIPELINE STATUS;
```

Output:

```
 database_name | pipeline_name | table_names          | status  | status_message                        | percent_complete | duration_seconds | files_processed | files_failed | files_remaining | records_processed | records_loaded | records_failed
---------------+---------------+----------------------+---------+---------------------------------------+------------------+------------------+-----------------+--------------+-----------------+-------------------+----------------+----------------
 geo           | locations     | ["public.locations"] | RUNNING | Started processing pipeline locations | 0.0              | 2.025824         | 0               | 0            | 1               | 0                 | 0              | 0
```

After the status of the pipeline changes to COMPLETED, all data is available in the target table. After a few seconds, the data is available for query in the public.locations table.

```sql
SELECT COUNT(*) FROM public.locations;
```

Output:

```
 count(*)
----------
    25605
```

You can drop the pipeline with the `DROP PIPELINE locations;` SQL statement. Execution of this statement leaves the data in your target table, but removes metadata about the pipeline execution from the system.

## Related Links

- docid\ pbyszqvu5wonpgoso qto
- docid\ y731i6lhout2b 7mc1jr8
- docid\ jfqu osagg5enkvmeesnl
- docid\ ti3mdibvgmuudmlqu9xpl
- docid\ xkhtu5p wt1py0q1rq2xv
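As a complement to the SHOW PIPELINE STATUS statement in Step 4, you can also query the information_schema.pipeline_status system catalog table directly, which is convenient for filtering on a single pipeline or selecting only the columns you care about. This is a minimal sketch, assuming the catalog table exposes the same column names that appear in the SHOW PIPELINE STATUS output (the exact column names in the catalog are an assumption here, not verified against the catalog definition):

```sql
-- Check progress of the locations pipeline.
-- Assumption: column names mirror the SHOW PIPELINE STATUS output.
SELECT pipeline_name,
       status,
       percent_complete,
       records_loaded,
       records_failed
FROM information_schema.pipeline_status
WHERE pipeline_name = 'locations';
```

A query like this can be polled from a monitoring script until `status` reaches COMPLETED.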