# LAT Load JSON Data from S3
Data pipelines are now the preferred method for loading data into the Ocient System. For details, see Load Data.

A common setup for batch loading files into Ocient is to load from a bucket on AWS S3 with time-partitioned data. In many instances, a batch load is performed on a recurring basis to load new files. The LAT transforms each document into rows in one or more different tables. Ocient's loading and transformation capabilities use a simple SQL-like syntax for transforming data.

This tutorial will guide users through a simple example load using a small set of data in JSONL (newline-delimited JSON) format. The data in this example is created from a test set for the Metabase business intelligence tool.

## Prerequisites

This tutorial assumes that:

- The Ocient System has network access to S3 from the Loader Nodes.
- An Ocient System is installed and configured with an active storage cluster (see the Ocient Application Configuration guide).
- Loading and Transformation is installed on the Loader Nodes.
- A default "sink" for the Ocient Loader Nodes is configured on the system.
- The LAT Client command-line interface is installed.

## Step 1: Create a New Database

To begin, you are going to load two example tables in a database. First, connect to a SQL Node using the commands supported by the Ocient JDBC CLI program. Then run the following DDL command:

```sql
CREATE DATABASE metabase;
```

## Step 2: Create Tables

To create tables in the new database, first connect to that database (e.g., `connect to jdbc:ocient://sql-node:4050/metabase`), then run the following DDL commands:

```sql
CREATE TABLE public.orders (
  created_at TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  id INT NOT NULL,
  user_id INT NOT NULL,
  product_id INT NOT NULL,
  subtotal DOUBLE,
  tax DOUBLE,
  total DOUBLE,
  discount DOUBLE,
  quantity INT,
  CLUSTERING INDEX idx01 (user_id, product_id)
);

CREATE TABLE public.products (
  created_at TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  id INT NOT NULL,
  ean VARCHAR(255),
  title VARCHAR(255),
  category VARCHAR(255) COMPRESSION GDC(2) NOT NULL,
  vendor VARCHAR(255),
  price DOUBLE,
  rating DOUBLE,
  CLUSTERING INDEX idx01 (category)
);
```

Now the database tables are created and you can begin loading data.

## Step 3: Create a Data Pipeline

Data pipelines are created using a simple loading configuration that is submitted to the transformation nodes to start loading. File groups designate a batch of files to load. Each file group is routed to one or more Ocient tables, and each column is the result of a transformation applied to the source document.

First, let's inspect the data you plan to load. Each document has a format similar to the following examples:

```json
/* orders */
{"id": 1, "user_id": 1, "product_id": 14, "subtotal": 37.65, "tax": 2.07, "total": 39.72, "discount": null, "created_at": "2019-02-11T21:40:27.892Z", "quantity": 2}
{"id": 2, "user_id": 1, "product_id": 123, "subtotal": 110.93, "tax": 6.1, "total": 117.03, "discount": null, "created_at": "2018-05-15T08:04:04.580Z", "quantity": 3}

/* products */
{"id": 1, "ean": "1018947080336", "title": "Rustic Paper Wallet", "category": "Gizmo", "vendor": "Swaniawski, Casper and Hilll", "price": 29.46, "rating": 4.6, "created_at": "2017-07-19T19:44:56.582Z"}
{"id": 2, "ean": "7663515285824", "title": "Small Marble Shoes", "category": "Doohickey", "vendor": "Balistreri-Ankunding", "price": 70.08, "rating": 0, "created_at": "2019-04-11T08:49:35.932Z"}
{"id": 3, "ean": "4966277046676", "title": "Synergistic Granite Chair", "category": "Doohickey", "vendor": "Murray, Watsica and Wunsch", "price": 35.39, "rating": 4, "created_at": "2018-09-08T22:03:20.239Z"}
```

As you can see, this is similar to your target schema, but it will require some transformation. Most transformations are identical to functions already in Ocient's SQL dialect.
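The only nontrivial transformation for these documents is parsing the ISO-8601 `created_at` strings into timestamps, which the pipeline below does with a `to_timestamp` expression. As a point of reference, the following Python sketch (not part of the LAT, just an illustration) parses the same sample values and prints the equivalent epoch-nanosecond representation, which is the form you will see again in the preview output in Step 5.

```python
from datetime import datetime, timezone

# Illustration only: parse the created_at strings from the sample documents
# (literal 'T', fractional seconds, and a zone offset) and show the
# equivalent epoch-nanosecond value.
samples = ["2019-02-11T21:40:27.892Z", "2018-05-15T08:04:04.580Z"]
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

for raw in samples:
    # fromisoformat does not accept a trailing 'Z' before Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    ts = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    delta = ts - epoch
    nanos = (delta.days * 86_400 + delta.seconds) * 1_000_000_000 \
        + delta.microseconds * 1_000
    print(raw, "->", nanos)
    # e.g. 2019-02-11T21:40:27.892Z -> 1549921227892000000
```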
"vendor" "murray, watsica and wunsch", "price" 35 39, "rating" 4, "created at" "2018 09 08t22 03 20 239z"} as you can see, this is similar to your target schema, but will require some transformation most transformations are identical to functions already in ocient’s sql dialect to route data to your tables, you need to create a pipeline json file that has the following structure { "version" 2, "workers" 4, "source" { "type" "s3", "endpoint" "https //s3 us east 1 amazonaws com", "bucket" "ocient examples", "compression" "none", "file groups" { "orders" { "prefix" "metabase samples/jsonl", "file matcher syntax" "glob", "file matcher pattern" " orders jsonl", "sort type" "lexicographic" }, "products" { "prefix" "metabase samples/jsonl", "file matcher syntax" "glob", "file matcher pattern" " products jsonl", "sort type" "lexicographic" } } }, "transform" { "file groups" { "orders" { "tables" { "metabase public orders" { "columns" { "id" "id", "user id" "user id", "product id" "product id", "subtotal" "subtotal", "tax" "tax", "total" "total", "discount" "discount", "created at" "to timestamp(created at, 'yyyy mm dd\\\\'t\\\\'hh\ mm\ ss\[ sss]x')", "quantity" "quantity" } } } }, "products" { "tables" { "metabase public products" { "columns" { "id" "id", "ean" "ean", "title" "title", "category" "category", "vendor" "vendor", "price" "price", "rating" "rating", "created at" "to timestamp(created at, 'yyyy mm dd\\\\'t\\\\'hh\ mm\ ss\[ sss]x')" } } } } } } } the most interesting part of this pipeline json file is the way it defines the file groups note that each sets the s3 endpoint, a bucket, a prefix used for filtering the considered files, and then a file matcher in this case you only have a single file, but if there were many files matching the pattern orders jsonl then they would all be part of the file group the final parameter that you supplied is the sort type for the file load this informs the lat how you would like data to be ordered when loading the ideal sort is in time order according to the defined {{timekey}} this makes more efficient segments and is much faster to load in this case, you used the lexicographic sort which orders according to the characters in the file name other sort types are available to use file modified time or to extract the timestamp for sorting from the file path or file name step 4 using the loading and transformation cli with a pipeline json file ready to go, you can test this pipeline to test, use the lat cli for these examples, assume that two lats are configured and set using an environment variable first configure the lat cli to use the hosts of the ocient loading and transformation service you can add these to every cli command as a flag, but for simplicity you can also set them as environment variables from a command line, run the following command replacing the ip addresses with the ip addresses of your lat processes export lat hosts="https //10 0 0 1 8443,https //10 0 0 2 8443" if your lat is running without tls configured, replace the port number of your lat hosts with 8080 and the protocol with http // next, check on the status of the lat lat client pipeline status example response 10 0 0 1 8443 stopped 10 0 0 2 8443 stopped this confirms that you can reach the lat from your cli if the status is "running" it means a pipeline is already executing a pipeline next, you are going to update and start your new pipeline this example uses secure connections if you receive an ssl error when testing, your service might not be configured to use tls or you might 
## Step 4: Using the Loading and Transformation CLI

With a pipeline JSON file ready to go, you can test this pipeline. To test, use the LAT CLI. These examples assume that two LATs are configured and that their hosts are set using an environment variable.

First, configure the LAT CLI to use the hosts of the Ocient Loading and Transformation service. You can add these to every CLI command as a flag, but for simplicity you can also set them as environment variables. From a command line, run the following command, replacing the IP addresses with the IP addresses of your LAT processes:

```shell
export LAT_HOSTS="https://10.0.0.1:8443,https://10.0.0.2:8443"
```

If your LAT is running without TLS configured, replace the port number of your LAT hosts with 8080 and the protocol with http://.

Next, check on the status of the LAT:

```shell
lat_client pipeline status
```

Example response:

```
10.0.0.1:8443 STOPPED
10.0.0.2:8443 STOPPED
```

This confirms that you can reach the LAT from your CLI. If the status is "RUNNING", a pipeline is already executing.

Next, you are going to update and start your new pipeline. This example uses secure connections. If you receive an SSL error when testing, your service might not be configured to use TLS, or you might need to use the --no-verify flag if certificate validation fails.

## Step 5: Test the Transformation

The CLI supports previewing a transformation with an example document and the pipeline file. This makes it easy to test your transformations.

First, save an example document to your file system to use for this test. For this demo, you can download an example file from https://ocient-examples.s3.amazonaws.com/metabase-samples/jsonl/orders.jsonl and save it to /orders.jsonl. Next, make sure the pipeline JSON file that you created is stored at /pipeline.json.

Now that both files are available, you can run the CLI to preview the results. Pass the preview command the topic name, the pipeline file, and the sample record file. The response contains the transformed data tied to the destination table and a list of any error records. Similar to how you can preview records on a Kafka topic, for file loads you can supply any one of the topics you created as file groups to preview the transformations.

```shell
lat_client preview --topic orders --pipeline /pipeline.json --records /orders.jsonl
```

Example response:

```json
{
  "tableRecords": {
    "metabase.public.orders": [
      {
        "id": 1,
        "user_id": 1,
        "product_id": 14,
        "subtotal": 37.65,
        "tax": 2.07,
        "total": 39.72,
        "discount": null,
        "created_at": 1549921227892000000,
        "quantity": 2
      },
      {
        "id": 2,
        "user_id": 1,
        "product_id": 123,
        "subtotal": 110.93,
        "tax": 6.1,
        "total": 117.03,
        "discount": null,
        "created_at": 1526371444580000000,
        "quantity": 3
      },
      {
        "id": 3,
        "user_id": 1,
        "product_id": 105,
        "subtotal": 52.72,
        "tax": 2.9,
        "total": 49.2,
        "discount": 6.42,
        "created_at": 1575670968544000000,
        "quantity": 2
      }
    ]
  },
  "recordErrors": []
}
```

You can see that the data is transformed and the columns to which each transformed value will be mapped. If there are issues in the values, they will appear in the recordErrors object, and you can quickly update your pipeline JSON file and preview again. Now you can inspect different documents to confirm that various states of data cleanliness, like missing columns, null values, and special characters, are well handled by your transformations.
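The preview step is a convenient place to exercise messy inputs. The following sketch writes a small JSONL file of edge-case orders records (a missing column, explicit nulls, a timestamp without fractional seconds) that you could pass to the preview command in place of the downloaded sample; the record values and the output file name are made up for illustration.

```python
import json

# Sketch: build a few edge-case orders records to preview against the
# pipeline, per the suggestion above to test missing columns, null values,
# and timestamp variations. The output file name is arbitrary.
records = [
    # Missing the discount column entirely.
    {"id": 100, "user_id": 1, "product_id": 14, "subtotal": 37.65, "tax": 2.07,
     "total": 39.72, "quantity": 2, "created_at": "2019-02-11T21:40:27.892Z"},
    # Explicit nulls in the numeric columns.
    {"id": 101, "user_id": 1, "product_id": 14, "subtotal": None, "tax": None,
     "total": None, "discount": None, "quantity": 1,
     "created_at": "2018-05-15T08:04:04.580Z"},
    # Timestamp without fractional seconds (exercises the optional [.SSS]).
    {"id": 102, "user_id": 2, "product_id": 123, "subtotal": 10.0, "tax": 0.5,
     "total": 10.5, "discount": 0.0, "quantity": 1,
     "created_at": "2018-05-15T08:04:04Z"},
]

with open("orders_edge_cases.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

You could then point the preview command's records argument at orders_edge_cases.jsonl instead of the downloaded sample file.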
## Step 6: Configure and Start the Data Pipeline

With a tested transformation, the next step is to set up and start the data pipeline. First, you must configure the pipeline using the pipeline create command. This validates and creates the pipeline, but it does not take effect until you start the pipeline.

```shell
lat_client pipeline create --pipeline /pipeline.json
```

Example response:

```
10.0.0.1:8443 CREATED
10.0.0.2:8443 CREATED
```

In cases where there is an existing pipeline operating, it is necessary to stop and remove the original pipeline before creating and starting the new pipeline.

Now that the pipeline has been created on all LAT nodes, you can start the LAT by running the pipeline start command:

```shell
lat_client pipeline start
```

Example response:

```
10.0.0.1:8443 RUNNING
10.0.0.2:8443 RUNNING
```

## Step 7: Confirm That Loading Is Operating Correctly

With your pipeline in place and running, data immediately begins loading from the S3 file groups you defined. If there were many files per file group, the LAT would first sort the files, then partition them for the fastest loading based on the sorting criteria you provided.

### Observing Loading Progress

With the pipeline running, data immediately begins to load into Ocient. To observe this progress, you can use the pipeline status command from the LAT Client or monitor the LAT metrics endpoint of the Loader Nodes.

You can check the status with this command by using the --list-files flag to include a summary of the files included in the load:

```shell
lat_client pipeline status --list-files
```

Example response:

```
10.0.0.1:8443 RUNNING
10.0.0.2:8443 RUNNING

Pipeline Files Processed: 0   Pipeline Error Count: 0   Pipeline Files Remaining: 2
orders    Files Processed: 0   Error Count: 0   Files Remaining: 1
products  Files Processed: 0   Error Count: 0   Files Remaining: 1

orders status:
  Filename: PROCESSING metabase-samples/jsonl/orders.jsonl
  In-Process Files: PROCESSING metabase-samples/jsonl/orders.jsonl

products status:
  Filename: PROCESSING metabase-samples/jsonl/products.jsonl
  In-Process Files: PROCESSING metabase-samples/jsonl/products.jsonl
```

To monitor the LAT metrics endpoint instead, run:

```shell
curl https://127.0.0.1:8443/v2/metrics/lat:type=pipeline
```

If your LAT is running without TLS configured, replace the port number of your LAT hosts with 8080 and the protocol with http://.

Example response:

```json
{
  "request": {
    "mbean": "lat:type=pipeline",
    "type": "read"
  },
  "value": {
    "partitions": [
      {
        "offsets_durable": 1,
        "pushes_errors": 0,
        "pushes_attempts": 1,
        "rows_pushed": 1,
        "offsets_written": 18759,
        "records_buffered": 0,
        "records_errors_column": 0,
        "records_errors_deserialization": 0,
        "source_bytes_buffered": 0,
        "records_errors_transformation": 0,
        "offsets_processed": 18759,
        "partition": "table_orders_0",
        "records_filter_accepted": 1,
        "records_errors_row": 0,
        "records_filter_rejected": 0,
        "records_errors_generic": 0,
        "producer_send_attempts": 0,
        "offsets_pushed": 18759,
        "pushes_unacknowledged": 0,
        "invalid_state": 0,
        "bytes_pushed": 88,
        "records_errors_total": 0,
        "offsets_buffered": 18759,
        "complete": 0,
        "offsets_end": 1,
        "producer_send_errors": 0
      },
      {
        "offsets_durable": 1,
        "pushes_errors": 0,
        "pushes_attempts": 1,
        "rows_pushed": 1,
        "offsets_written": 199,
        "records_buffered": 0,
        "records_errors_column": 0,
        "records_errors_deserialization": 0,
        "source_bytes_buffered": 0,
        "records_errors_transformation": 0,
        "offsets_processed": 199,
        "partition": "table_products_0",
        "records_filter_accepted": 1,
        "records_errors_row": 0,
        "records_filter_rejected": 0,
        "records_errors_generic": 0,
        "producer_send_attempts": 0,
        "offsets_pushed": 199,
        "pushes_unacknowledged": 0,
        "invalid_state": 0,
        "bytes_pushed": 145,
        "records_errors_total": 0,
        "offsets_buffered": 199,
        "complete": 0,
        "offsets_end": 1,
        "producer_send_errors": 0
      }
    ],
    "paused": 1,
    "bytes_buffered": 0,
    "workers": 20
  },
  "timestamp": 1626970368,
  "status": 200
}
```
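If you prefer to watch these counters programmatically rather than reading the raw JSON, a short script can poll the same endpoint. This is only a sketch: the URL comes from the curl command above, the field names mirror the example response, and verify=False stands in for a deployment with a self-signed certificate.

```python
import requests

# Summarize per-partition progress from the LAT metrics endpoint shown above.
# Field names follow the example response; adjust them if your LAT version
# reports different metric names.
response = requests.get(
    "https://127.0.0.1:8443/v2/metrics/lat:type=pipeline",
    verify=False,  # stand-in for a self-signed certificate; prefer a CA bundle
)
response.raise_for_status()

for partition in response.json()["value"]["partitions"]:
    print(
        partition["partition"],
        "offsets_processed:", partition["offsets_processed"],
        "records_errors_total:", partition["records_errors_total"],
    )
```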
### Check Row Counts in Tables

To confirm that you are seeing results in the target tables, you can also run some simple queries to check row counts. Depending on the Streamloader role settings, the time for records to become queryable can vary from a few seconds to minutes.

Example queries:

```sql
ocient> SELECT COUNT(*) FROM public.orders;
count(*)
---------
   18760

ocient> SELECT COUNT(*) FROM public.products;
count(*)
---------
     200
```

Now you can explore the data in these two tables with any Ocient SQL queries.

### Check Errors

In this example, all rows load successfully. However, a successful load does not always happen, and you can inspect errors using the LAT Client. Whenever the LAT process fails to parse a file correctly or fails to transform or load a record, the LAT process records an error. The LAT Client includes the `lat_client pipeline errors` command that reports the latest errors on the pipeline. A full error log is available on the Loader Nodes. These logs report all bad records and the reason that the load fails.

When you load a pipeline from Kafka, the load might route errors to an error topic on the Kafka broker instead of the logs. The LAT Client does not contain the errors sent to the error topic; you can inspect these errors with Kafka utilities instead.

This LAT Client command displays a maximum of 100 error messages:

```shell
lat_client pipeline errors --max-errors 100 --only-error-messages
```

```
+--------------------------------------------------+
|                EXCEPTION MESSAGE                 |
+--------------------------------------------------+
| Column Name: time1 Message: Failed to evaluate   |
| expression Cause:                                |
| java.time.format.DateTimeParseException          |
+--------------------------------------------------+
| Column Name: time1 Message: Failed to evaluate   |
| expression Cause:                                |
| java.time.format.DateTimeParseException          |
+--------------------------------------------------+
```

The errors indicate that there is an issue parsing the time1 column. Options exist on the pipeline errors command to return JSON and to restrict the response to specific components of the error detail, including a reference to the source location of the record. The following command returns JSON that is delimited with newline characters. You can pass the JSON output to jq or a file. The JSON includes the source topic or file group, the filename where the error occurred, the offset that indicates the line number or Kafka offset, and the exception message that aids in troubleshooting and identifying the incorrect record in the source data. You can use the log_original_message pipeline setting to provide direct access to the parsed source record for errors when appropriate.

```shell
lat_client pipeline errors --max-errors 100 --json
```

```json
{"time": "2022-05-17T16:53:50.387386+00:00", "topic": "calcs", "partition": 0, "state": "TRANSFORMATION_ERROR", "exception_message": "Column Name: time1 Message: Failed to evaluate expression Cause: java.time.format.DateTimeParseException: Cannot parse time \"19:36:22\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')' 'Value(MinuteOfHour,2)' 'Value(SecondOfMinute,2)\"\njava.time.format.DateTimeParseException: Cannot parse time \"19:36:22\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')' 'Value(MinuteOfHour,2)' 'Value(SecondOfMinute,2)\"\nCannot parse time \"19:36:22\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')' 'Value(MinuteOfHour,2)' 'Value(SecondOfMinute,2)\"", "offset": 0, "record": null, "metadata": {"size": "3321", "filename": "calcs/csv/calcs_01.csv"}}
{"time": "2022-05-17T16:53:50.404684+00:00", "topic": "calcs", "partition": 0, "state": "TRANSFORMATION_ERROR", "exception_message": "Column Name: time1 Message: Failed to evaluate expression Cause: java.time.format.DateTimeParseException: Cannot parse time \"02:05:25\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')' 'Value(MinuteOfHour,2)' 'Value(SecondOfMinute,2)\"\njava.time.format.DateTimeParseException: Cannot parse time \"02:05:25\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')' 'Value(MinuteOfHour,2)' 'Value(SecondOfMinute,2)\"\nCannot parse time \"02:05:25\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')' 'Value(MinuteOfHour,2)' 'Value(SecondOfMinute,2)\"", "offset": 1, "record": null, "metadata": {"size": "3321", "filename": "calcs/csv/calcs_01.csv"}}
```

## Related Links

- LAT Overview
- LAT Data Types in Loading
- LAT Advanced Topics