Data Pipelines are now the preferred method for loading data into the System. For details, see Load Data.
A common setup for streaming data into Ocient is to send JSON documents to Kafka and then transform each document into rows in one or more tables. Ocient’s Loading and Transformation (LAT) capabilities use a simple SQL-like syntax for transforming data. This tutorial guides you through a simple example load using a small set of data in JSON format. The data in this example is created from a test set for the Metabase Business Intelligence tool.

Prerequisites

This tutorial assumes that:
  1. A Kafka cluster is operational and can be reached by the Ocient Loader Nodes.
  2. An Ocient System is installed and configured with an active storage cluster (See the Ocient Application Configuration guide).
  3. The Ocient Loader Nodes are running the latest Loading and Transformation version which is configured to connect to Kafka for stream loading.
  4. A default “sink” for the Ocient Loader Nodes is configured on the system.
  5. The LAT Client Command Line Interface is installed.
  6. The test data for this Tutorial can be found at the following S3 addresses. You must be logged into Amazon AWS to download these files.
     https://ocient-docs.s3.amazonaws.com/metabase_samples/jsonl/orders.jsonl
     https://ocient-docs.s3.amazonaws.com/metabase_samples/jsonl/products.jsonl
     https://ocient-docs.s3.amazonaws.com/metabase_samples/jsonl/people.jsonl
     https://ocient-docs.s3.amazonaws.com/metabase_samples/jsonl/reviews.jsonl

Step 1: Create a New Database

To begin, you are going to load four example tables in a database. First, connect to a SQL Node using the Ocient JDBC CLI program (see Commands Supported by the Ocient JDBC CLI Program). Then run the following DDL command:
SQL
CREATE DATABASE metabase;

Step 2: Create Tables

To create tables in the new database, first connect to that database (e.g., connect to jdbc:ocient://sql-node:4050/metabase), then run the following DDL commands:
SQL
CREATE TABLE public.orders (
  created_at TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  id INT NOT NULL,
  user_id INT NOT NULL,
  product_id INT NOT NULL,
  subtotal DOUBLE,
  tax DOUBLE,
  total DOUBLE,
  discount DOUBLE,
  quantity INT,
  CLUSTERING INDEX idx01 (user_id, product_id)
);

CREATE TABLE public.people (
  created_at TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  id INT NOT NULL,
  address VARCHAR(255),
  email VARCHAR(255),
  password VARCHAR(255),
  name VARCHAR(255),
  city VARCHAR(255),
  longitude DOUBLE,
  state VARCHAR(255),
  source VARCHAR(255),
  birth_date DATE,
  zip VARCHAR(255),
  latitude DOUBLE,
  CLUSTERING INDEX idx01 (id)
);

CREATE TABLE public.products(
  created_at TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  id INT NOT NULL,
  ean VARCHAR(255),
  title VARCHAR(255),
  category VARCHAR(255) COMPRESSION GDC(2) NOT NULL,
  vendor VARCHAR(255),
  price DOUBLE,
  rating DOUBLE,
  CLUSTERING INDEX idx01 (category)
);

CREATE TABLE public.reviews (
  created_at TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  id INT NOT NULL,
  product_id INT NOT NULL,
  reviewer VARCHAR(255),
  rating INT,
  body VARCHAR(255),
  CLUSTERING INDEX idx01 (product_id)
);
Now, the database tables are created, and you can begin loading data.
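The NOT NULL constraints in these tables mean that a source record with a missing or null value in one of those columns cannot load cleanly. As a rough pre-check, the following Python sketch lists the NOT NULL columns from the DDL above and reports which of them are absent or null in a record. The helper name `missing_required` and the dictionary layout are illustrative assumptions, not part of any Ocient tool, and the sample record is the first order from the tutorial's test data.

```python
# Columns declared NOT NULL in the CREATE TABLE statements above.
NOT_NULL_COLUMNS = {
    "orders": ["created_at", "id", "user_id", "product_id"],
    "people": ["created_at", "id"],
    "products": ["created_at", "id", "category"],
    "reviews": ["created_at", "id", "product_id"],
}


def missing_required(table, record):
    """Return the NOT NULL columns that are absent or null in a source record."""
    return [col for col in NOT_NULL_COLUMNS[table] if record.get(col) is None]


sample_order = {
    "id": 1, "user_id": 1, "product_id": 14, "subtotal": 37.65, "tax": 2.07,
    "total": 39.72, "discount": None, "created_at": "2019-02-11T21:40:27.892Z",
    "quantity": 2,
}
print(missing_required("orders", sample_order))  # prints []
```

The null `discount` value is not reported because the `discount` column is nullable.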

Step 3: Create a Data Pipeline

Data pipelines are created using a simple loading configuration that is submitted to the Transformation Nodes to start loading. Each Kafka topic is routed to one or more Ocient tables, and each column is the result of a transformation applied to the source document. First, inspect the data that you plan to load. Each document has a format similar to the following example.
JSON
/* orders */
{"id": 1, "user_id": 1, "product_id": 14, "subtotal": 37.65, "tax": 2.07, "total": 39.72, "discount": null, "created_at": "2019-02-11T21:40:27.892Z", "quantity": 2}
{"id": 2, "user_id": 1, "product_id": 123, "subtotal": 110.93, "tax": 6.1, "total": 117.03, "discount": null, "created_at": "2018-05-15T08:04:04.580Z", "quantity": 3}
...

/* products */
{"id": 1, "ean": "1018947080336", "title": "Rustic Paper Wallet", "category": "Gizmo", "vendor": "Swaniawski, Casper and Hilll", "price": 29.46, "rating": 4.6, "created_at": "2017-07-19T19:44:56.582Z"}
{"id": 2, "ean": "7663515285824", "title": "Small Marble Shoes", "category": "Doohickey", "vendor": "Balistreri-Ankunding", "price": 70.08, "rating": 0, "created_at": "2019-04-11T08:49:35.932Z"}
{"id": 3, "ean": "4966277046676", "title": "Synergistic Granite Chair", "category": "Doohickey", "vendor": "Murray, Watsica and Wunsch", "price": 35.39, "rating": 4, "created_at": "2018-09-08T22:03:20.239Z"}
...

/* people */
{"id": 1, "address": "9611-9809 West Rosedale Road", "email": "borer-hudson@yahoo.com", "password": "ccca881f-3e4b-4e5c-8336-354103604af6", "name": "Hudson Borer", "city": "Wood River", "longitude": -98.5259864, "state": "NE", "source": "Twitter", "birth_date": "1986-12-12", "zip": "68883", "latitude": 40.71314890000001, "created_at": "2017-10-07T01:34:35.462Z"}
{"id": 2, "address": "101 4th Street", "email": "williamson-domenica@yahoo.com", "password": "eafc45bf-cf8e-4c96-ab35-ce44d0021597", "name": "Domenica Williamson", "city": "Searsboro", "longitude": -92.6991321, "state": "IA", "source": "Affiliate", "birth_date": "1967-06-10", "zip": "50242", "latitude": 41.5813224, "created_at": "2018-04-09T12:10:05.167Z"}
{"id": 3, "address": "29494 Anderson Drive", "email": "lina.heaney@yahoo.com", "password": "36f67891-34e5-4439-a8a4-2d9246775ff8", "name": "Lina Heaney", "city": "Sandstone", "longitude": -92.8416108, "state": "MN", "source": "Facebook", "birth_date": "1961-12-18", "zip": "55072", "latitude": 46.11973039999999, "created_at": "2017-06-27T06:06:20.625Z"}
...

/* reviews */
{"id": 1, "product_id": 1, "reviewer": "christ", "rating": 5, "body": "Ad perspiciatis quis et consectetur. Laboriosam fuga voluptas ut et modi ipsum. Odio et eum numquam eos nisi. Assumenda aut magnam libero maiores nobis vel beatae officia.", "created_at": "2018-05-15T20:25:48.517Z"}
{"id": 2, "product_id": 1, "reviewer": "xavier", "rating": 4, "body": "Reprehenderit non error architecto consequatur tempore temporibus. Voluptate ut accusantium quae est. Aut sit quidem nihil maxime dolores molestias. Enim vel optio est fugiat vitae cumque ut. Maiores laborum rerum quidem voluptate rerum.", "created_at": "2019-08-07T13:50:33.401Z"}
{"id": 3, "product_id": 1, "reviewer": "cameron.nitzsche", "rating": 5, "body": "In aut numquam labore fuga. Et tempora sit et mollitia aut ullam et repellat. Aliquam sint tenetur culpa eius tenetur. Molestias ipsa est ut quisquam hic necessitatibus. Molestias maiores vero nesciunt.", "created_at": "2018-03-30T00:28:45.192Z"}
...
As you can see, the documents are similar to the target schema but require some transformation; for example, the created_at strings must be converted to timestamps. Most transformations are identical to functions already in Ocient’s SQL dialect. To route data to your tables, you must create a pipeline.json file that has the following structure:
JSON
{
    "version": 2,
    "workers": 4,
    "pipeline_id": "pipeline-metabase",
    "source": {
        "type": "kafka",
        "kafka": {
          "bootstrap.servers": "127.0.0.1:9092",
          "auto.offset.reset": "earliest"
        }
    },
    "transform": {
        "topics": {
            "orders": {
                "tables": {
                    "metabase.public.orders": {
                        "columns": {
                            "id": "id",
                            "user_id": "user_id",
                            "product_id": "product_id",
                            "subtotal": "subtotal",
                            "tax": "tax",
                            "total": "total",
                            "discount": "discount",
                            "created_at": "to_timestamp(created_at, 'yyyy-MM-dd\\'T\\'HH:mm:ss[.SSS]X')",
                            "quantity": "quantity"
                        }
                    }
                }
            },
            "people": {
                "tables": {
                    "metabase.public.people": {
                        "columns": {
                            "id": "id",
                            "address": "address",
                            "email": "email",
                            "password": "password",
                            "name": "name",
                            "city": "city",
                            "longitude": "longitude",
                            "state": "state",
                            "source": "source",
                            "birth_date": "birth_date",
                            "zip": "zip",
                            "latitude": "latitude",
                            "created_at": "to_timestamp(created_at, 'yyyy-MM-dd\\'T\\'HH:mm:ss[.SSS]X')"
                        }
                    }
                }
            },
            "reviews": {
                "tables": {
                    "metabase.public.reviews": {
                        "columns": {
                            "id": "id",
                            "product_id": "product_id",
                            "reviewer": "reviewer",
                            "rating": "rating",
                            "body": "body",
                            "created_at": "to_timestamp(created_at, 'yyyy-MM-dd\\'T\\'HH:mm:ss[.SSS]ZZZZZ')"
                        }
                    }
                }
            },
            "products": {
                "tables": {
                    "metabase.public.products": {
                        "columns": {
                            "id": "id",
                            "ean": "ean",
                            "title": "title",
                            "category": "category",
                            "vendor": "vendor",
                            "price": "price",
                            "rating": "rating",
                            "created_at": "to_timestamp(created_at, 'yyyy-MM-dd\\'T\\'HH:mm:ss[.SSS]X')"
                        }
                    }
                }
            }
        }
    }
}
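Because every topic in this file uses a one-to-one column mapping plus a `to_timestamp` call, the `topics` entries can also be generated rather than typed by hand. The Python sketch below is one way to do that; the function name `topic_entry` is a hypothetical helper, not part of the LAT tooling. It also illustrates the escaping: in the JSON file each `\\'` is an escaped backslash followed by a quote, so after JSON decoding the expression contains `\'T\'`.

```python
import json

# Format string as it reads after JSON decoding: each \' is a backslash
# followed by a single quote. json.dumps doubles the backslash again.
TS_FORMAT = "yyyy-MM-dd\\'T\\'HH:mm:ss[.SSS]X"


def topic_entry(table, fields, timestamp_field="created_at"):
    """Build one "transform.topics" entry with a one-to-one column mapping."""
    columns = {name: name for name in fields}
    columns[timestamp_field] = f"to_timestamp({timestamp_field}, '{TS_FORMAT}')"
    return {"tables": {table: {"columns": columns}}}


products = topic_entry(
    "metabase.public.products",
    ["id", "ean", "title", "category", "vendor", "price", "rating", "created_at"],
)
print(json.dumps({"products": products}, indent=4))
```

The printed fragment has the same shape as the `products` entry above and can be pasted under `transform.topics`.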

Step 4: Using the Loading and Transformation CLI

With a pipeline.json file ready to go, you can test this pipeline using the LAT CLI. These examples assume that two LAT instances are configured. First, configure the LAT CLI to use the hosts of your Loading and Transformation service. You can add the hosts to every CLI command as a flag, but for simplicity you can set them as an environment variable instead. From a command line, run the following command, replacing the IP addresses with the IP addresses of your LAT processes:
Shell
export LAT_HOSTS="https://10.0.0.1:8443,https://10.0.0.2:8443"
If your LAT is running without TLS configured, replace the port number of your LAT Hosts with 8080 and the protocol with http://.
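If you switch between TLS and non-TLS environments often, the substitution described above can be scripted. The following Python sketch rewrites a comma-separated host list to use http:// and port 8080; the function name `without_tls` is an illustrative assumption, and the sketch only manipulates the string, it does not contact the LAT.

```python
from urllib.parse import urlsplit


def without_tls(lat_hosts, port=8080):
    """Rewrite a comma-separated LAT_HOSTS value to use http:// and the given port."""
    rewritten = []
    for host in lat_hosts.split(","):
        parts = urlsplit(host.strip())
        rewritten.append(f"http://{parts.hostname}:{port}")
    return ",".join(rewritten)


print(without_tls("https://10.0.0.1:8443,https://10.0.0.2:8443"))
# prints http://10.0.0.1:8080,http://10.0.0.2:8080
```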
Next, check on the status of the LAT:
Shell
lat_client pipeline status
Example response:
Bash
10.0.0.1:8443: Stopped
10.0.0.2:8443: Stopped
Success! This confirms that you can reach the LAT from your CLI. If the status is “Running”, a pipeline is already executing. Next, you create and start your new pipeline. This example uses secure connections. If you receive an SSL error when testing, your service might not be configured to use TLS, or you might need to use the --no-verify flag if certificate validation fails.

Step 5: Test the Transformation

The CLI supports previewing a transformation with an example document and the pipeline file. This makes it easy to test your transformations. First, save an example document to your file system to use for this test. For this demo, you can download an example file from https://ocient-docs.s3.amazonaws.com/metabase_samples/jsonl/orders.jsonl and save it to ~/orders.jsonl. Next, make sure the pipeline.json file that you created is stored at ~/pipeline.json. Now that both files are available, run the CLI to preview the results. You can pass the preview command the topic name, the pipeline file, and the sample record file. The response contains the transformed data tied to the destination table and a list of any error records.
Shell
lat_client preview --topic orders --pipeline ~/pipeline.json --records ~/orders.jsonl
Example response:
JSON
{
    "tableRecords": {
        "metabase.public.orders": [
            {
                "id": 1,
                "user_id": 1,
                "product_id": 14,
                "subtotal": 37.65,
                "tax": 2.07,
                "total": 39.72,
                "discount": null,
                "created_at": 1549921227892000000,
                "quantity": 2
            },
            {
                "id": 2,
                "user_id": 1,
                "product_id": 123,
                "subtotal": 110.93,
                "tax": 6.1,
                "total": 117.03,
                "discount": null,
                "created_at": 1526371444580000000,
                "quantity": 3
            },
            {
                "id": 3,
                "user_id": 1,
                "product_id": 105,
                "subtotal": 52.72,
                "tax": 2.9,
                "total": 49.2,
                "discount": 6.42,
                "created_at": 1575670968544000000,
                "quantity": 2
            }
        ]
    },
    "recordErrors": []
}
The response shows each transformed value and the column to which it is mapped. If there are issues in the values, they appear in the recordErrors array. You can quickly update your pipeline.json file and preview again. You can also preview different documents to confirm that your transformations handle various states of data cleanliness, such as missing columns, null values, and special characters.
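The created_at values in the preview are integers rather than strings. Comparing them with the source documents suggests they are nanoseconds since the Unix epoch, which you can check independently. The Python sketch below converts the source strings with the standard library; the function name `to_epoch_nanos` is illustrative, and it assumes the fractional seconds and the trailing Z are always present, which is stricter than the `[.SSS]` optional section in the pipeline format string.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_nanos(text):
    """Convert a UTC timestamp such as 2019-02-11T21:40:27.892Z to epoch nanoseconds."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    delta = parsed - EPOCH
    # Integer arithmetic avoids the rounding that float seconds would introduce.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000_000 + delta.microseconds * 1_000


print(to_epoch_nanos("2019-02-11T21:40:27.892Z"))  # prints 1549921227892000000
print(to_epoch_nanos("2018-05-15T08:04:04.580Z"))  # prints 1526371444580000000
```

Both results match the created_at values of the first two records in the example response.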

Step 6: Configure and Start the Data Pipeline

With a tested transformation, the next step is to set up and start the data pipeline. First, configure the pipeline using the pipeline create command. This command validates and creates the pipeline, but the pipeline does not take effect until you start it:
Shell
lat_client pipeline create --pipeline ~/pipeline.json
Example response:
Bash
10.0.0.1:8443: Created
10.0.0.2:8443: Created
If an existing pipeline is operating, you must stop and remove it before you create and start the new pipeline.
Now that the pipeline has been created on all LAT Nodes, you can start it by running the pipeline start command:
Shell
lat_client pipeline start
Example response:
Bash
10.0.0.1:8443: Running
10.0.0.2:8443: Running
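When you script these steps, it helps to confirm that every LAT node reports the expected state before moving on. The Python sketch below parses text in the `host:port: State` layout of the example responses in this tutorial; it assumes the CLI output keeps that layout, and the function names `parse_status` and `all_in_state` are hypothetical helpers rather than LAT Client features.

```python
def parse_status(output):
    """Map each LAT host to its reported state, e.g. {"10.0.0.1:8443": "Running"}."""
    states = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last ": " because the host itself contains a colon.
        host, _, state = line.rpartition(": ")
        states[host] = state
    return states


def all_in_state(output, expected):
    """Return True when at least one host is listed and all report the expected state."""
    states = parse_status(output)
    return bool(states) and all(state == expected for state in states.values())


example = "10.0.0.1:8443: Running\n10.0.0.2:8443: Running\n"
print(parse_status(example))
print(all_in_state(example, "Running"))  # prints True
```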

Step 7: Confirm that Loading is Operating Correctly

With your pipeline in place and running, data immediately begins loading from the Kafka topics that are configured in the pipeline. If you do not have data in the Kafka topics yet, now is a good time to start producing data into the topics.

Producing Test Data into Kafka:

For test purposes, kafkacat is a helpful utility that makes it easy to produce records into a topic. For example, if you have a file of sample data orders.jsonl in a JSONL format (newline delimited JSON records), you can run the following command to send those records into your Kafka broker:
Shell
kafkacat -b <broker_ip_address>:9092 -t <topic_name> -T -P -l orders.jsonl
Assuming your broker is running at 10.0.0.3 and you want to send data into the four topics defined in your pipeline.json definition, you can run:
Shell
kafkacat -b 10.0.0.3:9092 -t orders -T -P -l orders.jsonl
kafkacat -b 10.0.0.3:9092 -t products -T -P -l products.jsonl
kafkacat -b 10.0.0.3:9092 -t people -T -P -l people.jsonl
kafkacat -b 10.0.0.3:9092 -t reviews -T -P -l reviews.jsonl
Each of these commands will push the entire JSONL file of messages into Kafka with one record per line. As these are produced into Kafka, your running pipeline will begin loading them into Ocient.
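Because each line becomes one Kafka message, a malformed line in a JSONL file turns into a record that the pipeline cannot deserialize. A quick local check before producing can catch that. The following Python sketch counts the valid JSON lines and reports the line numbers that fail to parse; the function name `check_jsonl` is an illustrative assumption, and the sketch skips blank lines, which may differ from how your producer treats them.

```python
import json


def check_jsonl(lines):
    """Return (valid_count, bad_line_numbers) for an iterable of JSONL lines."""
    valid, bad = 0, []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped in this sketch
        try:
            json.loads(line)
            valid += 1
        except json.JSONDecodeError:
            bad.append(number)
    return valid, bad


sample = [
    '{"id": 1, "product_id": 1, "rating": 5}',
    '{"id": 2, "product_id": 1, "rating": ',  # truncated record
    '{"id": 3, "product_id": 1, "rating": 5}',
]
print(check_jsonl(sample))  # prints (2, [2])
```

To check a file instead of a list, pass an open file handle, for example `check_jsonl(open("orders.jsonl"))`.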

Observing Loading Progress:

With data in Kafka, your pipeline begins loading data immediately and streams any new data into Ocient. To observe this progress, you can monitor the metrics endpoint of the Loading and Transformation Nodes, either manually from a command line or from a monitoring tool. For this example, run a curl command against the endpoint and review the result. For details on metrics, see the LAT Metrics Documentation. Command:
CURL
curl https://127.0.0.1:8443/v2/metrics/lat:type=pipeline
If your LAT is running without TLS configured, replace the port number of your LAT Hosts with 8080 and the protocol with http://.
Example response:
JSON
{
  "request": {
  "mbean": "lat:type=pipeline",
  "type": "read"
  },
  "value": {
  "partitions": [
    {
        "offsets_durable": 1,
        "pushes_errors": 0,
        "pushes_attempts": 1,
        "rows_pushed": 1,
        "offsets_written": 18759,
        "records_buffered": 0,
        "records_errors_column": 0,
        "records_errors_deserialization": 0,
        "source_bytes_buffered": 0,
        "records_errors_transformation": 0,
        "offsets_processed": 18759,
        "partition": "table_orders-0",
        "records_filter_accepted": 1,
        "records_errors_row": 0,
        "records_filter_rejected": 0,
        "records_errors_generic": 0,
        "producer_send_attempts": 0,
        "offsets_pushed": 18759,
        "pushes_unacknowledged": 0,
        "invalid_state": 0,
        "bytes_pushed": 88,
        "records_errors_total": 0,
        "offsets_buffered": 18759,
        "complete": 0,
        "offsets_end": 1,
        "producer_send_errors": 0
    },
    {
        "offsets_durable": 1,
        "pushes_errors": 0,
        "pushes_attempts": 1,
        "rows_pushed": 1,
        "offsets_written": 2499,
        "records_buffered": 0,
        "records_errors_column": 0,
        "records_errors_deserialization": 0,
        "source_bytes_buffered": 0,
        "records_errors_transformation": 0,
        "offsets_processed": 2499,
        "partition": "table_people-0",
        "records_filter_accepted": 1,
        "records_errors_row": 0,
        "records_filter_rejected": 0,
        "records_errors_generic": 0,
        "producer_send_attempts": 0,
        "offsets_pushed": 2499,
        "pushes_unacknowledged": 0,
        "invalid_state": 0,
        "bytes_pushed": 223,
        "records_errors_total": 0,
        "offsets_buffered": 2499,
        "complete": 0,
        "offsets_end": 1,
        "producer_send_errors": 0
      },
      {
        "offsets_durable": 1,
        "pushes_errors": 0,
        "pushes_attempts": 1,
        "rows_pushed": 1,
        "offsets_written": 199,
        "records_buffered": 0,
        "records_errors_column": 0,
        "records_errors_deserialization": 0,
        "source_bytes_buffered": 0,
        "records_errors_transformation": 0,
        "offsets_processed": 199,
        "partition": "table_products-0",
        "records_filter_accepted": 1,
        "records_errors_row": 0,
        "records_filter_rejected": 0,
        "records_errors_generic": 0,
        "producer_send_attempts": 0,
        "offsets_pushed": 199,
        "pushes_unacknowledged": 0,
        "invalid_state": 0,
        "bytes_pushed": 145,
        "records_errors_total": 0,
        "offsets_buffered": 199,
        "complete": 0,
        "offsets_end": 1,
        "producer_send_errors": 0
      },
      {
        "offsets_durable": 1,
        "pushes_errors": 0,
        "pushes_attempts": 1,
        "rows_pushed": 1,
        "offsets_written": 1111,
        "records_buffered": 0,
        "records_errors_column": 0,
        "records_errors_deserialization": 0,
        "source_bytes_buffered": 0,
        "records_errors_transformation": 0,
        "offsets_processed": 1111,
        "partition": "table_reviews-0",
        "records_filter_accepted": 1,
        "records_errors_row": 0,
        "records_filter_rejected": 0,
        "records_errors_generic": 0,
        "producer_send_attempts": 0,
        "offsets_pushed": 1111,
        "pushes_unacknowledged": 0,
        "invalid_state": 0,
        "bytes_pushed": 184,
        "records_errors_total": 0,
        "offsets_buffered": 1111,
        "complete": 0,
        "offsets_end": 1,
        "producer_send_errors": 0
      }
  ],
  "paused": 1,
  "bytes_buffered": 0,
  "workers": 20
  },
  "timestamp": 1626970368,
  "status": 200
}
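The response is verbose, so a short script can reduce it to the fields you usually watch. The Python sketch below works on a response that has already been decoded with `json.loads` and keeps only the partition name, `offsets_processed`, and `records_errors_total`; the function names are illustrative, and the trimmed `response` dictionary contains a subset of the fields shown in the example above.

```python
def summarize(metrics):
    """Return {partition: {"offsets_processed": n, "errors": n}} from a metrics response."""
    return {
        p["partition"]: {
            "offsets_processed": p["offsets_processed"],
            "errors": p["records_errors_total"],
        }
        for p in metrics["value"]["partitions"]
    }


def partitions_with_errors(metrics):
    """Return the sorted names of partitions that report at least one record error."""
    return sorted(name for name, s in summarize(metrics).items() if s["errors"] > 0)


response = {
    "value": {
        "partitions": [
            {"partition": "table_orders-0", "offsets_processed": 18759, "records_errors_total": 0},
            {"partition": "table_people-0", "offsets_processed": 2499, "records_errors_total": 0},
        ]
    }
}
print(summarize(response))
print(partitions_with_errors(response))  # prints []
```

In the example response, offsets_processed for table_orders-0 is 18759. Assuming zero-based Kafka offsets and a single partition per topic, that is consistent with the 18760 rows counted in the orders table in the next section.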

Check Row Counts in Tables:

To confirm that results are arriving in the target tables, you can also run some simple queries to check row counts. Depending on the streamloader role settings, the time for records to become queryable can vary from a few seconds to minutes. Example queries:
SQL
Ocient> SELECT count(*) FROM public.orders;
count(*)
------------------------
18760
SQL
Ocient> SELECT count(*) FROM public.people;
count(*)
-----------------------
2500
SQL
Ocient> SELECT count(*) FROM public.products;
count(*)
--------------------
200
SQL
Ocient> SELECT count(*) FROM public.reviews;
count(*)
--------------------
1112
Success! Now you can explore the data in these four tables with any Ocient SQL queries. If more data is pushed into these topics, your pipeline is still running and will automatically load all new data.

Check Errors

In this example, all rows load successfully. However, a successful load does not always happen, and you can inspect errors using the LAT Client. Whenever the LAT process fails to parse a file correctly or fails to transform or load a record, the LAT process records an error. The LAT Client includes the lat_client pipeline errors command that reports the latest errors on the pipeline. A full error log is available on the Loader Nodes. These logs report all bad records and the reason that the load fails.
When you load a pipeline from Kafka, the load might route errors to an error topic on the Kafka broker instead of the logs. The LAT Client does not contain the errors sent to the error topic. You can inspect these errors with Kafka utilities instead.
The following LAT Client command displays a maximum of 100 error messages:
lat_client pipeline errors --max-errors 100 --only-error-messages

|--------------------------------------------------|
| exception_message                                |
|--------------------------------------------------|
| Column name: time1. Message: Failed to evaluate  |
| expression. Cause:                               |
| java.time.format.DateTimeParseException          |
|--------------------------------------------------|
| Column name: time1. Message: Failed to evaluate  |
| expression. Cause:                               |
| java.time.format.DateTimeParseException          |
|--------------------------------------------------|
The errors indicate that there is an issue parsing the time1 column. The pipeline errors command has options to return JSON and to restrict the response to specific components of the error detail, including a reference to the source location of each record. The following command returns newline-delimited JSON, which you can pass to jq or write to a file. The JSON includes the source topic or file group, the filename where the error occurred, the offset (a line number or a Kafka offset), and the exception message, which helps you troubleshoot and identify the incorrect record in the source data. When appropriate, you can use the log_original_message pipeline setting to provide direct access to the parsed source record for errors.
lat_client pipeline errors --max-errors 100 --json

{"time": "2022-05-17T16:53:50.387386+00:00", "topic": "calcs", "partition": 0, "state": "TRANSFORMATION_ERROR", "exception_message": "Column name: time1. Message: Failed to evaluate expression. Cause: java.time.format.DateTimeParseException: Cannot parse time \"19:36:22\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')':'Value(MinuteOfHour,2)':'Value(SecondOfMinute,2)\"\njava.time.format.DateTimeParseException: Cannot parse time \"19:36:22\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')':'Value(MinuteOfHour,2)':'Value(SecondOfMinute,2)\"\nCannot parse time \"19:36:22\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')':'Value(MinuteOfHour,2)':'Value(SecondOfMinute,2)\"", "offset": 0, "record": null, "metadata": {"size": "3321", "filename": "calcs/csv/calcs_01.csv"}}
{"time": "2022-05-17T16:53:50.404684+00:00", "topic": "calcs", "partition": 0, "state": "TRANSFORMATION_ERROR", "exception_message": "Column name: time1. Message: Failed to evaluate expression. Cause: java.time.format.DateTimeParseException: Cannot parse time \"02:05:25\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')':'Value(MinuteOfHour,2)':'Value(SecondOfMinute,2)\"\njava.time.format.DateTimeParseException: Cannot parse time \"02:05:25\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')':'Value(MinuteOfHour,2)':'Value(SecondOfMinute,2)\"\nCannot parse time \"02:05:25\" with format string \"Value(HourOfDay,2)Offset(+HHmm,'Z')':'Value(MinuteOfHour,2)':'Value(SecondOfMinute,2)\"", "offset": 1, "record": null, "metadata": {"size": "3321", "filename": "calcs/csv/calcs_01.csv"}}
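With many errors, grouping the newline-delimited JSON by error state and column makes the common failure stand out. The Python sketch below does this for output saved from the `--json` command; it assumes each line has the `state` and `exception_message` fields shown in the example and that column-level messages begin with `Column name: <name>.`, as they do there. The function name `summarize_errors` is hypothetical, and the sample lines are shortened versions of the example output.

```python
import json
from collections import Counter

PREFIX = "Column name: "


def summarize_errors(ndjson_text):
    """Count errors per (state, column); column is None when the message names no column."""
    counts = Counter()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        error = json.loads(line)
        message = error["exception_message"]
        column = None
        if message.startswith(PREFIX):
            # The column name runs from the prefix up to the first period.
            column = message[len(PREFIX):].split(".", 1)[0]
        counts[(error["state"], column)] += 1
    return counts


sample = "\n".join(
    json.dumps(e)
    for e in [
        {"topic": "calcs", "state": "TRANSFORMATION_ERROR", "offset": 0,
         "exception_message": "Column name: time1. Message: Failed to evaluate expression."},
        {"topic": "calcs", "state": "TRANSFORMATION_ERROR", "offset": 1,
         "exception_message": "Column name: time1. Message: Failed to evaluate expression."},
    ]
)
print(summarize_errors(sample))
# prints Counter({('TRANSFORMATION_ERROR', 'time1'): 2})
```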
Last modified on May 27, 2026