Load Data
Load CSV Data from S3

A common setup for batch loading files into  is to load from a bucket on  S3 with time partitioned data. In many instances, a batch load is performed on a recurring basis to load new files. The LAT transforms each document into rows in one or more different tables. Ocient’s Loading and Transformation capabilities use a simple SQL-like syntax for transforming data. This tutorial will guide users through a simple example load using a small set of data in CSV format. The data in this example is created from a test set for the  Business Intelligence tool.
Prerequisites
This tutorial assumes that:
The Ocient System has network access to S3 from the Loader Nodes.
An Ocient System is installed and configured with an active Storage Cluster (See the Ocient Application Configuration﻿ guide).
Loading and Transformation is installed on the Loader Nodes.
A default "sink" for the Ocient Loader Nodes is configured on the system.
The LAT Client Command Line Interface﻿ is installed.
Step 1: Create a New Database
To begin, load two example tables in a database. First, connect to a SQL Node using the Commands Supported by the Ocient JDBC CLI Program ﻿. Then run the following DDL command:
SQL
CREATE DATABASE metabase;
CREATE DATABASE metabase;
﻿
Step 2: Create Tables
To create tables in the new database, first connect to that database (e.g., connect to jdbc:ocient://sql-node:4050/metabase), then run the following DDL commands:
SQL
CREATE TABLE public.orders (
  created_at TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  id INT NOT NULL,
  user_id INT NOT NULL,
  product_id INT NOT NULL,
  subtotal DOUBLE,
  tax DOUBLE,
  total DOUBLE,
  discount DOUBLE,
  quantity INT,
  CLUSTERING INDEX idx01 (user_id, product_id)
);

CREATE TABLE public.products(
  created_at TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  id INT NOT NULL,
  ean VARCHAR(255),
  title VARCHAR(255),
  category VARCHAR(255) COMPRESSION GDC(2) NOT NULL,
  vendor VARCHAR(255),
  price DOUBLE,
  rating DOUBLE,
  CLUSTERING INDEX idx01 (category)
);
CREATE TABLE public.orders (
  created_at TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  id INT NOT NULL,
  user_id INT NOT NULL,
  product_id INT NOT NULL,
  subtotal DOUBLE,
  tax DOUBLE,
  total DOUBLE,
  discount DOUBLE,
  quantity INT,
  CLUSTERING INDEX idx01 (user_id, product_id)
);

CREATE TABLE public.products(
  created_at TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  id INT NOT NULL,
  ean VARCHAR(255),
  title VARCHAR(255),
  category VARCHAR(255) COMPRESSION GDC(2) NOT NULL,
  vendor VARCHAR(255),
  price DOUBLE,
  rating DOUBLE,
  CLUSTERING INDEX idx01 (category)
);
﻿
Now, the database tables are created and you can begin loading data.
Step 3: Create a Data Pipeline
Data pipelines are created using a simple loading configuration that is submitted to the Transformation nodes to start loading. File Groups designate a batch of files to load. Each File Group is routed to one or more Ocient tables, and each column is the result of a transformation applied to the source document.
First, inspect the data that you plan to load. Each document has a format similar to the following example:
Text
/* orders */
id,user_id,product_id,subtotal,tax,total,discount,created_at,quantity
1,1,14,37.65,2.07,39.72,null,2019-02-11T21:40:27.892Z,2
2,1,123,110.93,6.1,117.03,null,2018-05-15T08:04:04.580Z,3
3,1,105,52.72,2.9,49.2,6.42,2019-12-06T22:22:48.544Z,2
...

/* products */
id,ean,title,category,vendor,price,rating,created_at
1,1018947080336,Rustic Paper Wallet,Gizmo,"Swaniawski, Casper and Hilll",29.46,4.6,2017-07-19T19:44:56.582Z
2,7663515285824,Small Marble Shoes,Doohickey,Balistreri-Ankunding,70.08,0,2019-04-11T08:49:35.932Z
3,4966277046676,Synergistic Granite Chair,Doohickey,"Murray, Watsica and Wunsch",35.39,4,2018-09-08T22:03:20.239Z
...
/* orders */
id,user_id,product_id,subtotal,tax,total,discount,created_at,quantity
1,1,14,37.65,2.07,39.72,null,2019-02-11T21:40:27.892Z,2
2,1,123,110.93,6.1,117.03,null,2018-05-15T08:04:04.580Z,3
3,1,105,52.72,2.9,49.2,6.42,2019-12-06T22:22:48.544Z,2
...

/* products */
id,ean,title,category,vendor,price,rating,created_at
1,1018947080336,Rustic Paper Wallet,Gizmo,"Swaniawski, Casper and Hilll",29.46,4.6,2017-07-19T19:44:56.582Z
2,7663515285824,Small Marble Shoes,Doohickey,Balistreri-Ankunding,70.08,0,2019-04-11T08:49:35.932Z
3,4966277046676,Synergistic Granite Chair,Doohickey,"Murray, Watsica and Wunsch",35.39,4,2018-09-08T22:03:20.239Z
...
﻿
This is similar to the target schema created in Step 2, but it will require some transformation. Most transformations are identical to functions already in Ocient’s SQL dialect. To route data to the tables, you need to create a pipeline.json file that has the following structure:
JSON
{
    "version": 2,
    "workers": 4,
    "source": {
        "compression": "none",
        "type": "s3",
        "endpoint": "https://s3.us-east-1.amazonaws.com",
        "bucket": "ocient-examples",
        "file_groups": {
            "orders": {
                "prefix": "metabase_samples/csv",
                "file_matcher_syntax": "glob",
                "file_matcher_pattern": "**orders*.csv",
                "sort_type": "lexicographic"
            },
            "products": {
                "prefix": "metabase_samples/csv",
                "file_matcher_syntax": "glob",
                "file_matcher_pattern": "**products*.csv",
                "sort_type": "lexicographic"
            }
        }
    },
    "extract": {
        "orders": {
            "record_type": "delimited",
            "headers": [
                "id",
                "user_id",
                "product_id",
                "subtotal",
                "tax",
                "total",
                "discount",
                "created_at",
                "quantity"
            ],
            "null_strings": [
                "NULL",
                "N/A"
            ],
            "record_delimiter": "\r\n",
            "num_header_lines": 1
        },
        "products": {
            "record_type": "delimited",
            "headers": [
                "id",
                "ean",
                "title",
                "category",
                "vendor",
                "price",
                "rating",
                "created_at"
            ],
            "record_delimiter": "\r\n",
            "num_header_lines": 1
        }
    },
    "transform": {
        "file_groups": {
            "orders": {
                "tables": {
                    "metabase.public.orders": {
                        "columns": {
                            "id": "id",
                            "user_id": "user_id",
                            "product_id": "product_id",
                            "subtotal": "subtotal",
                            "tax": "tax",
                            "total": "total",
                            "discount": "discount",
                            "created_at": "to_timestamp(created_at, 'yyyy-MM-dd\\'T\\'HH:mm:ss[.SSS]X')",
                            "quantity": "quantity"
                        }
                    }
                }
            },
            "products": {
                "tables": {
                    "metabase.public.products": {
                        "columns": {
                            "id": "id",
                            "ean": "ean",
                            "title": "title",
                            "category": "category",
                            "vendor": "vendor",
                            "price": "price",
                            "rating": "rating",
                            "created_at": "to_timestamp(created_at, 'yyyy-MM-dd\\'T\\'HH:mm:ss[.SSS]X')"
                        }
                    }
                }
            }
        }
    }
}
{
    "version": 2,
    "workers": 4,
    "source": {
        "compression": "none",
        "type": "s3",
        "endpoint": "https://s3.us-east-1.amazonaws.com",
        "bucket": "ocient-examples",
        "file_groups": {
            "orders": {
                "prefix": "metabase_samples/csv",
                "file_matcher_syntax": "glob",
                "file_matcher_pattern": "**orders*.csv",
                "sort_type": "lexicographic"
            },
            "products": {
                "prefix": "metabase_samples/csv",
                "file_matcher_syntax": "glob",
                "file_matcher_pattern": "**products*.csv",
                "sort_type": "lexicographic"
            }
        }
    },
    "extract": {
        "orders": {
            "record_type": "delimited",
            "headers": [
                "id",
                "user_id",
                "product_id",
                "subtotal",
                "tax",
                "total",
                "discount",
                "created_at",
                "quantity"
            ],
            "null_strings": [
                "NULL",
                "N/A"
            ],
            "record_delimiter": "\r\n",
            "num_header_lines": 1
        },
        "products": {
            "record_type": "delimited",
            "headers": [
                "id",
                "ean",
                "title",
                "category",
                "vendor",
                "price",
                "rating",
                "created_at"
            ],
            "record_delimiter": "\r\n",
            "num_header_lines": 1
        }
    },
    "transform": {
        "file_groups": {
            "orders": {
                "tables": {
                    "metabase.public.orders": {
                        "columns": {
                            "id": "id",
                            "user_id": "user_id",
                            "product_id": "product_id",
                            "subtotal": "subtotal",
                            "tax": "tax",
                            "total": "total",
                            "discount": "discount",
                            "created_at": "to_timestamp(created_at, 'yyyy-MM-dd\\'T\\'HH:mm:ss[.SSS]X')",
                            "quantity": "quantity"
                        }
                    }
                }
            },
            "products": {
                "tables": {
                    "metabase.public.products": {
                        "columns": {
                            "id": "id",
                            "ean": "ean",
                            "title": "title",
                            "category": "category",
                            "vendor": "vendor",
                            "price": "price",
                            "rating": "rating",
                            "created_at": "to_timestamp(created_at, 'yyyy-MM-dd\\'T\\'HH:mm:ss[.SSS]X')"
                        }
                    }
                }
            }
        }
    }
}
﻿
The two interesting parts of this pipeline.json file are the way the file groups and the extraction settings are defined. 
First, note that each file group sets the S3 endpoint, a bucket, a prefix used for filtering the considered files, and then a file matcher. In this example, there is only a single file, but if there were many files matching the pattern **orders*.csv then they would all be part of the file group. You do not need S3 credentials because this is a public bucket, but if it were private, you can supply credentials in a few different ways.
Next, note the extract section. Here, the example specifies a delimited format. If it were compressed (e.g., gzip), you could specify compression. The example also specifies the headers to associate with each column in the CSV data. Delimited extracts can specify different record delimiters (e.g., \n), specify field delimiters (e.g., |, \t, ,), define how to handle empty fields or to trim whitespace, and specify strings that should be considered NULL. While not used in this file, an example of null strings is provided that would turn the string literal "NULL" or "N/A" into a database NULL.
The final parameter supplied in the example is the sort type for the file load. This informs the LAT how you would like data to be ordered when loading. The ideal sort organizes files in time order according to the defined . This makes more efficient segments and is much faster to load. This example uses the lexicographic sort which orders according to the characters in the file name. Other sort types are available to use file modified time or to extract the timestamp for sorting from the file path or file name.
Step 4: Using the Loading and Transformation CLI
With a pipeline.json file ready to go, you can test this pipeline. To test, use the LAT CLI. For these examples, assume that two LATs are configured and set via an environment variable.
First, configure the LAT CLI to use the hosts of our Loading and Transformation service. You can add these to every CLI command as a flag, but for simplicity you can also set them as environment variables. From a command line, run the following command replacing the IP addresses with the IP addresses of your LAT processes:
Shell
export LAT_HOSTS="https://10.0.0.1:8443,https://10.0.0.2:8443"
export LAT_HOSTS="https://10.0.0.1:8443,https://10.0.0.2:8443"
﻿
Next, check on the status of the LAT:
Shell
lat_client pipeline status
lat_client pipeline status
﻿
Example response:
Bash
10.0.0.1:8443: Stopped
10.0.0.2:8443: Stopped
10.0.0.1:8443: Stopped
10.0.0.2:8443: Stopped
﻿
Success! This confirms that you can reach the LAT from our CLI. If the status is "Running" it means a pipeline is already executing a pipeline. In the next step, you will update and start the new pipeline.
This example uses secure connections. If you receive an SSL Error when testing, your service cannot be configured to use TLS or you might need to use the --no-verify flag if certificate validation fails.
Step 5: Test the Transformation
The CLI supports previewing a transformation with an example document and the pipeline file. This makes it easy to test your transformations.
First, save an example document to your file system to use for this test. For this demo, you can download an example file from https://ocient-examples.s3.amazonaws.com/metabase_samples/csv/orders.csv and save it to ~/orders.csv.
Next, make sure the pipeline.json file that you created is stored at ~/pipeline.json.
Now that both files are available, you can run the CLI to preview the results. Pass the preview command the topic name, the pipeline file, and the sample record file. The response contains the transformed data tied to the destination table and a list of any error records.
Similar to how you can preview records on a  topic for file loads, you can supply any one file_groups created in the extract section to preview the transformations.
Shell
lat_client preview --topic orders --pipeline ~/pipeline.json --records ~/orders.csv
lat_client preview --topic orders --pipeline ~/pipeline.json --records ~/orders.csv
﻿
Example response:
JSON
{
    "tableRecords": {
        "metabase.public.orders": [
            {
                "id": 1,
                "user_id": 1,
                "product_id": 14,
                "subtotal": 37.65,
                "tax": 2.07,
                "total": 39.72,
                "discount": null,
                "created_at": 1549921227892000000,
                "quantity": 2
            },
            {
                "id": 2,
                "user_id": 1,
                "product_id": 123,
                "subtotal": 110.93,
                "tax": 6.1,
                "total": 117.03,
                "discount": null,
                "created_at": 1526371444580000000,
                "quantity": 3
            },
            {
                "id": 3,
                "user_id": 1,
                "product_id": 105,
                "subtotal": 52.72,
                "tax": 2.9,
                "total": 49.2,
                "discount": 6.42,
                "created_at": 1575670968544000000,
                "quantity": 2
            }
        ]
    },
    "recordErrors": []
}
{
    "tableRecords": {
        "metabase.public.orders": [
            {
                "id": 1,
                "user_id": 1,
                "product_id": 14,
                "subtotal": 37.65,
                "tax": 2.07,
                "total": 39.72,
                "discount": null,
                "created_at": 1549921227892000000,
                "quantity": 2
            },
            {
                "id": 2,
                "user_id": 1,
                "product_id": 123,
                "subtotal": 110.93,
                "tax": 6.1,
                "total": 117.03,
                "discount": null,
                "created_at": 1526371444580000000,
                "quantity": 3
            },
            {
                "id": 3,
                "user_id": 1,
                "product_id": 105,
                "subtotal": 52.72,
                "tax": 2.9,
                "total": 49.2,
                "discount": 6.42,
                "created_at": 1575670968544000000,
                "quantity": 2
            }
        ]
    },
    "recordErrors": []
}
﻿
You can see that the data is transformed and the columns to which each transformed value will be mapped. If there are issues in the values, these will appear in the recordErrors object. You can quickly update the pipeline.json file and preview again. Now, you can inspect different documents to confirm that various states of data cleanliness like missing columns, null values, and special characters are well handled by your transformations.
Step 6: Configure and Start the Data Pipeline
With a tested transformation, the next step is to setup and start the data pipeline.
First, configure the pipeline using the pipeline create command. This validates and creates the pipeline, but will not take effect until you start the pipeline:
Shell
lat_client pipeline create --pipeline ~/pipeline.json
lat_client pipeline create --pipeline ~/pipeline.json
﻿
Example response:
Bash
10.0.0.1:8443: Created
10.0.0.2:8443: Created
10.0.0.1:8443: Created
10.0.0.2:8443: Created
﻿
In cases where there is an existing pipeline operating, it is necessary to stop the pipeline and remove the original pipeline before creating and starting the new pipeline.
Now that the pipeline has been created on all LAT nodes, you can start the LAT by running the pipeline start commands:
Shell
lat_client pipeline start
lat_client pipeline start
﻿
Example responses:
Bash
10.0.0.1:8443: Running
10.0.0.2:8443: Running
10.0.0.1:8443: Running
10.0.0.2:8443: Running
﻿
Step 7: Confirm that Loading is Operating Correctly
With your pipeline in place and running, data will immediately begin loading from the S3 file groups that you defined. If there were many files per file group, the LAT would first sort the files, then partition them for the fastest loading based on the sorting criteria you provided.
Observing Loading Progress
With the pipeline running, data immediately begins to load into Ocient. To observe this progress, you can use the pipeline status command from the LAT Client or monitor the LAT metrics endpoint of the Loader Nodes.
You can check the status with this command by using the --list-files flag to include a summary of the files included in the load.
Shell
lat_client pipeline status --list-files
lat_client pipeline status --list-files
﻿
Example responses:
Bash
10.0.0.1:8443: Running
10.0.0.2:8443: Running

Pipeline Files Processed: 0
Pipeline Error Count: 0
Pipeline Files Remaining: 2

orders
  Files Processed: 0
  Error Count: 0
  Files Remaining: 1

products
  Files Processed: 0
  Error Count: 0
  Files Remaining: 1


orders

Status        Filename
processing    metabase_samples/csv/orders.csv

In Process Files:
processing    metabase_samples/csv/orders.csv


products

Status        Filename
processing    metabase_samples/csv/products.csv

In Process Files:
processing    metabase_samples/csv/products.csv
10.0.0.1:8443: Running
10.0.0.2:8443: Running

Pipeline Files Processed: 0
Pipeline Error Count: 0
Pipeline Files Remaining: 2

orders
  Files Processed: 0
  Error Count: 0
  Files Remaining: 1

products
  Files Processed: 0
  Error Count: 0
  Files Remaining: 1


orders

Status        Filename
processing    metabase_samples/csv/orders.csv

In Process Files:
processing    metabase_samples/csv/orders.csv


products

Status        Filename
processing    metabase_samples/csv/products.csv

In Process Files:
processing    metabase_samples/csv/products.csv
﻿
You can monitor the LAT metrics endpoint manually at the command line. Or, you can use a tool like  to retrieve metrics. For this example, run the curl command against the endpoint and review the result. For details on metrics, see the Metrics﻿ Documentation.
Command:
Curl
curl http://127.0.0.1:8080/v2/metrics/lat:type=pipeline
curl http://127.0.0.1:8080/v2/metrics/lat:type=pipeline
﻿
Example response:
JSON
{
  "request": {
  "mbean": "lat:type=pipeline",
  "type": "read"
  },
  "value": {
  "partitions": [
    {
        "offsets_durable": 1,
        "pushes_errors": 0,
        "pushes_attempts": 1,
        "rows_pushed": 1,
        "offsets_written": 18759,
        "records_buffered": 0,
        "records_errors_column": 0,
        "records_errors_deserialization": 0,
        "source_bytes_buffered": 0,
        "records_errors_transformation": 0,
        "offsets_processed": 18759,
        "partition": "table_orders-0",
        "records_filter_accepted": 1,
        "records_errors_row": 0,
        "records_filter_rejected": 0,
        "records_errors_generic": 0,
        "producer_send_attempts": 0,
        "offsets_pushed": 18759,
        "pushes_unacknowledged": 0,
        "invalid_state": 0,
        "bytes_pushed": 88,
        "records_errors_total": 0,
        "offsets_buffered": 18759,
        "complete": 0,
        "offsets_end": 1,
        "producer_send_errors": 0
    },
    {
    "offsets_durable": 1,
    "pushes_errors": 0,
    "pushes_attempts": 1,
    "rows_pushed": 1,
    "offsets_written": 199,
    "records_buffered": 0,
    "records_errors_column": 0,
    "records_errors_deserialization": 0,
    "source_bytes_buffered": 0,
    "records_errors_transformation": 0,
    "offsets_processed": 199,
    "partition": "table_products-0",
    "records_filter_accepted": 1,
    "records_errors_row": 0,
    "records_filter_rejected": 0,
    "records_errors_generic": 0,
    "producer_send_attempts": 0,
    "offsets_pushed": 199,
    "pushes_unacknowledged": 0,
    "invalid_state": 0,
    "bytes_pushed": 145,
    "records_errors_total": 0,
    "offsets_buffered": 199,
    "complete": 0,
    "offsets_end": 1,
    "producer_send_errors": 0
    }
  ],
  "paused": 1,
  "bytes_buffered": 0,
  "workers": 20
  },
  "timestamp": 1626970368,
  "status": 200
}
{
  "request": {
  "mbean": "lat:type=pipeline",
  "type": "read"
  },
  "value": {
  "partitions": [
    {
        "offsets_durable": 1,
        "pushes_errors": 0,
        "pushes_attempts": 1,
        "rows_pushed": 1,
        "offsets_written": 18759,
        "records_buffered": 0,
        "records_errors_column": 0,
        "records_errors_deserialization": 0,
        "source_bytes_buffered": 0,
        "records_errors_transformation": 0,
        "offsets_processed": 18759,
        "partition": "table_orders-0",
        "records_filter_accepted": 1,
        "records_errors_row": 0,
        "records_filter_rejected": 0,
        "records_errors_generic": 0,
        "producer_send_attempts": 0,
        "offsets_pushed": 18759,
        "pushes_unacknowledged": 0,
        "invalid_state": 0,
        "bytes_pushed": 88,
        "records_errors_total": 0,
        "offsets_buffered": 18759,
        "complete": 0,
        "offsets_end": 1,
        "producer_send_errors": 0
    },
    {
    "offsets_durable": 1,
    "pushes_errors": 0,
    "pushes_attempts": 1,
    "rows_pushed": 1,
    "offsets_written": 199,
    "records_buffered": 0,
    "records_errors_column": 0,
    "records_errors_deserialization": 0,
    "source_bytes_buffered": 0,
    "records_errors_transformation": 0,
    "offsets_processed": 199,
    "partition": "table_products-0",
    "records_filter_accepted": 1,
    "records_errors_row": 0,
    "records_filter_rejected": 0,
    "records_errors_generic": 0,
    "producer_send_attempts": 0,
    "offsets_pushed": 199,
    "pushes_unacknowledged": 0,
    "invalid_state": 0,
    "bytes_pushed": 145,
    "records_errors_total": 0,
    "offsets_buffered": 199,
    "complete": 0,
    "offsets_end": 1,
    "producer_send_errors": 0
    }
  ],
  "paused": 1,
  "bytes_buffered": 0,
  "workers": 20
  },
  "timestamp": 1626970368,
  "status": 200
}
﻿
Check Row Counts in Tables
To confirm that you are seeing results in the target tables, you can also run some simple queries to check row counts. Depending on the streamloader role settings, the time for records to become queryable can vary from a few seconds to minutes:
Example Queries:
SQL
Ocient> SELECT count(*) FROM public.orders;
COUNT(*)
---------------------
18760
Ocient> SELECT count(*) FROM public.orders;
COUNT(*)
---------------------
18760
﻿
SQL
Ocient> SELECT count() FROM public.products; 
COUNT()
--------------------
200
Ocient> SELECT count() FROM public.products; 
COUNT()
--------------------
200
﻿
Now you can explore the data in these four tables with any Ocient SQL queries. 
﻿
﻿
Updated 17 Jan 2024
Did this page help you?
Load JSON Data from Kafka
Load JSON Data from S3