Prerequisites
This tutorial assumes that:- The Ocient System has network access to S3 from the Loader Nodes.
- An Ocient System is installed and configured with an active Storage Cluster (see the Ocient Application Configuration guide).
Step 1: Create a New Database
Connect to a SQL Node using the Commands Supported by the Ocient JDBC CLI Program. Then, execute theCREATE DATABASE SQL statement for a database named metabase.
SQL
Step 2: Create Tables
Create two target tables for loading data for two different products. Create theproduct129 table in the new database. First, connect to that database (e.g., connect to jdbc:ocient://sql-node:4050/metabase), and then execute this CREATE TABLE SQL statement that specifies to create a table with these columns and a clustering index based on the user_id and product_id columns:
created_atas a timestamp that is not nullable.id,user_id, andproduct_idas integers that are not nullable.subtotal,tax,total, anddiscountas floating point numbers.quantityas an integer.
SQL
product161 table in the same way.
SQL
product129and product161 tables, and you can begin loading data.
Step 3: Create a Data Pipeline
Create data pipelines using theCREATE PIPELINE SQL statement. To load data, you first create a pipeline with the definition of the source, data format, and transformation rules using a SQL-like declarative syntax. Then, you execute the START PIPELINE SQL statement to start the load. You can observe progress and status using system tables and views.
Each Ocient pipeline defines a single data source and the target table or tables into which data loads. A data source includes the location of the source and filters on the source to define the specific data set to load. This example loads data from a data source located in a directory within the S3 bucket.
First, inspect the data that you plan to load. Each document has a format similar to this example CSV file named orders.csv.
Text
orders_pipeline for the orders data set from your database connection prompt. Use the S3 data source with endpoint https://s3.us-east-1.amazonaws.com, bucket ocient-docs, and filter metabase_samples/csv/orders.csv. Specify the CSV format with one header line. Load the data into the two target tables. The SELECT part of the SQL statement maps the fields in the CSV file to the target columns in the created table.
SOURCE— In this case, load data from S3. Specify the endpoint and bucket. TheFILTERparameter identifies the file or files to load. The load uses a single CSV file. Options exist to add wildcards or regular expressions to isolate different file sets.EXTRACT— Set the format to CSV files and note that there is one header line in the file. This specification skips that row when the Ocient System processes the file. Many other options exist for delimited data such as a record delimiter and field delimiter.INTO ... SELECT— Choose the target tablespublic.product129andpublic.product161and select the fields from the CSV file. The numeric index identifies each file field. Importantly, similar to other SQL syntax, the first field in the file is$1, not$0. Each field maps to a target column using theassyntax. TheWHEREfilter instructs the data pipeline to load order data for only two products, one product for each table.
orders_pipeline pipeline, execute the START PIPELINE SQL statement.
Step 4: Observe the Load Progress
With your pipeline running, data immediately begins to load from the S3 files that you defined. If there are many files in each file group, the load process first sorts the files into batches, partitions them for parallel processing, and assigns them to Loader Nodes. You can check the pipeline status and progress by queryinginformation_schema.pipeline_status or by executing SHOW PIPELINE_STATUS.
COMPLETED, all data is available in the target table.
After a few seconds, the data is available for query in the public.product129 table.
SQL
public.product161 table.
SQL
DROP PIPELINE orders_pipeline; SQL statement. Execution of this statement leaves the data in your target table, but removes metadata about the pipeline execution from the system.

