Load and Analyze Data
This tutorial details the core workflow from start to finish: you connect to the system, create a database and table, load sample data from a source, validate the loaded data, and execute analytic queries. The example uses a sample data set that covers every supported Ocient data type. By the end of this process, you have a working table with loaded data that you can query immediately.

Prerequisites

The tutorial requires:

- Network access to the SQL nodes in your Ocient System and the IP address or hostname of at least one SQL node.
- A valid username and password with permissions to create databases, tables, and data pipelines.
- The Ocient JDBC driver and CLI installed on your local machine. For setup instructions, see Connect Using JDBC. If you do not have the JDBC driver installed, you can also connect using pyocient, the Ocient driver for Python. For setup instructions, see Connect Using PyOcient.

Step 1: Connect to the Ocient System

Start the JDBC CLI and connect to a SQL node. Replace the <hostname or ip> and <port> placeholders with the values for your system. The default SQL node port is 4050.

```sql
CONNECT TO jdbc:ocient://<hostname or ip>:<port>/system;
```

After a successful connection, you see the Ocient CLI prompt:

```
ocient>
```

For details on connection methods and troubleshooting, see Connect to Ocient.

Step 2: Create a Database and Schema

Create a database to hold your sample data. Using a dedicated database keeps your work isolated from other applications on the same system.

```sql
CREATE DATABASE test;
```

Switch your connection to the new database.

```sql
CONNECT TO jdbc:ocient://<hostname or ip>:<port>/test;
```

Create a schema within the database to organize your tables.

```sql
CREATE SCHEMA loading;
```

For details on managing databases and schemas, see Databases and Schemas.

Step 3: Create a Table

Create a table that covers all supported Ocient data types. This table serves as the target for the data pipeline in the next step. The CREATE TABLE SQL statement defines the column names, the data types, and a timekey on the col_timestamp column of the data_type_coverage table. The timekey partitions data by time so that the system can skip irrelevant segments during time-filtered queries.

```sql
CREATE TABLE loading.data_type_coverage (
  col_bigint     BIGINT,
  col_binary     BINARY(2),
  col_boolean    BOOLEAN,
  col_char       CHAR(64),
  col_date       DATE,
  col_decimal    DECIMAL(18, 4),
  col_double     DOUBLE,
  col_float      FLOAT,
  col_int        INT,
  col_int_array  INT[],
  col_ipv4       IPV4,
  col_ip         IP,
  col_smallint   SMALLINT,
  col_point      POINT,
  col_linestring LINESTRING,
  col_polygon    POLYGON,
  col_time       TIME,
  col_timestamp  TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  col_tinyint    TINYINT,
  col_tuple      TUPLE<<INT, VARCHAR(255)>> NULL,
  col_uuid       UUID,
  col_varbinary  VARBINARY,
  col_varchar    VARCHAR
);
```

This table intentionally keeps the schema simple to focus on the loading workflow. In production, you should also define a clustering key and secondary indexes based on your expected query patterns. For details on table design options, see:

- Timekeys and Clustering Keys: segment keys that partition and order data for faster queries.
- Secondary Indexes: additional indexes for columns used in filters.
- Table Compression Options: compression settings that reduce storage requirements.
- CREATE TABLE SQL Statement Examples: examples of table definitions for different use cases.
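If you prefer to script this setup instead of typing statements into the JDBC CLI, the same statements can be run through pyocient, the Python driver mentioned in the prerequisites. The sketch below is illustrative only: it assumes pyocient follows the standard Python DB-API (connect, cursor, execute) and accepts an ocient:// connection string, and the credentials and abbreviated DDL are placeholders rather than values from this tutorial.

```python
# Hypothetical setup script using pyocient (Python DB-API style).
# Assumes an "ocient://user:password@host:port/database" connection string;
# check the pyocient documentation for the exact connection options.
import pyocient

# Step 1: connect to the system database on a SQL node (default port 4050).
conn = pyocient.connect("ocient://user:password@<hostname or ip>:4050/system")
cur = conn.cursor()

# Step 2: create the database, then reconnect to it and create the schema.
cur.execute("CREATE DATABASE test")
conn.close()

conn = pyocient.connect("ocient://user:password@<hostname or ip>:4050/test")
cur = conn.cursor()
cur.execute("CREATE SCHEMA loading")

# Step 3: create the target table. The DDL is abbreviated here; use the full
# column list from the CREATE TABLE statement in Step 3 above.
cur.execute("""
    CREATE TABLE loading.data_type_coverage (
      col_bigint    BIGINT,
      col_timestamp TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
      col_varchar   VARCHAR
    )
""")
conn.close()
```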
Step 4: Create a Data Pipeline

Data pipelines are the way to load data into an Ocient System. Each pipeline is a SQL object that defines the source, data format, and transformations for loading rows into one or more tables. For a full overview of pipeline concepts, see Load Data.

Use a Sample Data Set

The sample data is a set of gzip-compressed CSV files hosted in a public S3 bucket. Each file contains one header row followed by data rows with columns covering every Ocient data type.

Create the Pipeline

This data pipeline reads CSV files from the S3 source, parses each field using named headers, applies type conversions where needed, and loads the results into the data_type_coverage table.

```sql
CREATE BATCH PIPELINE data_type_coverage_pipeline
SOURCE S3
  ENDPOINT 'https://s3.us-east-1.amazonaws.com'
  BUCKET 'ocient-docs'
  PREFIX 'all-data-types/small/'
  FILTER GLOB '*output.csv.gz'
  COMPRESSION METHOD 'gzip'
  SORT BY 'filename'
EXTRACT
  FORMAT DELIMITED
    RECORD DELIMITER e'\n'
    FIELD DELIMITER ','
    FIELD OPTIONALLY ENCLOSED BY '"'
    NUM HEADER LINES 1
    HEADERS [
      'bigint', 'binary', 'boolean', 'char', 'date', 'decimal', 'double',
      'float', 'int', 'int_array', 'ipv4', 'ip', 'smallint', 'point',
      'linestring', 'polygon', 'time', 'timestamp', 'tinyint', 'tuple',
      'uuid', 'varbinary', 'varchar'
    ]
    OPEN ARRAY '['
    CLOSE ARRAY ']'
    ARRAY ELEMENT DELIMITER ','
    OPEN OBJECT '('
    CLOSE OBJECT ')'
INTO test.loading.data_type_coverage
SELECT
  $bigint AS col_bigint,
  $binary AS col_binary,
  $boolean AS col_boolean,
  $char AS col_char,
  $date AS col_date,
  $decimal AS col_decimal,
  double(if(lower($double) = 'nan', NULL, $double)) AS col_double,
  float(if(lower($float) = 'nan', NULL, $float)) AS col_float,
  $int AS col_int,
  int[]($int_array) AS col_int_array,
  $ipv4 AS col_ipv4,
  $ip AS col_ip,
  $smallint AS col_smallint,
  $point AS col_point,
  $linestring AS col_linestring,
  $polygon AS col_polygon,
  $time AS col_time,
  replace(replace($timestamp, 't', ' '), 'z', '') AS col_timestamp,
  $tinyint AS col_tinyint,
  $tuple AS col_tuple,
  $uuid AS col_uuid,
  varbinary(if($varbinary IS NULL, NULL,
    if(length($varbinary) % 2 = 1, concat('0', $varbinary), $varbinary))) AS col_varbinary,
  $varchar AS col_varchar;
```

The pipeline has three sections:

- SOURCE: identifies the S3 bucket, file path prefix, and glob filter. The COMPRESSION METHOD parameter indicates that the source files are gzip-compressed.
- EXTRACT: defines the CSV format options, including delimiters, quoting, and the header names that the SELECT clause references. The OPEN ARRAY and CLOSE ARRAY parameters and the OPEN OBJECT and CLOSE OBJECT parameters tell the parser how array and tuple values are encoded in the CSV data.
- SELECT: maps each source field to a target column. Some fields require explicit type conversion. For example, the double() expression converts the extracted string to a DOUBLE value (mapping the string 'nan' to NULL), and the nested replace functions strip the T and Z characters from timestamps to match the Ocient timestamp format. The varbinary expression pads odd-length hex strings with a leading zero so that every non-NULL value converts to the VARBINARY data type.

For details on pipeline syntax and options, see Data Pipelines. For details on supported data formats, see Data Formats for Data Pipelines. For details on data type handling during loading, see Data Types for Data Pipelines. For the transformation functions, see Transform Data in Data Pipelines.
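To make the SELECT transformations concrete, the following Python sketch applies the same logic to raw CSV field values outside of any pipeline. The helper functions are hypothetical and exist only for illustration; they are not part of pyocient or the Ocient pipeline syntax.

```python
# Illustrative only: plain-Python equivalents of the transformations that the
# pipeline SELECT clause applies to raw CSV fields. Not part of any Ocient API.

def to_double(raw):
    """Mirror double(if(lower($double) = 'nan', NULL, $double))."""
    if raw is None or raw.lower() == "nan":
        return None
    return float(raw)

def to_timestamp(raw):
    """Mirror replace(replace($timestamp, 't', ' '), 'z', ''): strip the
    ISO 8601 T and Z so the value matches the Ocient timestamp format."""
    return raw.replace("T", " ").replace("Z", "").replace("t", " ").replace("z", "")

def to_varbinary_hex(raw):
    """Mirror the varbinary expression: pad odd-length hex strings with a
    leading zero so every non-NULL value converts cleanly to VARBINARY."""
    if raw is None:
        return None
    return "0" + raw if len(raw) % 2 == 1 else raw

print(to_double("NaN"))                       # None
print(to_timestamp("2024-01-15T08:30:00Z"))   # 2024-01-15 08:30:00
print(to_varbinary_hex("abc"))                # 0abc
```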
Step 5: Start and Monitor the Pipeline

Start the data pipeline to begin loading data using the START PIPELINE SQL statement.

```sql
START PIPELINE data_type_coverage_pipeline;
```

Check the pipeline status while it runs.

```sql
SHOW PIPELINE STATUS;
```

Output:

```
database_name | pipeline_name               | table_names                    | status  | percent_complete | records_processed | records_loaded | records_failed
test          | data_type_coverage_pipeline | ["loading.data_type_coverage"] | RUNNING | 0.0              | 0                 | 0              | 0
```

Wait for the status column to show the COMPLETED status. You can execute the SHOW PIPELINE STATUS statement again to check progress. For larger data sets, the percent_complete, records_processed, and records_loaded columns update as the pipeline progresses.

For more detailed monitoring options, including system catalog tables and metrics endpoints, see Monitor Data Pipelines. If the pipeline enters the FAILED status, see Manage Errors in Data Pipelines and Data Pipeline Loading Errors.

Step 6: Validate the Loaded Data

After the pipeline completes, verify that the data loaded correctly before you begin analysis.

Verify the Row Count

Confirm that the number of loaded rows matches the pipeline metrics.

```sql
SELECT count(*) AS total_rows FROM loading.data_type_coverage;
```

Cross-reference this count with the records_loaded value from the SHOW PIPELINE STATUS statement.

Check the Time Range

Verify that the col_timestamp column covers the expected date range. Gaps in the range can indicate missing source files.

```sql
SELECT
  min(col_timestamp) AS earliest,
  max(col_timestamp) AS latest,
  count(DISTINCT CAST(col_timestamp AS DATE)) AS distinct_days
FROM loading.data_type_coverage;
```

Inspect for NULL Values

Check for unexpected NULL values in columns that should be populated.

```sql
SELECT
  count(*) AS total_rows,
  count(col_bigint) AS non_null_bigint,
  count(col_double) AS non_null_double,
  count(col_timestamp) AS non_null_timestamp,
  count(col_varchar) AS non_null_varchar,
  count(col_int_array) AS non_null_int_array,
  count(col_tuple) AS non_null_tuple
FROM loading.data_type_coverage;
```

If a column has significantly fewer non-NULL values than expected, review the source data and the pipeline SELECT transformations for parsing issues.

Sample Records

Review a small set of records to confirm that type conversions applied correctly, using the LIMIT keyword.

```sql
SELECT * FROM loading.data_type_coverage LIMIT 5;
```
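If you monitor the load from a script rather than the CLI, the polling loop and row-count cross-check can be combined. The sketch below is a hedged example using pyocient: it assumes the driver follows the Python DB-API, that SHOW PIPELINE STATUS results can be fetched through a cursor, and that the columns arrive in the order shown in the output above; verify these details against your driver and system version.

```python
# Hypothetical monitoring and validation script using pyocient (DB-API style).
# Column positions below assume the SHOW PIPELINE STATUS layout shown above:
# 0 database_name, 1 pipeline_name, 2 table_names, 3 status,
# 4 percent_complete, 5 records_processed, 6 records_loaded, 7 records_failed.
import time
import pyocient

conn = pyocient.connect("ocient://user:password@<hostname or ip>:4050/test")
cur = conn.cursor()

# Step 5: poll until the pipeline reaches a terminal status.
while True:
    cur.execute("SHOW PIPELINE STATUS")
    rows = {row[1]: row for row in cur.fetchall()}  # keyed by pipeline_name
    status_row = rows["data_type_coverage_pipeline"]
    status, records_loaded = str(status_row[3]), status_row[6]
    print(f"status={status} records_loaded={records_loaded}")
    if status.upper() in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

# Step 6: cross-check the table row count against the pipeline metric.
cur.execute("SELECT count(*) FROM loading.data_type_coverage")
total_rows = cur.fetchone()[0]
print(f"table rows: {total_rows}, pipeline records_loaded: {records_loaded}")
conn.close()
```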
Step 7: Query the Data

With validated data in the table, you can begin running analytic queries. These examples demonstrate common query patterns across different data types.

Aggregate Numeric Data

Compute summary statistics on the numeric columns. In this case, retrieve the row count, the minimum and maximum values of the col_int column, and the average of the values in the col_smallint column.

```sql
SELECT
  count(*) AS total_rows,
  min(col_int) AS min_int,
  max(col_int) AS max_int,
  avg(col_smallint) AS avg_smallint
FROM loading.data_type_coverage;
```

Filter by Time Range

Use the col_timestamp timekey column to filter data efficiently. The Ocient System skips segments outside the specified time range, in this case January 2024.

```sql
SELECT
  CAST(col_timestamp AS DATE) AS day,
  count(*) AS row_count,
  count(DISTINCT col_int) AS distinct_ints
FROM loading.data_type_coverage
WHERE col_timestamp >= TIMESTAMP '2024-01-01 00:00:00'
  AND col_timestamp < TIMESTAMP '2024-02-01 00:00:00'
GROUP BY CAST(col_timestamp AS DATE)
ORDER BY day;
```

Query Geospatial Data

Inspect the geospatial columns loaded into the table for the first 10 rows, using the LIMIT keyword.

```sql
SELECT col_point, col_linestring, col_polygon
FROM loading.data_type_coverage
WHERE col_point IS NOT NULL
LIMIT 10;
```

For details on geospatial functions and spatial queries, see Geospatial Functions.

Explore Data Type Distributions

Summarize the number of distinct values in several columns.

```sql
SELECT
  count(DISTINCT col_boolean) AS distinct_boolean,
  count(DISTINCT col_tinyint) AS distinct_tinyint,
  count(DISTINCT col_smallint) AS distinct_smallint,
  count(DISTINCT col_int) AS distinct_int,
  count(DISTINCT col_uuid) AS distinct_uuid
FROM loading.data_type_coverage;
```

For a full reference of SQL functions, see Functions Overview. For tips on writing queries that leverage segment keys and indexes, see Query Performance Tuning.

Remove the Data Pipeline and Table

After you finish exploring the sample data, remove the data_type_coverage_pipeline pipeline.

```sql
DROP PIPELINE data_type_coverage_pipeline;
```

Remove the data_type_coverage table.

```sql
DROP TABLE loading.data_type_coverage;
```

Removing a pipeline removes the pipeline metadata from the system but does not affect the data already loaded into the target table. Removing the table removes the loaded data as well.

Related Links

- Data Pipelines
- Tables
- Understanding Data Types
- SQL Reference