> ## Documentation Index
> Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Load and Analyze Data

export const TimeKey = "TimeKey®";

export const Python = "Python®";

export const Ocient = "Ocient®";

export const AWS = "Amazon® Web Services℠ (AWS℠)";

This tutorial details the core {Ocient} workflow from start to finish. You connect to the system, create a database and table, load sample data from a source, validate the loaded data, and execute analytic queries.

The example uses a sample data set that covers every supported Ocient data type. By the end of this process, you have a working table with loaded data that you can query immediately.

## Prerequisites

The tutorial requires:

* Network access to the SQL Nodes in your Ocient System and the IP address or hostname of at least one SQL Node.
* A valid username and password with permissions to create databases, tables, and data pipelines.
* You have installed the Ocient JDBC driver and CLI on your local machine. For setup instructions, see [Connect Using JDBC](/connect-using-jdbc).

<Info>
  If you do not have the JDBC driver installed, you can also connect using `pyocient`, the Ocient {Python} driver. For setup instructions, see [Connect Using pyocient](/connect-using-pyocient).
</Info>

## Step 1: Connect to the Ocient System

Start the JDBC CLI and connect to a SQL Node. Replace the hostname `<HOSTNAME_OR_IP>` and port `<PORT>` placeholders with the values for your system. The default SQL Node port is `4050`.

```sql SQL theme={null}
CONNECT TO jdbc:ocient://<HOSTNAME_OR_IP>:<PORT>/system;
```

After a successful connection, you see the Ocient CLI prompt.

```shell Shell theme={null}
Ocient> _
```

For details on connection methods and troubleshooting, see [Connect to Ocient](/connect-to-ocient).

## Step 2: Create a Database and Schema

Create a database to hold your sample data. Using a dedicated database keeps your work isolated from other applications on the same system.

```sql SQL theme={null}
CREATE DATABASE test;
```

Switch your connection to the new database.

```sql SQL theme={null}
CONNECT TO jdbc:ocient://<HOSTNAME_OR_IP>:<PORT>/test;
```

Create a schema within the database to organize your tables.

```sql SQL theme={null}
CREATE SCHEMA loading;
```

For details on managing databases and schemas, see [Databases](/databases) and [Schemas](/schemas).

## Step 3: Create a Table

Create a table that covers all supported Ocient data types. This table serves as the target for the data pipeline in the next step.

The `CREATE TABLE` SQL statement defines the column names, data types, and a {TimeKey} on the `col_timestamp` column in the `data_type_coverage` table. The TimeKey partitions data by time so that the system can skip irrelevant segments during time-filtered queries.

```sql SQL theme={null}
CREATE TABLE loading.data_type_coverage (
    col_bigint     BIGINT,
    col_binary     BINARY(2),
    col_boolean    BOOLEAN,
    col_char       CHAR(64),
    col_date       DATE,
    col_decimal    DECIMAL(18, 4),
    col_double     DOUBLE,
    col_float      FLOAT,
    col_int        INT,
    col_int_array  INT[],
    col_ipv4       IPV4,
    col_ip         IP,
    col_smallint   SMALLINT,
    col_point      POINT,
    col_linestring LINESTRING,
    col_polygon    POLYGON,
    col_time       TIME,
    col_timestamp  TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
    col_tinyint    TINYINT,
    col_tuple      TUPLE<<INT, VARCHAR(255)>> NULL,
    col_uuid       UUID,
    col_varbinary  VARBINARY,
    col_varchar    VARCHAR
);
```

This table intentionally keeps the schema simple to focus on the loading workflow. In production, you should also define a Clustering Key and secondary indexes based on your expected query patterns. For details on table design options, see:

* [TimeKeys and Clustering Keys](/timekeys-and-clustering-keys) — Segment keys that partition and order data for faster queries.
* [Secondary Indexes](/secondary-indexes) — Additional indexes for columns used in filters.
* [Table Compression Options](/table-compression-options) — Compression settings that reduce storage requirements.
* [CREATE TABLE SQL Statement Examples](/create-table-sql-statement-examples) — Examples of table definitions for different use cases.

## Step 4: Create a Data Pipeline

Data pipelines are the way to load data into an Ocient System. Each pipeline is a SQL object that defines the source, data format, and transformations for loading rows into one or more tables. For a full overview of pipeline concepts, see [Load Data](/load-data).

### Use a Sample Data Set

The sample data is a set of gzip-compressed CSV files hosted in a public {AWS} S3 bucket. Each file contains one header row followed by data rows with columns covering every Ocient data type.

### Create the Pipeline

This data pipeline reads CSV files from the S3 source, parses each field using named headers, applies type conversions where needed, and loads the results into the `data_type_coverage` table.

```sql SQL theme={null}
CREATE BATCH PIPELINE data_type_coverage_pipeline
    SOURCE S3
        ENDPOINT 'https://s3.us-east-1.amazonaws.com'
        BUCKET 'ocient-docs'
        PREFIX 'all_data_types/small/'
        FILTER_GLOB '**output*.csv.gz'
        COMPRESSION_METHOD 'gzip'
        SORT_BY 'filename'
    EXTRACT
        FORMAT delimited
        RECORD_DELIMITER e'\n'
        FIELD_DELIMITER ','
        FIELD_OPTIONALLY_ENCLOSED_BY '"'
        NUM_HEADER_LINES 1
        HEADERS [
            'bigint', 'binary', 'boolean', 'char',
            'date', 'decimal', 'double', 'float',
            'int', 'int_array', 'ipv4', 'ip',
            'smallint', 'point', 'linestring', 'polygon',
            'time', 'timestamp', 'tinyint', 'tuple',
            'uuid', 'varbinary', 'varchar'
        ]
        OPEN_ARRAY '['
        CLOSE_ARRAY ']'
        ARRAY_ELEMENT_DELIMITER ','
        OPEN_OBJECT '('
        CLOSE_OBJECT ')'
INTO test.loading.data_type_coverage
SELECT
    $bigint                                          AS col_bigint,
    $binary                                          AS col_binary,
    $boolean                                         AS col_boolean,
    $char                                            AS col_char,
    $date                                            AS col_date,
    $decimal                                         AS col_decimal,
    DOUBLE(IF(LOWER($double) = 'nan',NULL, $double)) AS col_double,
    FLOAT(IF(LOWER($float) = 'nan', NULL, $float))   AS col_float,
    $int                                             AS col_int,
    INT[]($int_array)                                AS col_int_array,
    $ipv4                                            AS col_ipv4,
    $ip                                              AS col_ip,
    $smallint                                        AS col_smallint,
    $point                                           AS col_point,
    $linestring                                      AS col_linestring,
    $polygon                                         AS col_polygon,
    $time                                            AS col_time,
    REPLACE(REPLACE($timestamp, 'T', ' '), 'Z', '')  AS col_timestamp,
    $tinyint                                         AS col_tinyint,
    $tuple                                           AS col_tuple,
    $uuid                                            AS col_uuid,
    VARBINARY(IF($varbinary IS NULL, NULL,
        IF(LENGTH($varbinary) % 2 = 1, CONCAT('0', $varbinary), $varbinary)))
        											 AS col_varbinary,
    $varchar                                         AS col_varchar;
```

The pipeline has three sections:

* `SOURCE` — Identifies the S3 bucket, file path prefix, and glob filter. The `COMPRESSION_METHOD` parameter indicates that the source files are gzip-compressed.
* `EXTRACT` — Defines the CSV format options, including delimiters, quoting, and the header names that the `SELECT` clause references. The `OPEN_ARRAY` and `CLOSE_ARRAY`, and `OPEN_OBJECT` and `CLOSE_OBJECT` parameters tell the parser how array and tuple values are encoded in the CSV data.
* `SELECT` — Maps each source field to a target column. Some fields require explicit type conversion. For example, `DOUBLE($double)` converts the extracted string to a `DOUBLE` value, and the nested `REPLACE` functions strip the `T` and `Z` characters from timestamps to match the Ocient `TIMESTAMP` format. This `VARBINARY` expression pads odd-length hex strings with a leading zero to ensure all rows that are not `NULL` convert to the `VARBINARY` data type.

For details on pipeline syntax and options, see [Data Pipelines](/data-pipelines). For details on supported data formats, see [Data Formats for Data Pipelines](/data-formats-for-data-pipelines). For details on data type handling during loading, see [Data Types for Data Pipelines](/data-types-for-data-pipelines). For the functions, see [Transform Data in Data Pipelines](/transform-data-in-data-pipelines).

## Step 5: Start and Monitor the Pipeline

Start the data pipeline to begin loading data using the `START PIPELINE` SQL statement.

```sql SQL theme={null}
START PIPELINE data_type_coverage_pipeline;
```

Check the pipeline status while it runs.

```sql SQL theme={null}
SHOW PIPELINE_STATUS;
```

Output

```none Text theme={null}
database_name  pipeline_name                  table_names                       status   percent_complete  records_processed  records_loaded  records_failed
-------------- ------------------------------ --------------------------------- -------- ---------------- ------------------ --------------- ---------------
test           data_type_coverage_pipeline    ["loading.data_type_coverage"]    RUNNING  0.0               0                  0               0
```

Wait for the `status` column to show the `COMPLETED` status. You can execute the `SHOW PIPELINE_STATUS` statement again to check progress.

For larger data sets, the `percent_complete`, `records_processed`, and `records_loaded` columns update as the pipeline progresses.

For more detailed monitoring options, including system catalog tables and metrics endpoints, see [Monitor Data Pipelines](/monitor-data-pipelines).

If the pipeline enters the `FAILED` status, see [Manage Errors in Data Pipelines](/manage-errors-in-data-pipelines) and [Data Pipeline Loading Errors](/data-pipeline-loading-errors).

## Step 6: Validate the Loaded Data

After the pipeline completes, verify that the data loaded correctly before you begin analysis.

### Verify the Row Count

Confirm that the number of loaded rows matches the pipeline metrics.

```sql SQL theme={null}
SELECT COUNT(*) AS total_rows
FROM loading.data_type_coverage;
```

Cross-reference this count with the `records_loaded` value from the `SHOW PIPELINE_STATUS` statement.

### Check the Time Range

Verify that the `col_timestamp` column covers the expected date range. Gaps in the range can indicate missing source files.

```sql SQL theme={null}
SELECT
    MIN(col_timestamp) AS earliest,
    MAX(col_timestamp) AS latest,
    COUNT(DISTINCT CAST(col_timestamp AS DATE)) AS distinct_days
FROM loading.data_type_coverage;
```

### Inspect for NULL Values

Check for unexpected NULL values in columns that should be populated.

```sql SQL theme={null}
SELECT
    COUNT(*)             AS total_rows,
    COUNT(col_bigint)    AS non_null_bigint,
    COUNT(col_double)    AS non_null_double,
    COUNT(col_timestamp) AS non_null_timestamp,
    COUNT(col_varchar)   AS non_null_varchar,
    COUNT(col_int_array) AS non_null_int_array,
    COUNT(col_tuple)     AS non_null_tuple
FROM loading.data_type_coverage;
```

If a column has significantly fewer non-NULL values than expected, review the source data and the pipeline `SELECT` transformations for parsing issues.

### Sample Records

Review a small set of records to confirm that type conversions applied correctly using the `LIMIT` keyword.

```sql SQL theme={null}
SELECT *
FROM loading.data_type_coverage
LIMIT 5;
```

## Step 7: Query the Data

With validated data in the table, you can begin running analytic queries. These examples demonstrate common query patterns across different data types.

### Aggregate Numeric Data

Compute summary statistics on the numeric columns. In this case, retrieve the count of rows, the minimum number of the `col_int` column, the maximum number of the `col_int` column, and the average of the numbers in the `col_smallint` column.

```sql SQL theme={null}
SELECT
    COUNT(*)           AS total_rows,
    MIN(col_int)       AS min_int,
    MAX(col_int)       AS max_int,
    AVG(col_smallint)  AS avg_smallint
FROM loading.data_type_coverage;
```

### Filter by Time Range

Use the `col_timestamp` TimeKey column to filter data efficiently. The Ocient System skips segments outside the specified time range of January 2024.

```sql SQL theme={null}
SELECT
    CAST(col_timestamp AS DATE) AS day,
    COUNT(*)                    AS row_count,
    COUNT(DISTINCT col_int)     AS distinct_ints
FROM loading.data_type_coverage
WHERE col_timestamp >= TIMESTAMP '2024-01-01 00:00:00'
  AND col_timestamp <  TIMESTAMP '2024-02-01 00:00:00'
GROUP BY CAST(col_timestamp AS DATE)
ORDER BY day;
```

### Query Geospatial Data

Inspect the geospatial columns loaded into the table for the first 10 rows using the `LIMIT` keyword.

```sql SQL theme={null}
SELECT
    col_point,
    col_linestring,
    col_polygon
FROM loading.data_type_coverage
WHERE col_point IS NOT NULL
LIMIT 10;
```

For details on geospatial functions and spatial queries, see [Geospatial Functions](/geospatial-functions).

### Explore Data Type Distributions

Summarize the distinct values and NULL rates for each column type.

```sql SQL theme={null}
SELECT
    COUNT(DISTINCT col_boolean)  AS distinct_boolean,
    COUNT(DISTINCT col_tinyint)  AS distinct_tinyint,
    COUNT(DISTINCT col_smallint) AS distinct_smallint,
    COUNT(DISTINCT col_int)      AS distinct_int,
    COUNT(DISTINCT col_uuid)     AS distinct_uuid
FROM loading.data_type_coverage;
```

For a full reference of SQL functions, see [Functions Overview](/functions-overview). For tips on writing queries that leverage segment keys and indexes, see [Query Performance Tuning](/query-performance-tuning).

### Remove the Data Pipeline and Table

After you finish exploring the sample data, remove the `data_type_coverage_pipeline` pipeline.

```sql SQL theme={null}
DROP PIPELINE data_type_coverage_pipeline;
```

Remove the `data_type_coverage` table.

```sql SQL theme={null}
DROP TABLE loading.data_type_coverage;
```

Removing a pipeline removes the pipeline metadata from the system but does not affect the data already loaded into the target table. Whereas removing the table removes the data.

## Related Links

[Data Pipelines](/data-pipelines)

[Tables](/tables)

[Understanding Data Types](/understanding-data-types)

[SQL Reference](/sql-reference)
