> ## Documentation Index > Fetch the complete documentation index at: https://docs.ocient.com/llms.txt > Use this file to discover all available pages before exploring further. # Load and Analyze Data export const TimeKey = "TimeKey®"; export const Python = "Python®"; export const Ocient = "Ocient®"; export const AWS = "Amazon® Web Services℠ (AWS℠)"; This tutorial details the core {Ocient} workflow from start to finish. You connect to the system, create a database and table, load sample data from a source, validate the loaded data, and execute analytic queries. The example uses a sample data set that covers every supported Ocient data type. By the end of this process, you have a working table with loaded data that you can query immediately. ## Prerequisites The tutorial requires: * Network access to the SQL Nodes in your Ocient System and the IP address or hostname of at least one SQL Node. * A valid username and password with permissions to create databases, tables, and data pipelines. * You have installed the Ocient JDBC driver and CLI on your local machine. For setup instructions, see [Connect Using JDBC](/connect-using-jdbc). If you do not have the JDBC driver installed, you can also connect using `pyocient`, the Ocient {Python} driver. For setup instructions, see [Connect Using pyocient](/connect-using-pyocient). ## Step 1: Connect to the Ocient System Start the JDBC CLI and connect to a SQL Node. Replace the hostname `` and port `` placeholders with the values for your system. The default SQL Node port is `4050`. ```sql SQL theme={null} CONNECT TO jdbc:ocient://:/system; ``` After a successful connection, you see the Ocient CLI prompt. ```shell Shell theme={null} Ocient> _ ``` For details on connection methods and troubleshooting, see [Connect to Ocient](/connect-to-ocient). ## Step 2: Create a Database and Schema Create a database to hold your sample data. Using a dedicated database keeps your work isolated from other applications on the same system. ```sql SQL theme={null} CREATE DATABASE test; ``` Switch your connection to the new database. ```sql SQL theme={null} CONNECT TO jdbc:ocient://:/test; ``` Create a schema within the database to organize your tables. ```sql SQL theme={null} CREATE SCHEMA loading; ``` For details on managing databases and schemas, see [Databases](/databases) and [Schemas](/schemas). ## Step 3: Create a Table Create a table that covers all supported Ocient data types. This table serves as the target for the data pipeline in the next step. The `CREATE TABLE` SQL statement defines the column names, data types, and a {TimeKey} on the `col_timestamp` column in the `data_type_coverage` table. The TimeKey partitions data by time so that the system can skip irrelevant segments during time-filtered queries. ```sql SQL theme={null} CREATE TABLE loading.data_type_coverage ( col_bigint BIGINT, col_binary BINARY(2), col_boolean BOOLEAN, col_char CHAR(64), col_date DATE, col_decimal DECIMAL(18, 4), col_double DOUBLE, col_float FLOAT, col_int INT, col_int_array INT[], col_ipv4 IPV4, col_ip IP, col_smallint SMALLINT, col_point POINT, col_linestring LINESTRING, col_polygon POLYGON, col_time TIME, col_timestamp TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL, col_tinyint TINYINT, col_tuple TUPLE<> NULL, col_uuid UUID, col_varbinary VARBINARY, col_varchar VARCHAR ); ``` This table intentionally keeps the schema simple to focus on the loading workflow. In production, you should also define a Clustering Key and secondary indexes based on your expected query patterns. For details on table design options, see: * [TimeKeys and Clustering Keys](/timekeys-and-clustering-keys) — Segment keys that partition and order data for faster queries. * [Secondary Indexes](/secondary-indexes) — Additional indexes for columns used in filters. * [Table Compression Options](/table-compression-options) — Compression settings that reduce storage requirements. * [CREATE TABLE SQL Statement Examples](/create-table-sql-statement-examples) — Examples of table definitions for different use cases. ## Step 4: Create a Data Pipeline Data pipelines are the way to load data into an Ocient System. Each pipeline is a SQL object that defines the source, data format, and transformations for loading rows into one or more tables. For a full overview of pipeline concepts, see [Load Data](/load-data). ### Use a Sample Data Set The sample data is a set of gzip-compressed CSV files hosted in a public {AWS} S3 bucket. Each file contains one header row followed by data rows with columns covering every Ocient data type. ### Create the Pipeline This data pipeline reads CSV files from the S3 source, parses each field using named headers, applies type conversions where needed, and loads the results into the `data_type_coverage` table. ```sql SQL theme={null} CREATE BATCH PIPELINE data_type_coverage_pipeline SOURCE S3 ENDPOINT 'https://s3.us-east-1.amazonaws.com' BUCKET 'ocient-docs' PREFIX 'all_data_types/small/' FILTER_GLOB '**output*.csv.gz' COMPRESSION_METHOD 'gzip' SORT_BY 'filename' EXTRACT FORMAT delimited RECORD_DELIMITER e'\n' FIELD_DELIMITER ',' FIELD_OPTIONALLY_ENCLOSED_BY '"' NUM_HEADER_LINES 1 HEADERS [ 'bigint', 'binary', 'boolean', 'char', 'date', 'decimal', 'double', 'float', 'int', 'int_array', 'ipv4', 'ip', 'smallint', 'point', 'linestring', 'polygon', 'time', 'timestamp', 'tinyint', 'tuple', 'uuid', 'varbinary', 'varchar' ] OPEN_ARRAY '[' CLOSE_ARRAY ']' ARRAY_ELEMENT_DELIMITER ',' OPEN_OBJECT '(' CLOSE_OBJECT ')' INTO test.loading.data_type_coverage SELECT $bigint AS col_bigint, $binary AS col_binary, $boolean AS col_boolean, $char AS col_char, $date AS col_date, $decimal AS col_decimal, DOUBLE(IF(LOWER($double) = 'nan',NULL, $double)) AS col_double, FLOAT(IF(LOWER($float) = 'nan', NULL, $float)) AS col_float, $int AS col_int, INT[]($int_array) AS col_int_array, $ipv4 AS col_ipv4, $ip AS col_ip, $smallint AS col_smallint, $point AS col_point, $linestring AS col_linestring, $polygon AS col_polygon, $time AS col_time, REPLACE(REPLACE($timestamp, 'T', ' '), 'Z', '') AS col_timestamp, $tinyint AS col_tinyint, $tuple AS col_tuple, $uuid AS col_uuid, VARBINARY(IF($varbinary IS NULL, NULL, IF(LENGTH($varbinary) % 2 = 1, CONCAT('0', $varbinary), $varbinary))) AS col_varbinary, $varchar AS col_varchar; ``` The pipeline has three sections: * `SOURCE` — Identifies the S3 bucket, file path prefix, and glob filter. The `COMPRESSION_METHOD` parameter indicates that the source files are gzip-compressed. * `EXTRACT` — Defines the CSV format options, including delimiters, quoting, and the header names that the `SELECT` clause references. The `OPEN_ARRAY` and `CLOSE_ARRAY`, and `OPEN_OBJECT` and `CLOSE_OBJECT` parameters tell the parser how array and tuple values are encoded in the CSV data. * `SELECT` — Maps each source field to a target column. Some fields require explicit type conversion. For example, `DOUBLE($double)` converts the extracted string to a `DOUBLE` value, and the nested `REPLACE` functions strip the `T` and `Z` characters from timestamps to match the Ocient `TIMESTAMP` format. This `VARBINARY` expression pads odd-length hex strings with a leading zero to ensure all rows that are not `NULL` convert to the `VARBINARY` data type. For details on pipeline syntax and options, see [Data Pipelines](/data-pipelines). For details on supported data formats, see [Data Formats for Data Pipelines](/data-formats-for-data-pipelines). For details on data type handling during loading, see [Data Types for Data Pipelines](/data-types-for-data-pipelines). For the functions, see [Transform Data in Data Pipelines](/transform-data-in-data-pipelines). ## Step 5: Start and Monitor the Pipeline Start the data pipeline to begin loading data using the `START PIPELINE` SQL statement. ```sql SQL theme={null} START PIPELINE data_type_coverage_pipeline; ``` Check the pipeline status while it runs. ```sql SQL theme={null} SHOW PIPELINE_STATUS; ``` Output ```none Text theme={null} database_name pipeline_name table_names status percent_complete records_processed records_loaded records_failed -------------- ------------------------------ --------------------------------- -------- ---------------- ------------------ --------------- --------------- test data_type_coverage_pipeline ["loading.data_type_coverage"] RUNNING 0.0 0 0 0 ``` Wait for the `status` column to show the `COMPLETED` status. You can execute the `SHOW PIPELINE_STATUS` statement again to check progress. For larger data sets, the `percent_complete`, `records_processed`, and `records_loaded` columns update as the pipeline progresses. For more detailed monitoring options, including system catalog tables and metrics endpoints, see [Monitor Data Pipelines](/monitor-data-pipelines). If the pipeline enters the `FAILED` status, see [Manage Errors in Data Pipelines](/manage-errors-in-data-pipelines) and [Data Pipeline Loading Errors](/data-pipeline-loading-errors). ## Step 6: Validate the Loaded Data After the pipeline completes, verify that the data loaded correctly before you begin analysis. ### Verify the Row Count Confirm that the number of loaded rows matches the pipeline metrics. ```sql SQL theme={null} SELECT COUNT(*) AS total_rows FROM loading.data_type_coverage; ``` Cross-reference this count with the `records_loaded` value from the `SHOW PIPELINE_STATUS` statement. ### Check the Time Range Verify that the `col_timestamp` column covers the expected date range. Gaps in the range can indicate missing source files. ```sql SQL theme={null} SELECT MIN(col_timestamp) AS earliest, MAX(col_timestamp) AS latest, COUNT(DISTINCT CAST(col_timestamp AS DATE)) AS distinct_days FROM loading.data_type_coverage; ``` ### Inspect for NULL Values Check for unexpected NULL values in columns that should be populated. ```sql SQL theme={null} SELECT COUNT(*) AS total_rows, COUNT(col_bigint) AS non_null_bigint, COUNT(col_double) AS non_null_double, COUNT(col_timestamp) AS non_null_timestamp, COUNT(col_varchar) AS non_null_varchar, COUNT(col_int_array) AS non_null_int_array, COUNT(col_tuple) AS non_null_tuple FROM loading.data_type_coverage; ``` If a column has significantly fewer non-NULL values than expected, review the source data and the pipeline `SELECT` transformations for parsing issues. ### Sample Records Review a small set of records to confirm that type conversions applied correctly using the `LIMIT` keyword. ```sql SQL theme={null} SELECT * FROM loading.data_type_coverage LIMIT 5; ``` ## Step 7: Query the Data With validated data in the table, you can begin running analytic queries. These examples demonstrate common query patterns across different data types. ### Aggregate Numeric Data Compute summary statistics on the numeric columns. In this case, retrieve the count of rows, the minimum number of the `col_int` column, the maximum number of the `col_int` column, and the average of the numbers in the `col_smallint` column. ```sql SQL theme={null} SELECT COUNT(*) AS total_rows, MIN(col_int) AS min_int, MAX(col_int) AS max_int, AVG(col_smallint) AS avg_smallint FROM loading.data_type_coverage; ``` ### Filter by Time Range Use the `col_timestamp` TimeKey column to filter data efficiently. The Ocient System skips segments outside the specified time range of January 2024. ```sql SQL theme={null} SELECT CAST(col_timestamp AS DATE) AS day, COUNT(*) AS row_count, COUNT(DISTINCT col_int) AS distinct_ints FROM loading.data_type_coverage WHERE col_timestamp >= TIMESTAMP '2024-01-01 00:00:00' AND col_timestamp < TIMESTAMP '2024-02-01 00:00:00' GROUP BY CAST(col_timestamp AS DATE) ORDER BY day; ``` ### Query Geospatial Data Inspect the geospatial columns loaded into the table for the first 10 rows using the `LIMIT` keyword. ```sql SQL theme={null} SELECT col_point, col_linestring, col_polygon FROM loading.data_type_coverage WHERE col_point IS NOT NULL LIMIT 10; ``` For details on geospatial functions and spatial queries, see [Geospatial Functions](/geospatial-functions). ### Explore Data Type Distributions Summarize the distinct values and NULL rates for each column type. ```sql SQL theme={null} SELECT COUNT(DISTINCT col_boolean) AS distinct_boolean, COUNT(DISTINCT col_tinyint) AS distinct_tinyint, COUNT(DISTINCT col_smallint) AS distinct_smallint, COUNT(DISTINCT col_int) AS distinct_int, COUNT(DISTINCT col_uuid) AS distinct_uuid FROM loading.data_type_coverage; ``` For a full reference of SQL functions, see [Functions Overview](/functions-overview). For tips on writing queries that leverage segment keys and indexes, see [Query Performance Tuning](/query-performance-tuning). ### Remove the Data Pipeline and Table After you finish exploring the sample data, remove the `data_type_coverage_pipeline` pipeline. ```sql SQL theme={null} DROP PIPELINE data_type_coverage_pipeline; ``` Remove the `data_type_coverage` table. ```sql SQL theme={null} DROP TABLE loading.data_type_coverage; ``` Removing a pipeline removes the pipeline metadata from the system but does not affect the data already loaded into the target table. Whereas removing the table removes the data. ## Related Links [Data Pipelines](/data-pipelines) [Tables](/tables) [Understanding Data Types](/understanding-data-types) [SQL Reference](/sql-reference)