Load and Analyze Data
This tutorial details the core workflow from start to finish: you connect to the system, create a database and table, load sample data from a source, validate the loaded data, and execute analytic queries. The example uses a sample data set that covers every supported Ocient data type. By the end of this process, you have a working table with loaded data that you can query immediately.

Prerequisites

The tutorial requires:

- Network access to the SQL nodes in your Ocient System and the IP address or hostname of at least one SQL node.
- A valid username and password with permissions to create databases, tables, and data pipelines.
- The Ocient JDBC driver and CLI installed on your local machine. For setup instructions, see Connect Using JDBC. If you do not have the JDBC driver installed, you can also connect using pyocient, the Ocient driver for Python. For setup instructions, see Connect Using PyOcient.

Step 1: Connect to the Ocient System

Start the JDBC CLI and connect to a SQL node. Replace the <hostname or ip> and <port> placeholders with the values for your system. The default SQL node port is 4050.

```sql
CONNECT TO jdbc:ocient://<hostname or ip>:<port>/system;
```

After a successful connection, you see the Ocient CLI prompt:

```
ocient>
```

For details on connection methods and troubleshooting, see Connect to Ocient.

Step 2: Create a Database and Schema

Create a database to hold your sample data. Using a dedicated database keeps your work isolated from other applications on the same system.

```sql
CREATE DATABASE test;
```

Switch your connection to the new database.

```sql
CONNECT TO jdbc:ocient://<hostname or ip>:<port>/test;
```

Create a schema within the database to organize your tables.

```sql
CREATE SCHEMA loading;
```

For details on managing databases and schemas, see Databases and Schemas.

Step 3: Create a Table

Create a table that covers all supported Ocient data types. This table serves as the target for the data pipeline in the next step. The CREATE TABLE SQL statement defines the column names, the data types, and a timekey on the col_timestamp column of the data_type_coverage table. The timekey partitions data by time so that the system can skip irrelevant segments during time-filtered queries.

```sql
CREATE TABLE loading.data_type_coverage (
  col_bigint     BIGINT,
  col_binary     BINARY(2),
  col_boolean    BOOLEAN,
  col_char       CHAR(64),
  col_date       DATE,
  col_decimal    DECIMAL(18, 4),
  col_double     DOUBLE,
  col_float      FLOAT,
  col_int        INT,
  col_int_array  INT[],
  col_ipv4       IPV4,
  col_ip         IP,
  col_smallint   SMALLINT,
  col_point      POINT,
  col_linestring LINESTRING,
  col_polygon    POLYGON,
  col_time       TIME,
  col_timestamp  TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  col_tinyint    TINYINT,
  col_tuple      TUPLE<<INT, VARCHAR(255)>> NULL,
  col_uuid       UUID,
  col_varbinary  VARBINARY,
  col_varchar    VARCHAR
);
```

This table intentionally keeps the schema simple to focus on the loading workflow. In production, you should also define a clustering key and secondary indexes based on your expected query patterns. For details on table design options, see:

- Timekeys and Clustering Keys: segment keys that partition and order data for faster queries.
- Secondary Indexes: additional indexes for columns used in filters.
- Table Compression Options: compression settings that reduce storage requirements.
- CREATE TABLE SQL Statement Examples: examples of table definitions for different use cases.
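If you prefer to script this setup instead of typing statements into the JDBC CLI, the same statements can be run through pyocient, the Python driver mentioned in the prerequisites. The sketch below is illustrative only: it assumes pyocient follows the standard Python DB-API (connect, cursor, execute) and accepts an ocient:// connection string, and the credentials and abbreviated DDL are placeholders rather than values from this tutorial.

```python
# Hypothetical setup script using pyocient (Python DB-API style).
# Assumes an "ocient://user:password@host:port/database" connection string;
# check the pyocient documentation for the exact connection options.
import pyocient

# Step 1: connect to the system database on a SQL node (default port 4050).
conn = pyocient.connect("ocient://user:password@<hostname or ip>:4050/system")
cur = conn.cursor()

# Step 2: create the database, then reconnect to it and create the schema.
cur.execute("CREATE DATABASE test")
conn.close()

conn = pyocient.connect("ocient://user:password@<hostname or ip>:4050/test")
cur = conn.cursor()
cur.execute("CREATE SCHEMA loading")

# Step 3: create the target table. The DDL is abbreviated here; use the full
# column list from the CREATE TABLE statement in Step 3 above.
cur.execute("""
    CREATE TABLE loading.data_type_coverage (
      col_bigint    BIGINT,
      col_timestamp TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
      col_varchar   VARCHAR
    )
""")
conn.close()
```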
Step 4: Create a Data Pipeline

Data pipelines are the way to load data into an Ocient System. Each pipeline is a SQL object that defines the source, data format, and transformations for loading rows into one or more tables. For a full overview of pipeline concepts, see Load Data.

Use a Sample Data Set

The sample data is a set of gzip-compressed CSV files hosted in a public S3 bucket. Each file contains one header row followed by data rows with columns covering every Ocient data type.

Create the Pipeline

This data pipeline reads CSV files from the S3 source, parses each field using named headers, applies type conversions where needed, and loads the results into the data_type_coverage table.

```sql
CREATE BATCH PIPELINE data_type_coverage_pipeline
SOURCE S3
  ENDPOINT 'https://s3.us-east-1.amazonaws.com'
  BUCKET 'ocient-docs'
  PREFIX 'all-data-types/small/'
  FILTER GLOB '*output.csv.gz'
  COMPRESSION METHOD 'gzip'
  SORT BY 'filename'
EXTRACT
  FORMAT DELIMITED
    RECORD DELIMITER e'\n'
    FIELD DELIMITER ','
    FIELD OPTIONALLY ENCLOSED BY '"'
    NUM HEADER LINES 1
    HEADERS [
      'bigint', 'binary', 'boolean', 'char', 'date', 'decimal', 'double',
      'float', 'int', 'int_array', 'ipv4', 'ip', 'smallint', 'point',
      'linestring', 'polygon', 'time', 'timestamp', 'tinyint', 'tuple',
      'uuid', 'varbinary', 'varchar'
    ]
    OPEN ARRAY '['
    CLOSE ARRAY ']'
    ARRAY ELEMENT DELIMITER ','
    OPEN OBJECT '('
    CLOSE OBJECT ')'
INTO test.loading.data_type_coverage
SELECT
  $bigint AS col_bigint,
  $binary AS col_binary,
  $boolean AS col_boolean,
  $char AS col_char,
  $date AS col_date,
  $decimal AS col_decimal,
  double(if(lower($double) = 'nan', NULL, $double)) AS col_double,
  float(if(lower($float) = 'nan', NULL, $float)) AS col_float,
  $int AS col_int,
  int[]($int_array) AS col_int_array,
  $ipv4 AS col_ipv4,
  $ip AS col_ip,
  $smallint AS col_smallint,
  $point AS col_point,
  $linestring AS col_linestring,
  $polygon AS col_polygon,
  $time AS col_time,
  replace(replace($timestamp, 't', ' '), 'z', '') AS col_timestamp,
  $tinyint AS col_tinyint,
  $tuple AS col_tuple,
  $uuid AS col_uuid,
  varbinary(if($varbinary IS NULL, NULL,
    if(length($varbinary) % 2 = 1, concat('0', $varbinary), $varbinary))) AS col_varbinary,
  $varchar AS col_varchar;
```

The pipeline has three sections:

- SOURCE: identifies the S3 bucket, file path prefix, and glob filter. The COMPRESSION METHOD parameter indicates that the source files are gzip-compressed.
- EXTRACT: defines the CSV format options, including delimiters, quoting, and the header names that the SELECT clause references. The OPEN ARRAY and CLOSE ARRAY parameters and the OPEN OBJECT and CLOSE OBJECT parameters tell the parser how array and tuple values are encoded in the CSV data.
- SELECT: maps each source field to a target column. Some fields require explicit type conversion. For example, the double() expression converts the extracted string to a DOUBLE value (mapping the string 'nan' to NULL), and the nested replace functions strip the T and Z characters from timestamps to match the Ocient timestamp format. The varbinary expression pads odd-length hex strings with a leading zero so that every non-NULL value converts to the VARBINARY data type.

For details on pipeline syntax and options, see Data Pipelines. For details on supported data formats, see Data Formats for Data Pipelines. For details on data type handling during loading, see Data Types for Data Pipelines. For the transformation functions, see Transform Data in Data Pipelines.
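To make the SELECT transformations concrete, the following Python sketch applies the same logic to raw CSV field values outside of any pipeline. The helper functions are hypothetical and exist only for illustration; they are not part of pyocient or the Ocient pipeline syntax.

```python
# Illustrative only: plain-Python equivalents of the transformations that the
# pipeline SELECT clause applies to raw CSV fields. Not part of any Ocient API.

def to_double(raw):
    """Mirror double(if(lower($double) = 'nan', NULL, $double))."""
    if raw is None or raw.lower() == "nan":
        return None
    return float(raw)

def to_timestamp(raw):
    """Mirror replace(replace($timestamp, 't', ' '), 'z', ''): strip the
    ISO 8601 T and Z so the value matches the Ocient timestamp format."""
    return raw.replace("T", " ").replace("Z", "").replace("t", " ").replace("z", "")

def to_varbinary_hex(raw):
    """Mirror the varbinary expression: pad odd-length hex strings with a
    leading zero so every non-NULL value converts cleanly to VARBINARY."""
    if raw is None:
        return None
    return "0" + raw if len(raw) % 2 == 1 else raw

print(to_double("NaN"))                       # None
print(to_timestamp("2024-01-15T08:30:00Z"))   # 2024-01-15 08:30:00
print(to_varbinary_hex("abc"))                # 0abc
```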
Step 5: Start and Monitor the Pipeline

Start the data pipeline to begin loading data using the START PIPELINE SQL statement.

```sql
START PIPELINE data_type_coverage_pipeline;
```

Check the pipeline status while it runs.

```sql
SHOW PIPELINE STATUS;
```

Output:

```
database_name | pipeline_name               | table_names                    | status  | percent_complete | records_processed | records_loaded | records_failed
test          | data_type_coverage_pipeline | ["loading.data_type_coverage"] | RUNNING | 0.0              | 0                 | 0              | 0
```

Wait for the status column to show the COMPLETED status. You can execute the SHOW PIPELINE STATUS statement again to check progress. For larger data sets, the percent_complete, records_processed, and records_loaded columns update as the pipeline progresses.

For more detailed monitoring options, including system catalog tables and metrics endpoints, see Monitor Data Pipelines. If the pipeline enters the FAILED status, see Manage Errors in Data Pipelines and Data Pipeline Loading Errors.

Step 6: Validate the Loaded Data

After the pipeline completes, verify that the data loaded correctly before you begin analysis.

Verify the Row Count

Confirm that the number of loaded rows matches the pipeline metrics.

```sql
SELECT count(*) AS total_rows FROM loading.data_type_coverage;
```

Cross-reference this count with the records_loaded value from the SHOW PIPELINE STATUS statement.

Check the Time Range

Verify that the col_timestamp column covers the expected date range. Gaps in the range can indicate missing source files.

```sql
SELECT
  min(col_timestamp) AS earliest,
  max(col_timestamp) AS latest,
  count(DISTINCT CAST(col_timestamp AS DATE)) AS distinct_days
FROM loading.data_type_coverage;
```

Inspect for NULL Values

Check for unexpected NULL values in columns that should be populated.

```sql
SELECT
  count(*) AS total_rows,
  count(col_bigint) AS non_null_bigint,
  count(col_double) AS non_null_double,
  count(col_timestamp) AS non_null_timestamp,
  count(col_varchar) AS non_null_varchar,
  count(col_int_array) AS non_null_int_array,
  count(col_tuple) AS non_null_tuple
FROM loading.data_type_coverage;
```

If a column has significantly fewer non-NULL values than expected, review the source data and the pipeline SELECT transformations for parsing issues.

Sample Records

Review a small set of records to confirm that type conversions applied correctly, using the LIMIT keyword.

```sql
SELECT * FROM loading.data_type_coverage LIMIT 5;
```
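If you monitor the load from a script rather than the CLI, the polling loop and row-count cross-check can be combined. The sketch below is a hedged example using pyocient: it assumes the driver follows the Python DB-API, that SHOW PIPELINE STATUS results can be fetched through a cursor, and that the columns arrive in the order shown in the output above; verify these details against your driver and system version.

```python
# Hypothetical monitoring and validation script using pyocient (DB-API style).
# Column positions below assume the SHOW PIPELINE STATUS layout shown above:
# 0 database_name, 1 pipeline_name, 2 table_names, 3 status,
# 4 percent_complete, 5 records_processed, 6 records_loaded, 7 records_failed.
import time
import pyocient

conn = pyocient.connect("ocient://user:password@<hostname or ip>:4050/test")
cur = conn.cursor()

# Step 5: poll until the pipeline reaches a terminal status.
while True:
    cur.execute("SHOW PIPELINE STATUS")
    rows = {row[1]: row for row in cur.fetchall()}  # keyed by pipeline_name
    status_row = rows["data_type_coverage_pipeline"]
    status, records_loaded = str(status_row[3]), status_row[6]
    print(f"status={status} records_loaded={records_loaded}")
    if status.upper() in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

# Step 6: cross-check the table row count against the pipeline metric.
cur.execute("SELECT count(*) FROM loading.data_type_coverage")
total_rows = cur.fetchone()[0]
print(f"table rows: {total_rows}, pipeline records_loaded: {records_loaded}")
conn.close()
```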
Step 7: Query the Data

With validated data in the table, you can begin running analytic queries. These examples demonstrate common query patterns across different data types.

Aggregate Numeric Data

Compute summary statistics on the numeric columns. In this case, retrieve the row count, the minimum and maximum values of the col_int column, and the average of the values in the col_smallint column.

```sql
SELECT
  count(*) AS total_rows,
  min(col_int) AS min_int,
  max(col_int) AS max_int,
  avg(col_smallint) AS avg_smallint
FROM loading.data_type_coverage;
```

Filter by Time Range

Use the col_timestamp timekey column to filter data efficiently. The Ocient System skips segments outside the specified time range, in this case January 2024.

```sql
SELECT
  CAST(col_timestamp AS DATE) AS day,
  count(*) AS row_count,
  count(DISTINCT col_int) AS distinct_ints
FROM loading.data_type_coverage
WHERE col_timestamp >= TIMESTAMP '2024-01-01 00:00:00'
  AND col_timestamp < TIMESTAMP '2024-02-01 00:00:00'
GROUP BY CAST(col_timestamp AS DATE)
ORDER BY day;
```

Query Geospatial Data

Inspect the geospatial columns loaded into the table for the first 10 rows, using the LIMIT keyword.

```sql
SELECT col_point, col_linestring, col_polygon
FROM loading.data_type_coverage
WHERE col_point IS NOT NULL
LIMIT 10;
```

For details on geospatial functions and spatial queries, see Geospatial Functions.

Explore Data Type Distributions

Summarize the number of distinct values in several columns.

```sql
SELECT
  count(DISTINCT col_boolean) AS distinct_boolean,
  count(DISTINCT col_tinyint) AS distinct_tinyint,
  count(DISTINCT col_smallint) AS distinct_smallint,
  count(DISTINCT col_int) AS distinct_int,
  count(DISTINCT col_uuid) AS distinct_uuid
FROM loading.data_type_coverage;
```

For a full reference of SQL functions, see Functions Overview. For tips on writing queries that leverage segment keys and indexes, see Query Performance Tuning.

Remove the Data Pipeline and Table

After you finish exploring the sample data, remove the data_type_coverage_pipeline pipeline.

```sql
DROP PIPELINE data_type_coverage_pipeline;
```

Remove the data_type_coverage table.

```sql
DROP TABLE loading.data_type_coverage;
```

Removing a pipeline removes the pipeline metadata from the system but does not affect the data already loaded into the target table. Removing the table removes the loaded data as well.

Related Links

- Data Pipelines
- Tables
- Understanding Data Types
- SQL Reference