Time Keys, Clustering Keys, and Indexes

This guide provides an overview of the roles of different types of keys and indexes in 
﻿. To illustrate these concepts, this guide will refer to the following example table representing users interacting with applications:
event_time
event_id
location
…
user_id (CK1)
app_id (CK2)
device_id (CK3)
9031994
93-125-3235
Indianapolis
…
918
8334
09-325
9032034
58-392-0921
Chicago
…
918
8335
88-042
9231994
48-132-0932
Cincinnati
…
1312
9115
13-743
12252021
11-054-8754
Chicago
…
1315
401
88-042
Keys in Ocient
The first engine developed for the database is a time cluster storage engine. It allows for very high retrieval rates of time clustered data. Data is stored in a unit of storage called “segments” along with metadata that is used by queries to optimize performance. To understand the tradeoffs when setting up a table it is first beneficial to understand how the data is retrieved when Ocient processes a query.
Step One - TimeKey
Most queries will include some level of time filtering (on the column chosen to be the 
﻿). Ocient begins with coarse filtering on this column (event_time in the table) because each segment incorporates a min/max time value for all rows stored in it. Using this, Ocient can exclude segments without issuing any I/O on the data.
Step Two - The Cluster Key
After the initial time filtering step Ocient takes advantage of the clustered nature of the data on disk to isolate records further. The Cluster Key (CK) physically orders data so Ocient can locate any records that have the same key value very quickly. For this reason, one can only create a single group of cluster key columns. In the example table, note the first cluster key column CK1 (user_id) is ordered, and the paired cluster key column CK2 (app_id) is ordered for each matching value in CK1 (user_id) (i.e. because there are two records with user_id 918, the row containing app_id 8334 is ordered before the row containing app_id 8335).
Choosing cluster key columns that are heavily utilized in queries is pivotal to optimizing query performance.
When you define the cluster key index, you must specify columns in a contiguous and ascending order. The columns must be specified in the order of their location and order in the definition of the CREATE TABLE SQL statement.
By filtering on the chosen TimeKey and reading clustered data off of disk using Cluster Key Columns, Ocient can quickly retrieve rows for further query processing.
Indexes in Ocient
In addition to these keys, there are other indexes a user can create on a table to improve performance:
Secondary Index
Secondary indexes allow for improved performance when queries do not use the Clustering Key (CK). With a Clustering Key (CK1, CK2, CK3), the primary CK index can be used to efficiently satisfy query filters including CK1, (CK1,CK2), or (CK1,CK2,CK3) but not subsets that don’t include CK1 like (CK2,CK3) or (CK3,CK2). Secondary CK indexes make this possible. The ordering of these indexes is typically based on the column combinations expected in queries, taking clustering and cardinality knowledge of the data into account to create the most efficient indexes.
Secondary indexes do NOT change the way data is clustered or stored on disk. Instead they build a new mapping to data locations already on disk. As a result, data will often be read across many different segments, as opposed to consolidating reads on large spans of disk based on the Clustering Key. Consequently, secondary indexes are less efficient than using the primary cluster key index in most situations.
Two types of secondary indexes are available, depending on the column(s) on which they’re created:
Cluster Key derived secondary indexes (these are referred to as “Secondary Cluster Key Indexes”)
Non-Cluster Key column secondary indexes
Secondary Cluster Key Indexes
Secondary Cluster Key Indexes can be derived from the Clustering Key index by either reordering the columns in the Clustering Key or using a subset of the Clustering Key columns. These indexes can only have columns that are in the primary cluster key index.
Reordered Clustering Key
While Clustering Keys define how the data is clustered on disk, a Secondary Cluster Key Index can improve query performance when data must be traversed in a different order than the Clustering Key. For example, if you defined a Secondary Cluster Key Index on the table as (CK3, CK1, CK2) all the device_id’s could be traversed in order. This allows for better query performance on queries that filter for specific device_ids and user_ids but are not restricting CK1 (app_id). Because Ocient will have an index indicating which segments contain data for specific values such as device_id == '88-042' AND user_id >= 918, Ocient can skip reading unmatched segments and deliver faster performance.
Subset of Clustering Key Columns
A user could also define a new Secondary Cluster Key Index by taking a subset containing just CK2 (app_id) if traversing the data on only the app_id, without CK1 (user_id) is used frequently in queries.
When to use a subset vs. a reordering of the cluster key columns to create an index:
A reordering of the cluster keys can be beneficial if the query filtering and selection usually contains all cluster keys but the entire cluster key is not used for filtering. For example, if queries frequently filter device_id and user_id at the same time and you want to retrieve the app_id, you could leverage a secondary cluster key index of (CK3, CK1, CK2). This is the case because you first can look at which rows have the required device_id values in your filter, which gives the user_id values for those device_ids. Then the result set includes the app_id values for these user_id values.
Similar reasoning applies to the use of a subset. If queries are usually only highly selective on some smaller set of the cluster keys, then a subset of the cluster key index would be preferred. For example, if there exists a large subset of queries where you do not filter on the app_id, but you still frequently filter on device_id and user_id to get location values, then you would want to define a secondary cluster key index of (CK3, CK1). This would grant us the ability to "skip" CK2 and parse the rows after CK1 directly.
Non-Cluster Key Column Secondary Indexes
Ocient has a few other types of secondary indexes that can be built on columns not in the clustering key. The index name and the column list are supplied when creating a secondary index, but the index type and index parameters are not specified by the user.
Instead, as data is loaded, the index type and index parameters are chosen automatically on a per-segment basis depending on the statistics of the data that will be indexed.
For requirements on supported column types and DDL syntax, see the Secondary Indexes﻿ page. 
The Create Table Statement
You can use this CREATE TABLE SQL statement to create the table.
SQL
CREATE TABLE example.example_table (
  event_time TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  event_id VARCHAR(255) NOT NULL,
  location VARCHAR(255),
  ...,
  user_id INT NOT NULL,
  app_id INT NOT NULL,
  device_id BINARY(20) NOT NULL,
  CLUSTERING INDEX primary_index (user_id, app_id, device_id),
  INDEX idx01 (device_id, user_id, app_id),
  INDEX idx02 (app_id)
);
CREATE TABLE example.example_table (
  event_time TIMESTAMP TIME KEY BUCKET(1, DAY) NOT NULL,
  event_id VARCHAR(255) NOT NULL,
  location VARCHAR(255),
  ...,
  user_id INT NOT NULL,
  app_id INT NOT NULL,
  device_id BINARY(20) NOT NULL,
  CLUSTERING INDEX primary_index (user_id, app_id, device_id),
  INDEX idx01 (device_id, user_id, app_id),
  INDEX idx02 (app_id)
);
﻿
event_time is the TimeKey. BUCKET(1, DAY) sets the granularity of the time bucket to a fixed-width of one day. For more information on time buckets, see the CREATE TABLE﻿ section.  
primary_index: the primary cluster key
idx01: a secondary cluster key index (a reordering of the primary cluster key index)
idx02: a secondary cluster key index (a subset of the cluster keys)
Permissions
To create a table, you must possess the CREATE TABLE privilege for the current database.
The cluster-key columns must be defined adjacent to one another in the schema.
Indexes Should Not Be Defined Everywhere
Memory overhead: Indexes require significant space on every segment. For example, to create an inverted secondary index on a segment, a vector of row numbers must be created for each distinct value in the segment. This overhead grows quickly.
Loading overhead: Indexes are created during segment generation, so loading time is heavily impacted by the number of indexes that must be built. 
Indexes are built during segment generation, so old segments do not include index data for newly created indexes. Indexes are only built for data loaded after the index creation.
Observability
System tables exist to inspect details of indexes for tables:
sys.indexes: id, name, index_type, table_id, index_use
sys.index_columns: index_id, column_id, ordinal, ascending, column_ordinal
sys.segment_parts: segment_group_id, ida_offset, name, part_type, size (can see the size of index segment parts), segment_ida_offset, segment_lba_offset
Read more about the system catalog in System Catalog Reference﻿ documentation.
Related Links
﻿Load Data﻿﻿
﻿Query Ocient﻿﻿
﻿CREATE TABLE﻿﻿
Related Videos
﻿At the Whiteboard with Ocient: Indexing at Hyperscale﻿
event_time	event_id	location	…	user_id (CK1)	app_id (CK2)	device_id (CK3)
9031994	93-125-3235	Indianapolis	…	918	8334	09-325
9032034	58-392-0921	Chicago	…	918	8335	88-042
9231994	48-132-0932	Cincinnati	…	1312	9115	13-743
12252021	11-054-8754	Chicago	…	1315	401	88-042
Updated 20 May 2024
Did this page help you?
Authentication Methods
Secondary Indexes