Database Administration
Schema Design

Time Keys, Clustering Keys, and Indexes

This guide provides an overview of the roles of different types of keys and indexes in . To illustrate these concepts, this guide will refer to the following example table representing users interacting with applications:

event_time

event_id

location

user_id (CK1)

app_id (CK2)

device_id (CK3)

9031994

93-125-3235

Indianapolis

918

8334

09-325

9032034

58-392-0921

Chicago

918

8335

88-042

9231994

48-132-0932

Cincinnati

1312

9115

13-743

12252021

11-054-8754

Chicago

1315

401

88-042

Keys in Ocient

The first engine developed for the database is a time cluster storage engine. It allows for very high retrieval rates of time clustered data. Data is stored in a unit of storage called “segments” along with metadata that is used by queries to optimize performance. To understand the tradeoffs when setting up a table it is first beneficial to understand how the data is retrieved when Ocient processes a query.

Step One - TimeKey

Most queries will include some level of time filtering (on the column chosen to be the ). Ocient begins with coarse filtering on this column (event_time in the table) because each segment incorporates a min/max time value for all rows stored in it. Using this, Ocient can exclude segments without issuing any I/O on the data.

Step Two - The Cluster Key

After the initial time filtering step Ocient takes advantage of the clustered nature of the data on disk to isolate records further. The Cluster Key (CK) physically orders data so Ocient can locate any records that have the same key value very quickly. For this reason, one can only create a single group of cluster key columns. In the example table, note the first cluster key column CK1 (user_id) is ordered, and the paired cluster key column CK2 (app_id) is ordered for each matching value in CK1 (user_id) (i.e. because there are two records with user_id 918, the row containing app_id 8334 is ordered before the row containing app_id 8335).

Choosing cluster key columns which are heavily utilized in queries is pivotal to optimizing query performance.

When you define the cluster key index, you must specify columns in a contiguous and ascending order. The columns must be specified in the order of their location and order in the definition of the CREATE TABLE SQL statement.

By filtering on the chosen TimeKey and reading clustered data off of disk using Cluster Key Columns, Ocient can quickly retrieve rows for further query processing.

Indexes in Ocient

In addition to these keys, there are other indexes a user can create on a table to improve performance:

Secondary Index

Secondary indexes allow for improved performance when queries do not use the Clustering Key (CK). With a Clustering Key (CK1, CK2, CK3), the primary CK index can be used to efficiently satisfy query filters including CK1, (CK1,CK2), or (CK1,CK2,CK3) but not subsets that don’t include CK1 like (CK2,CK3) or (CK3,CK2). Secondary CK indexes make this possible. The ordering of these indexes is typically based on the column combinations expected in queries, taking clustering and cardinality knowledge of the data into account to create the most efficient indexes.

Secondary indexes do NOT change the way data is clustered or stored on disk. Instead they build a new mapping to data locations already on disk. As a result, data will often be read across many different segments, as opposed to consolidating reads on large spans of disk based on the Clustering Key. Consequently, secondary indexes are less efficient than using the primary cluster key index in most situations.

Two types of secondary indexes are available, depending on the column(s) on which they’re created:

  1. Cluster Key derived secondary indexes (these are referred to as “Secondary Cluster Key Indexes”)
  2. Non-Cluster Key column secondary indexes

Secondary Cluster Key Indexes

Secondary Cluster Key Indexes can be derived from the Clustering Key index by either reordering the columns in the Clustering Key or using a subset of the Clustering Key columns.

Reordered Clustering Key

While Clustering Keys define how the data is clustered on disk, a Secondary Cluster Key Index can improve query performance when data must be traversed in a different order than the Clustering Key. For example, if you defined a Secondary Cluster Key Index on the table as (CK3, CK1, CK2) all the device_id’s could be traversed in order. This allows for better query performance on queries that filter for specific device_ids and user_ids but are not restricting CK1 (app_id). Because Ocient will have an index indicating which segments contain data for specific values such as device_id == '88-042' AND user_id >= 918, Ocient can skip reading unmatched segments and deliver faster performance.

Subset of Clustering Key Columns

A user could also define a new Secondary Cluster Key Index by taking a subset containing just CK2 (app_id) if traversing the data on only the app_id, without CK1 (user_id) is used frequently in queries.

When to use a subset vs. a reordering of the cluster key columns to create an index:

A reordering of the cluster keys can be beneficial if the query filtering and selection usually contains all cluster keys but the entire cluster key is not used for filtering. For example, if queries frequently filter device_id and user_id at the same time and you want to retrieve the app_id, you could leverage a secondary cluster key index of (CK3, CK1, CK2). This is the case because you first can look at which rows have the required device_id values in your filter, which gives the user_id values for those device_ids. Then the result set includes the app_id values for these user_id values.

Similar reasoning applies to the use of a subset. If queries are usually only highly selective on some smaller set of the cluster keys, then a subset of the cluster key index would be preferred. For example, if there exists a large subset of queries where you do not filter on the app_id, but you still frequently filter on device_id and user_id to get location values, then you would want to define a secondary cluster key index of (CK3, CK1). This would grant us the ability to "skip" CK2 and parse the rows after CK1 directly.

Non-Cluster Key Column Secondary Indexes

Ocient has a few other types of secondary indexes that can be built on columns not in the clustering key. The index name and the column list are supplied when creating a secondary index, but the index type and index parameters are not specified by the user.

Instead, as data is loaded, the index type and index parameters are chosen automatically on a per-segment basis depending on the statistics of the data that will be indexed.

For requirements on supported column types and DDL syntax, see the Secondary Indexes page.

The Create Table Statement

You can use this CREATE TABLE SQL statement to create the table.

SQL


event_time is the TimeKey. BUCKET(1, DAY) sets the granularity of the time bucket to a fixed-width of one day. For more information on time buckets, see the CREATE TABLE section.

primary_index: the primary cluster key

idx01: a secondary cluster key index (a reordering of the primary cluster key index)

idx02: a secondary cluster key index (a subset of the cluster keys)

Permissions

To create a table, you must possess the CREATE TABLE privilege for the current database.

The cluster-key columns must be defined adjacent to one another in the schema.

Indexes Should Not Be Defined Everywhere

  1. Memory overhead: Indexes require significant space on every segment. For example, to create an inverted secondary index on a segment, a vector of row numbers must be created for each distinct value in the segment. This overhead grows quickly.
  2. Loading overhead: Indexes are created during segment generation, so loading time is heavily impacted by the number of indexes that must be built.

Indexes are built during segment generation, so old segments do not include index data for newly created indexes. Indexes are only built for data loaded after the index creation.

Observability

System tables exist to inspect details of indexes for tables:

  1. sys.indexes: id, name, index_type, table_id, index_use
  2. sys.index_columns: index_id, column_id, ordinal, ascending, column_ordinal
  3. sys.segment_parts: segment_group_id, ida_offset, name, part_type, size (can see the size of index segment parts), segment_ida_offset, segment_lba_offset

Read more about the system catalog in System Catalog Reference documentation.