Secondary indexes are extremely important for optimizing database queries. When used effectively, indexes allow the database to skip irrelevant records when reading data from storage. On large datasets, this can have a big impact, turning a query that would take minutes to execute into one that finishes in seconds or less. The tradeoff for using secondary indexes is that they do require additional disk space and loading resources. It is generally a good idea to be strategic about which columns you index. A column is usually a good candidate for an index if you expect it to be frequently referenced in queries, especially for searching and filtering (e.g., in a WHERE or ON clause of a SQL statement). Creating an index does not trigger the re-indexing of existing segments. Only segments that are generated after the execution of theDocumentation Index
Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
Use this file to discover all available pages before exploring further.
CREATE INDEX statement contain the new index. For that reason, create indexes early on before you load large amounts of data into a table.
You can create indexes on columns containing various data types as long as you meet the requirements in this table. Depending on the data type, the System can assign a default index type if you decline to specify one.
Index Type Requirements and Defaults
This table describes index types that are assigned by default to columns based on the data type. Container data types have special requirements for using indexes. Unless you explicitly specify an index type, the data type of a column determines a default index type for the system to create.| Data type | Requirement | Default Index Type (if not specified) |
|---|---|---|
| INT, BIGINT, SMALLINT, FLOAT, DOUBLE, BINARY, DECIMAL, HASH, CHAR, IP / IPV4, TIME, TIMESTAMP, DATE, UUID | None | INVERTED |
| VARCHAR | None | HASH |
| TUPLE | Inner data types are fixed length or VARCHAR | HASH |
| ARRAY | Inner data types are fixed length or VARCHAR. or Inner datatype is a tuple containing only fixed-length data types | HASH |
| POINT, LINESTRING, POLYGON | None | SPATIAL |
You can apply indexes regardless of whether they have GDC compression.
CREATE INDEX on tuples currently supports indexing only on individual tuple component columns and not entire tuples.Column Requirements for Index Optimization
All index types can speed up queries that involve filter expressions on their designated columns. An index takes effect when its specified column is queried with a filter expression of either\<column_name> <op> <literal> or <literal> <op> \<column_name>.
The column and operator must meet one of these requirements.
| Operator | Requirements |
|---|---|
=, !=, >, <, <=, >=, or BETWEEN. | The column must have a non-container data type. |
IS NULL or IS NOT NULL | Column must be nullable. |
LIKE, NOT LIKE, =, !=, SIMILAR TO, or NOT SIMILAR TO | Index type is NGRAM. |
| Operator or Functio****n | Syntax Example |
|---|---|
The operator is contains or *not contains, *i.e., @>. | \<filter_expression> = [NOT] (\<column_name> @> \<scalar_literal> | \<array_literal>) |
The operator is is contained or *not is contained by, *i.e., <@. | \<filter_expression> = [NOT] (\<scalar_literal> | \<array_literal> \<@ \<column_name>) |
The operator is *overlap *or *not overlap, *i.e., &&. | \<filter_expression> = [NOT] (\<column_name> && \<array_literal>) | (\<array_literal> && \<column_name>) |
The filter expression uses the for_some or for_all function. The operator is = ,!=, >, <, <=, or>=. | \<filter_expression> = [NOT] (for_all(\<column_name>) \<op> \<literal>) | (for_some(\<column_name>) \<op> \<literal>) |
BETWEEN is not supported for for_some or for_all operators with filter expressions. If the index type is NGRAM, the operator can also be LIKE, NOT LIKE, SIMILAR TO, or NOT SIMILAR TO.Secondary Index Types
Ocient supports four index types alongside the clustering index:INVERTED, HASH, NGRAM, and SPATIAL. An index notionally stores a mapping of the column value to the rows that contain that value, and the index type differentiates the format in which the column values are stored and accessed.
For container data types (e.g., arrays and tuples), the index stores the internal elements of the container and is used on predicates that target the internal values. However, a mapping of NULL column values is generally stored for both scalar and container data types, so the index can always be used for column IS NULL predicates.
INVERTED Index Type
An INVERTED index supports fixed-length column types. The value stored internally in the index is the whole column value, and its storage size is the same as the width of the data type.
This index allows for exact lookups from a column value to rows and supports strict equality comparisons and range comparisons (e.g., <, >, >=, \<=, BETWEEN). The index supports EQUALS predicates that perform exact matches on array values (e.g., CONTAINS, OVERLAPS, and for_all() and for_some() functions that use EQUALS or IS NULL), and range comparisons on array values using for_all() and for_some().
The system does not use the inverted index for all range comparisons on an index column. Instead, the system determines in advance when it is more performant to use the inverted index for range comparisons. For example, ranges that cover a smaller number of values are more likely to utilize the index.
| Data Type | Function | Supported Operators | Array Operators and Supported Functions |
|---|---|---|---|
| Fixed-length columns INT, BIGINT, SMALLINT, FLOAT, DOUBLE, BINARY, DECIMAL, HASH, CHAR, IP / IPV4, TIME, TIMESTAMP, DATE, UUID | Supports range and exact matches. The system stores the whole column value (storage is the same as data-type width). | =, !=, <, >, >=, <=, BETWEEN | for_all(), for_some() |
HASH Index Type
The system supports a HASH index on variable-length column types, which stores a hash of the indexed column value rather than the full value. You can use HASH indexes for exact comparisons, and they support equality and CONTAINS or OVERLAPS filters. The system supports range operators, such as < and >, and uses sorting logic to determine how VARCHAR types compare to each other.
The internal implementation of a HASH index requires additional record filtering, making the HASH index more computationally intensive than the INVERTED index. The additional filtering is attributable to the HASH index being a probabilistic index type; because hash collisions are possible, values returned by the index might include false positives that the database must filter internally during query execution to remove invalid results.
Do not use the HASH index for variable-length geospatial types, including LINESTRING and POLYGON. For better results, use the SPATIAL index instead.
In most cases, a HASH index on a VARCHAR column requires less storage than an NGRAM index.
| Data Type | Function | Supported Operators | Array Operators and Supported Functions |
|---|---|---|---|
| Variable-length columns VARCHAR, VARBINARY | Supports range and exact matches. The system stores a hash of the indexed column. | =, !=, <, >, >=, <=, BETWEEN | @>, <@, &&, for_all(), for_some() |
NGRAM Index Type
The Ocient System supports an NGRAM index on variable-length column types. An NGRAM index stores substrings of size n of the column values. For example, using the default n_value size of 3, you can divide the column value Ocient into the 3-grams { “Oci”, “cie”, “ien”, “ent” }. The system stores these 3-grams separately in the index and can find them individually.
The NGRAM index type helps optimize queries that involve either exact or partial matching filters. When the system uses an N-gram index with a filter string, the database divides the filter value into N-grams in the same manner as the column values, and the system finds each filter N-gram in the index.
NGRAM index results are probabilistic because the rows matched by the index can include strings that contain the right N-grams but do not satisfy the filter predicate. As a result, the Ocient query engine must filter column values returned from the index lookup for valid matches. The database uses the index to speed up string similarity where possible, but the index does not support all filters. At a minimum, a filter literal must contain at least one sequence of characters of the n_value length that must be present in the resulting column value. For instance, LIKE and SIMILAR TO expressions that are composed entirely of wildcards cannot use the NGRAM index.
For an example of partial matching, see this query on a table that has an NGRAM(3) index on the example_column column.
SQL
'Ocie%' results in the filter 3-grams { “Oci”, “cie” } . The database finds these 3-grams in the index and returns any rows that contain those substrings, for example, “Ocient”, “The Ocient database”, and “cieOci”. Of these strings, only “Ocient” satisfies the LIKE filter. The database filters out false positives and returns the valid rows.
For an exact matching example, see this query on the same table with the NGRAM(3) index.
SQL
NGRAM can significantly increase performance on equality matches, it is generally slower than a HASH index.
| Data Type | Function | Supported Operators | Array Operators and Supported Functions |
|---|---|---|---|
| Variable-length columns VARCHAR, VARBINARY | Supports exact and partial matches. System stores substrings of the n_value size of the column values. | =, !=, <, >, >=, <=, BETWEEN, LIKE, SIMILAR TO | @>, <@, &&, for_all(), for_some() |
SPATIAL Index Type
The SPATIAL index type helps expedite queries on geographic data types, including POINT, LINESTRING, and POLYGON values. In a table with a column of these objects, a SPATIAL index finds rows quickly using a geographic literal filter.
This index groups geographic objects together by using a packed Hilbert R-tree algorithm. The index allows queries to quickly search through rows by traversing only a subsection of the full data set.
For array types, the SPATIAL index is limited to a maximum of 1 million items per row.
| Data Type | Function | Supported Operators | Array Operators and Supported Functions |
|---|---|---|---|
Geographic columns that contain POINT, LINESTRING, or POLYGON values. | Supports bounding-box matches that can traverse a Hilbert R-tree structure. | Supports Geospatial Operators including: &&, @, ~=, ~.Supports Spatial Relationship functions including: ST_CONTAINS ST_CONTAINSPROPERLY ST_COVERS ST_COVEREDBY ST_CROSSES ST_DISJOINT ST_DWITHIN ST_INTERSECTS ST_OVERLAPS ST_TOUCHES ST_WITHIN | for_all(), for_some()Array operators can use the supported Geospatial Operators, but not the Spatial Relationship functions. |
ZONE_MAP Index Type
The ZONE_MAP index type is a lightweight structure that stores the range of values present in a segment for the indexed column, enabling queries to skip over entire segments with no rows matching a compatible filter.
Unlike other secondary indexes that store values for each row, the system stores ZONE_MAP indexes in the metadata of each system segment. The metadata stored for the indexed column includes its minimum and maximum values, as well as any special values (e.g., NULL).
In cases when a compatible filter is outside the bounds of the indexed column, the system can skip processing the segment without performing I/O to fetch raw column data.
To use ZONE_MAP indexes most effectively, use them on columns that are also included in a table clustering key because these columns set how the system orders segments on disk. For details, see Clustering Key.
| Data Type | Function | Supported Operators | Array Operators and Supported Functions |
|---|---|---|---|
| Fixed-length columns |
The system stores the minimum and maximum values for the column per segment.
|
=, <, >, >=, <=, BETWEEN| None |
Create Secondary Indexes Based on Anticipated Queries
These examples demonstrate how to set up indexes based on the expected queries that are frequently executed on the table. The examples each involve designing indexes to expedite queries on a set of telecommunications data. These queries include:- Exact matches for approximate location data (
ipaddress). - Range filtering for data usage (
totalpackets). - Partial matching for requests to specific URLs (
httprequesturl) - Bounding box matching for exact location data (
location)
ip_data. The columns and their data types in this table are shown here.
SQL
Optimize an Exact Match Query Using an INVERTED Index
This query uses the ip_data table to analyze the location of a customer within an approximate time range.
For this use case, a secondary index helps optimize this query template.
SQL
ipaddress, as it is the crux for the exact matching used in this query.
If data storage is not a problem, this query is ideal for an INVERTED index. This index supports exact matching for the IPV4 data type of the ipaddress column.
Create an inverted index on this column using this SQL statement.
SQL
Optimize a Range Query Using an INVERTED Index
Similar to the exact match query, this query also includes a range filter for the numeric value in the totalpackets column.
SQL
INVERTED index is the best choice for this use case. This index supports range filtering for all fixed-length types, including the BIGINT values in the totalpackets column.
Create an INVERTED index on this column using this SQL statement.
SQL
Optimize an Exact String Match Using a HASH Index
In this case, filter all records where customers accessed the exact web address ocient.com.
SQL
httprequesturl column only with exact matches, the HASH index is usually your best choice. The HASH index can greatly speed up exact matching on variable-length data types, and it requires less storage than an NGRAM index except in rare edge cases.
SQL
Optimize a Partial String Match Using NGRAM Index
For this use case, filter all records where customers accessed a partial or exact web address. The LIKE clause in this query template filters all records for those customers who accessed the Facebook URL.
SQL
NGRAM index is the best choice for this use case. This index supports partial matching for variable-length data types, including the VARCHAR data type of the httprequesturl column.
Create an NGRAM index on this column using this SQL statement.
SQL
NGRAM index divides the specified column into substrings of three characters, which is the default. This means partial strings such as “Fac”, “ace”, “boo” and other variants are captured in the filter for this query.
You can specify a broader index by setting a higher substring value, such as NGRAM(4) or NGRAM(5). Using one of these values would reduce the data storage requirements.
Optimize a Match Query Using a SPATIAL Index
For this use case, filter all records where customers are located in a specific area at a specific time.
This query represents the target area by the bounding box POLYGON((1 2,1 3,1 4,1 2)).
SQL
SPATIAL index is the best choice for this use case because it supports a variety of functions for filtering geospatial data for bounding box areas.
Create a SPATIAL index on the location column by using this SQL statement.
SQL
Optimize a Range Query Using a ZONE_MAP Index
This query analyzes network traffic originating from a specific range of IP addresses.
SQL
ipaddress), a ZONE_MAP index is highly efficient for this query. This index allows the system to quickly identify and scan only the segments that contain IP addresses within the requested range, skipping all others.
Create a ZONE_MAP index on this column using this SQL statement.
SQL

