Secondary Indexes
Secondary indexes are extremely important for optimizing database queries. When used effectively, indexes allow the database to skip irrelevant records when reading data from storage. On large datasets, this can have a big impact, turning a query that would take minutes to run into one that finishes in seconds or less.
The tradeoff for using secondary indexes is they do require additional disk space and loading resources. It is generally a good idea to be strategic about which columns you index. A column is usually a good candidate for an index if you expect it to be frequently referenced in queries, especially for searching and filtering (e.g., in a WHERE or ON clause of a SQL statement).
Creating an index does not trigger re-indexing of existing segments. Only segments that are generated after the execution of the CREATE INDEX statement contain the new index. For that reason, create indexes early on before you load large amounts of data into a table.
You can create indexes on columns containing various different data types as long as you meet the requirements in this table. Depending on the data type, the system can assign different index types by default if you decline to specify this.
This table describes index types that are assigned by default to columns based on the data type. Container data types have special requirements for using indexes.
Unless you explicitly specify an index type, the data type of a column determines a default index type for the system to create.
Data type | Requirement | Default Index Type (if not specified) |
---|---|---|
INT, BIGINT, SMALLINT, FLOAT, DOUBLE, BINARY, DECIMAL, HASH, CHAR, IP / IPV4, TIME, TIMESTAMP, DATE, UUID | None | INVERTED |
VARCHAR | None | HASH |
TUPLE | - Inner data types are fixed length or VARCHAR | HASH |
ARRAY | - Inner data types are fixed length or VARCHAR. OR - Inner datatype is a tuple containing only fixed-length data types | HASH |
POINT, LINESTRING, POLYGON | None | SPATIAL |
You can apply indexes regardless of whether they have GDC compression.
CREATE INDEX on tuples currently supports indexing only on individual tuple component columns, and not entire tuples.
All index types can speed up queries that involve filter expressions on their designated columns.
An index takes effect when its specified column is queried with a filter expression of either <column_name> <op> <literal> or <literal> <op> <column_name>.
The column and operator must meet one of these requirements.
Operator | Requirements |
---|---|
=, !=, >, <, <=, >=, or BETWEEN. | The column must have a non-container data type. |
IS NULL or IS NOT NULL | Column must be nullable. |
LIKE, NOT LIKE, =, !=, SIMILAR TO, or NOT SIMILAR TO | Index type is NGRAM. |
Array operator requirements
Additionally, columns with arrays can use indexes if any of these criteria are met.
Operator or Function | Syntax Example |
---|---|
The operator is contains or not contains, i.e. @>. | <filter_expression> = [NOT] (<column_name> @> <scalar_literal> | <array_literal>) |
The operator is is contained or not is contained by, i.e. <@. | <filter_expression> = [NOT] (<scalar_literal> | <array_literal> <@ <column_name>) |
The operator is overlap or not overlap, i.e. &&. | <filter_expression> = [NOT] (<column_name> && <array_literal>) | (<array_literal> && <column_name>) |
The filter expression uses the for_some or for_all function. The operator is = ,!=, >, <, <=, or>=. | <filter_expression> = [NOT] (for_all(<column_name>) <op> <literal>) | (for_some(<column_name>) <op> <literal>) |
BETWEEN is not supported for for_some or for_all operators with filter expressions. If the index type is NGRAM, the operator can also be LIKE, NOT LIKE, SIMILAR TO, or NOT SIMILAR TO.
For syntax information for secondary indexes, see the CREATE INDEX reference section.
Ocient supports four index types alongside the clustering index: INVERTED, HASH, NGRAM, and SPATIAL. An index notionally stores a mapping of the column value to the rows that contain that value, and the index type differentiates the format in which the column values are stored and accessed.
For container data types (e.g., arrays and tuples), the index stores the internal elements of the container, and is used on predicates that target the internal values. However, a mapping of NULL column values is generally stored for both scalar and container data types, so the index can always be used for column IS NULL predicates.
An INVERTED index supports fixed-length column types. The value stored internally in the index is the whole column value, and its storage size is the same as the width of the data type.
This index allows for exact lookups from a column value to rows, and supports strict equality comparisons and range comparisons (e.g., <, >, >=, \<=, BETWEEN). The index supports EQUALS predicates that perform exact matches on array values (e.g., CONTAINS, OVERLAPS, and for_all() and for_some() functions that use EQUALS or IS NULL), and range comparisons on array values using for_all() and for_some().
The system does not use the inverted index for all range comparisons on an index column. Instead, the system determines in advance when it is more performant to use the inverted index for range comparisons. For example, ranges that cover a smaller number of values are more likely to utilize the index.
Data Type | Function | Supported Operators | Array Operators and Supported Functions |
---|---|---|---|
Fixed-length columns INT, BIGINT, SMALLINT, FLOAT, DOUBLE, BINARY, DECIMAL, HASH, CHAR, IP / IPV4, TIME, TIMESTAMP, DATE, UUID | Supports range, and exact matches. The system stores the whole column value (storage is the same as data-type width). | =, !=, <, >, >=, <=, BETWEEN | for_all(), for_some() |
The system supports a HASH index on variable-length column types, which stores a hash of the indexed column value rather than the full value. You can use HASH indexes for exact comparisons and they support equality and CONTAINS or OVERLAPS filters. The system supports range operators, such as < and >, and uses sorting logic to determine how VARCHAR types compare to each other.
The internal implementation of a HASH index requires additional record filtering, making the HASH index more computationally intensive than the INVERTED index. The additional filtering is attributable to the HASH index being a probabilistic index type; because hash collisions are possible, values returned by the index might include false positives that the database must filter internally during query execution to remove invalid results.
Do not use the HASH index for variable-length geospatial types, including LINESTRING and POLYGON. For better results, use the SPATIAL index instead.
In most cases, a HASH index on a VARCHAR column requires less storage than an NGRAM index.
Data Type | Function | Supported Operators | Array Operators and Supported Functions |
---|---|---|---|
Variable-length columns VARCHAR, VARBINARY | Supports range, and exact matches. The system stores a hash of the indexed column. | =, !=, <, >, >=, <=, BETWEEN | @>, <@, &&, for_all(), for_some() |
The Ocient system supports an NGRAM index on variable-length column types. An NGRAM index stores substrings of size n of the column values. For example, using the default n_value size of 3, you can divide the column value Ocient into the 3-grams { "Oci", "cie", "ien", "ent" }. The system stores these 3-grams separately in the index and can find these individually.
The NGRAM index type helps optimize queries that involve either exact or partial matching filters. When the system uses an N-gram index with a filter string, the database divides the filter value into N-grams in the same manner as the column values, and the system finds each filter N-gram in the index.
NGRAM index results are probabilistic because the rows matched by the index can include strings that contain the right N-grams but do not satisfy the filter predicate. As a result, the Ocient query engine must filter column values returned from the index lookup for valid matches. The database uses the index to speed up string similarity where possible, but the index does not support all filters. At a minimum, a filter literal must contain at least one sequence of characters of the n_value length that must be present in the resulting column value. For instance, LIKE and SIMILAR TO expressions that are composed entirely of wildcards cannot use the NGRAM index.
For an example of partial matching, see this query on a table that has an NGRAM(3) index on the example_column column.
The filter string 'Ocie%' results in the filter 3-grams { "Oci", "cie" } . The database finds these 3-grams in the index and returns any rows that contain those substrings, for example, "Ocient", "The Ocient database", and "cieOci". Of these strings, only "Ocient" satisfies the LIKE filter. The database filters out false positives and returns the valid rows.
For an exact matching example, see this query on the same table with the NGRAM(3) index.
In this case, the database uses similar logic by finding the substrings { "Oci", "cie", "ien" } in the index. The database returns only strings that exactly match the input string. While NGRAM can significantly increase performance on equality matches, it is generally slower than a HASH index.
Data Type | Function | Supported Operators | Array Operators and Supported Functions |
---|---|---|---|
Variable-length columns VARCHAR, VARBINARY | Supports exact and partial matches. System stores substrings of the n_value size of the column values. | =, !=, <, >, >=, \<=, BETWEEN, LIKE, SIMILAR TO | @>, <@, &&, for_all(), for_some() |
The SPATIAL index type helps expedite queries on geographic data types, including POINT, LINESTRING, and POLYGON values. In a table with a column of these objects, a SPATIAL index finds rows quickly using a geographic literal filter.
This index groups geographic objects together by using a packed Hilbert R-tree algorithm. The index allows queries to quickly search through rows by traversing only a subsection of the full data set.
For array types, the SPATIAL index is limited to a maximum of 1 million items per row.
Data Type | Function | Supported Operators | Array Operators and Supported Functions |
---|---|---|---|
Geographic columns that contain POINT, LINESTRING, or POLYGON values. | Supports bounding-box matches that can traverse a Hilbert R-tree structure. | Supports Geospatial Operators including: &&, @, ~=, ~. Supports Spatial Relationship functions including: ST_COVERS ST_WITHIN | for_all(), for_some() Array operators can use the supported Geospatial Operators, but not the Spatial Relationship functions. |
These examples demonstrate how to set up indexes based on the expected queries that are frequently executed on the table.
The examples each involve designing indexes to expedite queries on a set of telecommunications data. These queries include:
- Exact matches for approximate location data (ipaddress).
- Range filtering for data usage (totalpackets).
- Partial matching for requests to specific URLs (httprequesturl)
- Bounding box matching for exact location data (location)
These examples use a shared table of internet data records, ip_data. The columns and their data types in this table are shown here.
This query uses the ip_data table to analyze the location of a customer within an approximate time range.
For this use case, a secondary index helps optimize this query template.
The crucial column is the ipaddress, as it is the crux for the exact matching used in this query.
If data storage is not a problem, this query is ideal for an INVERTED index. This index supports exact matching for the IPV4 data type of the ipaddress column.
Create an inverted index on this column using this SQL statement.
Similar to the exact match query, this query also includes a range filter for the numeric value in the totalpackets column.
An INVERTED index is the best choice for this use case. This index supports range filtering for all fixed-length types, including the BIGINT values in the totalpackets column.
Create an INVERTED index on this column using this SQL statement.
In this case, filter all records where customers accessed the exact web address ocient.com.
If you expect to filter the httprequesturl column only with exact matches, the HASH index in most cases is your best choice. The HASH index can greatly speed up exact matching on variable-length data types, and it requires less storage than an NGRAM index except in rare edge cases.
For this use case, filter all records where customers accessed a partial or exact web address. The LIKE clause in this query template filters all records for those customers who accessed the Facebook URL.
An NGRAM index is the best choice for this use case. This index supports partial matching for variable-length data types, including the VARCHAR data type of the httprequesturl column.
Create an NGRAM index on this column using this SQL statement.
In this example, the NGRAM index divides the specified column into substrings of three characters, which is the default. This means partial strings such as "Fac", "ace", "boo" and other variants are captured in the filter for this query.
You can specify a broader index by setting a higher substring value, such as NGRAM(4) or NGRAM(5). Using one of these values would reduce the data storage requirements.
For this use case, filter all records where customers are located in a specific area at a specific time.
This query represents the target area by the bounding box POLYGON((1 2,1 3,1 4,1 2)).
A SPATIAL index is the best choice for this use case because it supports a variety of functions for filtering geospatial data for bounding box areas.
Create a SPATIAL index on the location column by using this SQL statement.
Load Data
Query Ocient
CREATE INDEX
DROP INDEX