Database Administration
Schema Design

Secondary Indexes

Secondary indexes are an important tool for optimizing database queries. When used effectively, indexes allow the database to skip irrelevant records when it reads data from storage. On large datasets, this can have a significant impact, turning a query that would take minutes into one that finishes in seconds or less. The tradeoff is that secondary indexes require additional disk space and loading resources, so it is generally a good idea to be strategic about which columns you index. A column is usually a good candidate for an index if you expect it to be frequently referenced in queries, especially for searching and filtering (e.g., in a WHERE or ON clause of a SQL statement).

Creating an index does not trigger the re-indexing of existing segments. Only segments that are generated after the execution of the CREATE INDEX statement contain the new index. For that reason, create indexes early, before you load large amounts of data into a table.

You can create indexes on columns containing various data types as long as you meet the requirements in the following table. Depending on the data type, the {{ocient}} system can assign a different index type by default if you do not specify one.

Index Type Requirements and Defaults

This table describes the index types that are assigned by default to columns based on their data type. Container data types have special requirements for using indexes. Unless you explicitly specify an index type, the data type of a column determines the default index type that the system creates.

| Data Type | Requirement | Default Index Type (if not specified) |
| --- | --- | --- |
| INT, BIGINT, SMALLINT, FLOAT, DOUBLE, BINARY, DECIMAL, HASH, CHAR, IP / IPV4, TIME, TIMESTAMP, DATE, UUID | None | INVERTED |
| VARCHAR | None | HASH |
| TUPLE | Inner data types are fixed-length or VARCHAR | HASH |
| ARRAY | Inner data types are fixed-length or VARCHAR, or the inner data type is a TUPLE containing only fixed-length data types | HASH |
| POINT, LINESTRING, POLYGON | None | SPATIAL |

You can apply indexes to columns regardless of whether they use GDC compression. CREATE INDEX on tuples currently supports indexing only individual tuple component columns, not entire tuples.

Column Requirements for Index Optimization

All index types can speed up queries that involve filter expressions on their designated columns. An index takes effect when its specified column is queried with a filter expression of either <column name> <op> <literal> or <literal> <op> <column name>. The column and operator must meet one of these requirements.

| Operator | Requirements |
| --- | --- |
| =, !=, >, <, <=, >=, or BETWEEN | The column must have a non-container data type. |
| IS NULL or IS NOT NULL | The column must be nullable. |
| LIKE, NOT LIKE, =, !=, SIMILAR TO, or NOT SIMILAR TO | The index type is NGRAM. |

Array Operator Requirements

Additionally, columns with arrays can use indexes if any of these criteria are met:

- The operator is contains or not contains, i.e., @> :
  <filter expression> = [NOT] (<column name> @> <scalar literal> | <array literal>)
- The operator is is contained by or not is contained by, i.e., <@ :
  <filter expression> = [NOT] (<scalar literal> | <array literal> <@ <column name>)
- The operator is overlap or not overlap, i.e., && :
  <filter expression> = [NOT] (<column name> && <array literal>) | (<array literal> && <column name>)
- The filter expression uses the FOR SOME or FOR ALL function and the operator is =, !=, >, <, <=, or >= :
  <filter expression> = [NOT] (FOR ALL(<column name>) <op> <literal>) | (FOR SOME(<column name>) <op> <literal>)

BETWEEN is not supported for the FOR SOME or FOR ALL operators with filter expressions. If the index type is NGRAM, the operator can also be LIKE, NOT LIKE, SIMILAR TO, or NOT SIMILAR TO.

For syntax information for secondary indexes, see the Database, Tables, Views, and Indexes docid\ a4jvhkvg31tapexr9zpcq reference section.
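As a brief sketch of these guidelines, the statements below use a hypothetical events table with an event_time TIMESTAMP column and a user_id BIGINT column; the table, column, and index names are illustrative only. The index is created before any data is loaded, and the subsequent query filters the indexed column with a <column name> <op> <literal> expression so that the index can take effect.

-- BIGINT is a fixed-length type, so this index defaults to INVERTED
-- when no index type is specified.
CREATE INDEX idx_user_id ON events (user_id);

-- A filter of the form <column name> <op> <literal> lets the database
-- use the index to skip segments that cannot contain matching rows.
SELECT event_time, user_id
FROM events
WHERE user_id = 1234567;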
Secondary Index Types

Ocient supports four index types alongside the clustering index: INVERTED, HASH, NGRAM, and SPATIAL. An index notionally stores a mapping from each column value to the rows that contain that value, and the index type determines the format in which the column values are stored and accessed. For container data types (e.g., arrays and tuples), the index stores the internal elements of the container and is used on predicates that target those internal values. However, a mapping of NULL column values is generally stored for both scalar and container data types, so the index can always be used for column IS NULL predicates.

Inverted Index Type

An inverted index supports fixed-length column types. The value stored internally in the index is the whole column value, and its storage size is the same as the width of the data type. This index allows for exact lookups from a column value to rows and supports strict equality comparisons and range comparisons (e.g., <, >, >=, <=, BETWEEN). The index supports equality predicates that perform exact matches on array values (e.g., contains, overlaps, and the FOR ALL() and FOR SOME() functions that use equals or IS NULL), as well as range comparisons on array values using FOR ALL() and FOR SOME(). The system does not use the inverted index for all range comparisons on an indexed column. Instead, the system determines in advance when it is more performant to use the inverted index for a range comparison. For example, ranges that cover a smaller number of values are more likely to utilize the index.

Data type: Fixed-length columns (INT, BIGINT, SMALLINT, FLOAT, DOUBLE, BINARY, DECIMAL, HASH, CHAR, IP / IPV4, TIME, TIMESTAMP, DATE, UUID)
Function: Supports range and exact matches. The system stores the whole column value (storage is the same as the data type width).
Supported operators: =, !=, <, >, >=, <=, BETWEEN
Array operators and supported functions: FOR ALL(), FOR SOME()

Hash Index Type

The system supports a hash index on variable-length column types. A hash index stores a hash of the indexed column value rather than the full value. You can use hash indexes for exact comparisons, and they support equality and contains or overlaps filters. The system supports range operators, such as < and >, and uses sorting logic to determine how VARCHAR values compare to each other. The internal implementation of a hash index requires additional record filtering, making the hash index more computationally intensive than the inverted index. The additional filtering is attributable to the hash index being a probabilistic index type: because hash collisions are possible, values returned by the index might include false positives that the database must filter internally during query execution to remove invalid results. Do not use the hash index for variable-length geospatial types, including LINESTRING and POLYGON; for better results, use the spatial index instead. In most cases, a hash index on a VARCHAR column requires less storage than an NGRAM index.

Data type: Variable-length columns (VARCHAR, VARBINARY)
Function: Supports range and exact matches. The system stores a hash of the indexed column value.
Supported operators: =, !=, <, >, >=, <=, BETWEEN
Array operators and supported functions: @>, <@, &&, FOR ALL(), FOR SOME()
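As a minimal sketch of the two index types described above, the statements below assume a hypothetical sessions table with a fixed-length duration_ms BIGINT column and a variable-length token VARCHAR column; the table, column, and index names are illustrative only.

-- An inverted index stores the full value and serves both exact and range filters.
CREATE INDEX idx_duration ON sessions (duration_ms) USING INVERTED;

-- A hash index stores only a hash of the value; it is best suited to exact matches,
-- and false positives from hash collisions are filtered out during query execution.
CREATE INDEX idx_token ON sessions (token) USING HASH;

-- Queries such as these can take advantage of the indexes.
SELECT COUNT(*) FROM sessions WHERE duration_ms BETWEEN 1000 AND 5000;
SELECT COUNT(*) FROM sessions WHERE token = 'a1b2c3';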
NGRAM Index Type

The Ocient system supports an NGRAM index on variable-length column types. An NGRAM index stores substrings of size n of the column values. For example, using the default n value of 3, the column value "ocient" divides into the 3-grams { "oci", "cie", "ien", "ent" }. The system stores these 3-grams separately in the index and can find each one individually. The NGRAM index type helps optimize queries that involve either exact or partial matching filters. When the system uses an NGRAM index with a filter string, the database divides the filter value into n-grams in the same manner as the column values, and the system finds each filter n-gram in the index. NGRAM index results are probabilistic because the rows matched by the index can include strings that contain the right n-grams but do not satisfy the filter predicate. As a result, the Ocient query engine must filter the column values returned from the index lookup for valid matches.

The database uses the index to speed up string similarity where possible, but the index does not support all filters. At a minimum, a filter literal must contain at least one sequence of characters of length n that must be present in the resulting column value. For instance, LIKE and SIMILAR TO expressions that are composed entirely of wildcards cannot use the NGRAM index.

For an example of partial matching, see this query on a table that has an NGRAM(3) index on the example_column column.

SELECT example_column FROM example_table WHERE example_column LIKE 'ocie%';

The filter string 'ocie%' results in the filter 3-grams { "oci", "cie" }. The database finds these 3-grams in the index and returns any rows that contain those substrings, for example, "ocient", "the ocient database", and "cieoci". Of these strings, only "ocient" satisfies the LIKE filter. The database filters out the false positives and returns the valid rows.

For an exact-matching example, see this query on the same table with the NGRAM(3) index.

SELECT example_column FROM example_table WHERE example_column = 'ocie%';

In this case, the database uses similar logic by finding the substrings { "oci", "cie", "ien" } in the index, and it returns only the strings that exactly match the input string. While NGRAM can significantly increase performance on equality matches, it is generally slower than a hash index.

Data type: Variable-length columns (VARCHAR, VARBINARY)
Function: Supports exact and partial matches. The system stores substrings of size n of the column values.
Supported operators: =, !=, <, >, >=, <=, BETWEEN, LIKE, SIMILAR TO
Array operators and supported functions: @>, <@, &&, FOR ALL(), FOR SOME()
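To consolidate the NGRAM example above into a runnable form, this sketch assumes the hypothetical example_table and example_column names used in this section; the substring size of 3 is the default.

-- Create the NGRAM index before loading data into the table.
CREATE INDEX idx_example_ngram ON example_table (example_column) USING NGRAM(3);

-- Partial match: the filter 3-grams { "oci", "cie" } are looked up in the index,
-- and false positives such as 'cieoci' are filtered out afterward.
SELECT example_column FROM example_table WHERE example_column LIKE 'ocie%';

-- Exact match: the index narrows the candidate rows, and only values equal
-- to the literal are returned.
SELECT example_column FROM example_table WHERE example_column = 'ocie%';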
Spatial Index Type

The spatial index type helps expedite queries on geographic data types, including POINT, LINESTRING, and POLYGON values. In a table with a column of these objects, a spatial index finds rows quickly using a geographic literal filter. This index groups geographic objects together by using a packed Hilbert R-tree algorithm, which allows queries to search rows quickly by traversing only a subsection of the full data set. For array types, the spatial index is limited to a maximum of 1 million items per row.

Data type: Geographic columns that contain POINT, LINESTRING, or POLYGON values
Function: Supports bounding box matches that can traverse a Hilbert R-tree structure.
Supported operators: Supports Geospatial Functions docid\ hpngsvdeojhofhdmpyrmh including &&, @, and =, and supports the spatial relationship functions; see Spatial Relationships docid\ nrkafmnhcwwnlf2 p16rm.
Array operators and supported functions: FOR ALL(), FOR SOME(). Array operators can use the supported Geospatial Functions docid\ hpngsvdeojhofhdmpyrmh, but not the spatial relationship functions.

Create Secondary Indexes Based on Anticipated Queries

These examples demonstrate how to set up indexes based on the queries that you expect to execute frequently on a table. Each example involves designing an index to expedite queries on a set of telecommunications data. These queries include:

- Exact matches for approximate location data (ipaddress)
- Range filtering for data usage (totalpackets)
- Partial matching for requests to specific URLs (httprequesturl)
- Bounding box matching for exact location data (location)

These examples use a shared table of internet data records, ip_data. The columns and their data types in this table are:

- starttime (TIMESTAMP)
- endtime (TIMESTAMP)
- totalpackets (BIGINT)
- httphost (CHAR), nullable
- httprequesturl (CHAR), nullable
- httpstatuscode (SMALLINT), nullable
- httpuseragent (CHAR), nullable
- destinationport (INT)
- sourceport (INT)
- ipaddress (IPV4)
- ipprotocol (BYTE)
- location (POINT)
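For reference, this is a minimal sketch of how the ip_data table might be declared. The CHAR lengths and NOT NULL constraints are assumptions, and an actual Ocient table definition typically includes storage and clustering options that are omitted here.

CREATE TABLE ip_data (
    starttime TIMESTAMP NOT NULL,
    endtime TIMESTAMP NOT NULL,
    totalpackets BIGINT NOT NULL,
    httphost CHAR(128),        -- nullable
    httprequesturl CHAR(2048), -- nullable
    httpstatuscode SMALLINT,   -- nullable
    httpuseragent CHAR(256),   -- nullable
    destinationport INT NOT NULL,
    sourceport INT NOT NULL,
    ipaddress IPV4 NOT NULL,
    ipprotocol BYTE NOT NULL,
    location POINT NOT NULL
);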
Optimize an Exact Match Query Using an Inverted Index

This query uses the ip_data table to analyze the location of a customer within an approximate time range. For this use case, a secondary index helps optimize this query template.

SELECT starttime, CHAR(ipaddress), sourceport, destinationport, totalpackets
FROM ip_data
WHERE starttime BETWEEN '<start time>' AND '<end time>' AND ipaddress = '<ip address>'
LIMIT 10;

The crucial column is ipaddress, as it is the crux of the exact matching used in this query. If data storage is not a problem, this query is ideal for an inverted index, which supports exact matching for the IPV4 data type of the ipaddress column. Create an inverted index on this column using this SQL statement.

CREATE INDEX destination_ipaddress ON ip_data (ipaddress) USING INVERTED;

Optimize a Range Query Using an Inverted Index

Similar to the exact match query, this query also includes a range filter on the numeric value in the totalpackets column.

SELECT starttime, CHAR(ipaddress), sourceport, destinationport, totalpackets
FROM ip_data
WHERE starttime BETWEEN '<start time>' AND '<end time>' AND ipaddress = '<ip address>' AND totalpackets BETWEEN 200 AND 400
LIMIT 10;

An inverted index is the best choice for this use case. This index supports range filtering for all fixed-length types, including the BIGINT values in the totalpackets column. Create an inverted index on this column using this SQL statement.

CREATE INDEX totalpackets ON ip_data (totalpackets) USING INVERTED;

Optimize an Exact String Match Using a Hash Index

In this case, filter all records where customers accessed the exact web address ocient.com.

SELECT starttime, CHAR(ipaddress), sourceport, destinationport, httprequesturl
FROM ip_data
WHERE starttime BETWEEN '<start time>' AND '<end time>' AND httprequesturl = 'ocient.com'
LIMIT 10;

If you expect to filter the httprequesturl column only with exact matches, the hash index is usually your best choice. The hash index can greatly speed up exact matching on variable-length data types, and it requires less storage than an NGRAM index except in rare edge cases.

CREATE INDEX httprequesturl ON ip_data (httprequesturl) USING HASH;

Optimize a Partial String Match Using an NGRAM Index

For this use case, filter all records where customers accessed a partial or exact web address. The LIKE clause in this query template filters all records for customers who accessed the facebook.com URL.

SELECT starttime, CHAR(ipaddress), sourceport, destinationport, httprequesturl
FROM ip_data
WHERE starttime BETWEEN '<start time>' AND '<end time>' AND httprequesturl LIKE '%facebook.com%'
LIMIT 10;

An NGRAM index is the best choice for this use case. This index supports partial matching for variable-length data types, including the character string data in the httprequesturl column. Create an NGRAM index on this column using this SQL statement.

CREATE INDEX httprequesturl ON ip_data (httprequesturl) USING NGRAM(3);

In this example, the NGRAM index divides the specified column into substrings of three characters, which is the default. This means that partial strings such as "fac", "ace", "boo", and other variants are captured in the filter for this query. You can specify a broader index by setting a higher substring size, such as NGRAM(4) or NGRAM(5); using one of these values reduces the data storage requirements.

Optimize a Match Query Using a Spatial Index

For this use case, filter all records where customers are located in a specific area at a specific time. This query represents the target area by the bounding box POLYGON((1 2, 1 3, 1 4, 1 2)).

SELECT starttime, CHAR(ipaddress), sourceport, destinationport, httprequesturl, location
FROM ip_data
WHERE starttime BETWEEN '<start time>' AND '<end time>' AND ST_CONTAINS(location, POLYGON((1 2, 1 3, 1 4, 1 2)))
LIMIT 10;

A spatial index is the best choice for this use case because it supports a variety of functions for filtering geospatial data in bounding box areas. Create a spatial index on the location column by using this SQL statement.

CREATE INDEX location ON ip_data (location) USING SPATIAL;

Related Links

Ingest Data with Legacy LAT Reference docid ul5nilgyd7y wjccg59d
Query Ocient docid\ irpwvlqbnvw azqozjr0a
Database, Tables, Views, and Indexes docid\ a4jvhkvg31tapexr9zpcq

Related Videos

At the Whiteboard with Ocient: Indexing at Hyperscale