Database Administration
Schema Design

Secondary Indexes

Secondary indexes are an important tool for optimizing database queries. When used effectively, indexes allow the database to skip irrelevant records when it reads data from storage. On large datasets, this can have a significant impact, turning a query that would take minutes into one that finishes in seconds or less. The tradeoff is that secondary indexes require additional disk space and loading resources, so it is generally a good idea to be strategic about which columns you index. A column is usually a good candidate for an index if you expect it to be frequently referenced in queries, especially for searching and filtering (e.g., in a WHERE or ON clause of a SQL statement).

Creating an index does not trigger the re-indexing of existing segments. Only segments that are generated after the execution of the CREATE INDEX statement contain the new index. For that reason, create indexes early, before you load large amounts of data into a table.

You can create indexes on columns containing various data types as long as you meet the requirements in the following table. Depending on the data type, the {{ocient}} system can assign a different index type by default if you do not specify one.

Index Type Requirements and Defaults

This table describes the index types that are assigned by default to columns based on their data type. Container data types have special requirements for using indexes. Unless you explicitly specify an index type, the data type of a column determines the default index type that the system creates.

| Data Type | Requirement | Default Index Type (if not specified) |
| --- | --- | --- |
| INT, BIGINT, SMALLINT, FLOAT, DOUBLE, BINARY, DECIMAL, HASH, CHAR, IP / IPV4, TIME, TIMESTAMP, DATE, UUID | None | INVERTED |
| VARCHAR | None | HASH |
| TUPLE | Inner data types are fixed-length or VARCHAR | HASH |
| ARRAY | Inner data types are fixed-length or VARCHAR, or the inner data type is a TUPLE containing only fixed-length data types | HASH |
| POINT, LINESTRING, POLYGON | None | SPATIAL |

You can apply indexes to columns regardless of whether they use GDC compression. CREATE INDEX on tuples currently supports indexing only individual tuple component columns, not entire tuples.

Column Requirements for Index Optimization

All index types can speed up queries that involve filter expressions on their designated columns. An index takes effect when its specified column is queried with a filter expression of either <column name> <op> <literal> or <literal> <op> <column name>. The column and operator must meet one of these requirements.

| Operator | Requirements |
| --- | --- |
| =, !=, >, <, <=, >=, or BETWEEN | The column must have a non-container data type. |
| IS NULL or IS NOT NULL | The column must be nullable. |
| LIKE, NOT LIKE, =, !=, SIMILAR TO, or NOT SIMILAR TO | The index type is NGRAM. |

Array Operator Requirements

Additionally, columns with arrays can use indexes if any of these criteria are met:

- The operator is contains or not contains, i.e., @> :
  <filter expression> = [NOT] (<column name> @> <scalar literal> | <array literal>)
- The operator is is contained by or not is contained by, i.e., <@ :
  <filter expression> = [NOT] (<scalar literal> | <array literal> <@ <column name>)
- The operator is overlap or not overlap, i.e., && :
  <filter expression> = [NOT] (<column name> && <array literal>) | (<array literal> && <column name>)
- The filter expression uses the FOR SOME or FOR ALL function and the operator is =, !=, >, <, <=, or >= :
  <filter expression> = [NOT] (FOR ALL(<column name>) <op> <literal>) | (FOR SOME(<column name>) <op> <literal>)

BETWEEN is not supported for the FOR SOME or FOR ALL operators with filter expressions. If the index type is NGRAM, the operator can also be LIKE, NOT LIKE, SIMILAR TO, or NOT SIMILAR TO.

For syntax information for secondary indexes, see the Database, Tables, Views, and Indexes docid\ a4jvhkvg31tapexr9zpcq reference section.
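As a brief sketch of these guidelines, the statements below use a hypothetical events table with an event_time TIMESTAMP column and a user_id BIGINT column; the table, column, and index names are illustrative only. The index is created before any data is loaded, and the subsequent query filters the indexed column with a <column name> <op> <literal> expression so that the index can take effect.

-- BIGINT is a fixed-length type, so this index defaults to INVERTED
-- when no index type is specified.
CREATE INDEX idx_user_id ON events (user_id);

-- A filter of the form <column name> <op> <literal> lets the database
-- use the index to skip segments that cannot contain matching rows.
SELECT event_time, user_id
FROM events
WHERE user_id = 1234567;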
Secondary Index Types

Ocient supports four index types alongside the clustering index: INVERTED, HASH, NGRAM, and SPATIAL. An index notionally stores a mapping from each column value to the rows that contain that value, and the index type determines the format in which the column values are stored and accessed. For container data types (e.g., arrays and tuples), the index stores the internal elements of the container and is used on predicates that target those internal values. However, a mapping of NULL column values is generally stored for both scalar and container data types, so the index can always be used for column IS NULL predicates.

Inverted Index Type

An inverted index supports fixed-length column types. The value stored internally in the index is the whole column value, and its storage size is the same as the width of the data type. This index allows for exact lookups from a column value to rows and supports strict equality comparisons and range comparisons (e.g., <, >, >=, <=, BETWEEN). The index supports equality predicates that perform exact matches on array values (e.g., contains, overlaps, and the FOR ALL() and FOR SOME() functions that use equals or IS NULL), as well as range comparisons on array values using FOR ALL() and FOR SOME(). The system does not use the inverted index for all range comparisons on an indexed column. Instead, the system determines in advance when it is more performant to use the inverted index for a range comparison. For example, ranges that cover a smaller number of values are more likely to utilize the index.

Data type: Fixed-length columns (INT, BIGINT, SMALLINT, FLOAT, DOUBLE, BINARY, DECIMAL, HASH, CHAR, IP / IPV4, TIME, TIMESTAMP, DATE, UUID)
Function: Supports range and exact matches. The system stores the whole column value (storage is the same as the data type width).
Supported operators: =, !=, <, >, >=, <=, BETWEEN
Array operators and supported functions: FOR ALL(), FOR SOME()

Hash Index Type

The system supports a hash index on variable-length column types. A hash index stores a hash of the indexed column value rather than the full value. You can use hash indexes for exact comparisons, and they support equality and contains or overlaps filters. The system supports range operators, such as < and >, and uses sorting logic to determine how VARCHAR values compare to each other. The internal implementation of a hash index requires additional record filtering, making the hash index more computationally intensive than the inverted index. The additional filtering is attributable to the hash index being a probabilistic index type: because hash collisions are possible, values returned by the index might include false positives that the database must filter internally during query execution to remove invalid results. Do not use the hash index for variable-length geospatial types, including LINESTRING and POLYGON; for better results, use the spatial index instead. In most cases, a hash index on a VARCHAR column requires less storage than an NGRAM index.

Data type: Variable-length columns (VARCHAR, VARBINARY)
Function: Supports range and exact matches. The system stores a hash of the indexed column value.
Supported operators: =, !=, <, >, >=, <=, BETWEEN
Array operators and supported functions: @>, <@, &&, FOR ALL(), FOR SOME()
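As a minimal sketch of the two index types described above, the statements below assume a hypothetical sessions table with a fixed-length duration_ms BIGINT column and a variable-length token VARCHAR column; the table, column, and index names are illustrative only.

-- An inverted index stores the full value and serves both exact and range filters.
CREATE INDEX idx_duration ON sessions (duration_ms) USING INVERTED;

-- A hash index stores only a hash of the value; it is best suited to exact matches,
-- and false positives from hash collisions are filtered out during query execution.
CREATE INDEX idx_token ON sessions (token) USING HASH;

-- Queries such as these can take advantage of the indexes.
SELECT COUNT(*) FROM sessions WHERE duration_ms BETWEEN 1000 AND 5000;
SELECT COUNT(*) FROM sessions WHERE token = 'a1b2c3';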
NGRAM Index Type

The Ocient system supports an NGRAM index on variable-length column types. An NGRAM index stores substrings of size n of the column values. For example, using the default n value of 3, the column value "ocient" divides into the 3-grams { "oci", "cie", "ien", "ent" }. The system stores these 3-grams separately in the index and can find each one individually. The NGRAM index type helps optimize queries that involve either exact or partial matching filters. When the system uses an NGRAM index with a filter string, the database divides the filter value into n-grams in the same manner as the column values, and the system finds each filter n-gram in the index. NGRAM index results are probabilistic because the rows matched by the index can include strings that contain the right n-grams but do not satisfy the filter predicate. As a result, the Ocient query engine must filter the column values returned from the index lookup for valid matches.

The database uses the index to speed up string similarity where possible, but the index does not support all filters. At a minimum, a filter literal must contain at least one sequence of characters of length n that must be present in the resulting column value. For instance, LIKE and SIMILAR TO expressions that are composed entirely of wildcards cannot use the NGRAM index.

For an example of partial matching, see this query on a table that has an NGRAM(3) index on the example_column column.

SELECT example_column FROM example_table WHERE example_column LIKE 'ocie%';

The filter string 'ocie%' results in the filter 3-grams { "oci", "cie" }. The database finds these 3-grams in the index and returns any rows that contain those substrings, for example, "ocient", "the ocient database", and "cieoci". Of these strings, only "ocient" satisfies the LIKE filter. The database filters out the false positives and returns the valid rows.

For an exact-matching example, see this query on the same table with the NGRAM(3) index.

SELECT example_column FROM example_table WHERE example_column = 'ocie%';

In this case, the database uses similar logic by finding the substrings { "oci", "cie", "ien" } in the index, and it returns only the strings that exactly match the input string. While NGRAM can significantly increase performance on equality matches, it is generally slower than a hash index.

Data type: Variable-length columns (VARCHAR, VARBINARY)
Function: Supports exact and partial matches. The system stores substrings of size n of the column values.
Supported operators: =, !=, <, >, >=, <=, BETWEEN, LIKE, SIMILAR TO
Array operators and supported functions: @>, <@, &&, FOR ALL(), FOR SOME()
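To consolidate the NGRAM example above into a runnable form, this sketch assumes the hypothetical example_table and example_column names used in this section; the substring size of 3 is the default.

-- Create the NGRAM index before loading data into the table.
CREATE INDEX idx_example_ngram ON example_table (example_column) USING NGRAM(3);

-- Partial match: the filter 3-grams { "oci", "cie" } are looked up in the index,
-- and false positives such as 'cieoci' are filtered out afterward.
SELECT example_column FROM example_table WHERE example_column LIKE 'ocie%';

-- Exact match: the index narrows the candidate rows, and only values equal
-- to the literal are returned.
SELECT example_column FROM example_table WHERE example_column = 'ocie%';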
Spatial Index Type

The spatial index type helps expedite queries on geographic data types, including POINT, LINESTRING, and POLYGON values. In a table with a column of these objects, a spatial index finds rows quickly using a geographic literal filter. This index groups geographic objects together by using a packed Hilbert R-tree algorithm, which allows queries to search rows quickly by traversing only a subsection of the full data set. For array types, the spatial index is limited to a maximum of 1 million items per row.

Data type: Geographic columns that contain POINT, LINESTRING, or POLYGON values
Function: Supports bounding box matches that can traverse a Hilbert R-tree structure.
Supported operators: Supports Geospatial Functions docid\ hpngsvdeojhofhdmpyrmh including &&, @, and =, and supports the spatial relationship functions; see Spatial Relationships docid\ nrkafmnhcwwnlf2 p16rm.
Array operators and supported functions: FOR ALL(), FOR SOME(). Array operators can use the supported Geospatial Functions docid\ hpngsvdeojhofhdmpyrmh, but not the spatial relationship functions.

Create Secondary Indexes Based on Anticipated Queries

These examples demonstrate how to set up indexes based on the queries that you expect to execute frequently on a table. Each example involves designing an index to expedite queries on a set of telecommunications data. These queries include:

- Exact matches for approximate location data (ipaddress)
- Range filtering for data usage (totalpackets)
- Partial matching for requests to specific URLs (httprequesturl)
- Bounding box matching for exact location data (location)

These examples use a shared table of internet data records, ip_data. The columns and their data types in this table are:

- starttime (TIMESTAMP)
- endtime (TIMESTAMP)
- totalpackets (BIGINT)
- httphost (CHAR), nullable
- httprequesturl (CHAR), nullable
- httpstatuscode (SMALLINT), nullable
- httpuseragent (CHAR), nullable
- destinationport (INT)
- sourceport (INT)
- ipaddress (IPV4)
- ipprotocol (BYTE)
- location (POINT)
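For reference, this is a minimal sketch of how the ip_data table might be declared. The CHAR lengths and NOT NULL constraints are assumptions, and an actual Ocient table definition typically includes storage and clustering options that are omitted here.

CREATE TABLE ip_data (
    starttime TIMESTAMP NOT NULL,
    endtime TIMESTAMP NOT NULL,
    totalpackets BIGINT NOT NULL,
    httphost CHAR(128),        -- nullable
    httprequesturl CHAR(2048), -- nullable
    httpstatuscode SMALLINT,   -- nullable
    httpuseragent CHAR(256),   -- nullable
    destinationport INT NOT NULL,
    sourceport INT NOT NULL,
    ipaddress IPV4 NOT NULL,
    ipprotocol BYTE NOT NULL,
    location POINT NOT NULL
);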
Optimize an Exact Match Query Using an Inverted Index

This query uses the ip_data table to analyze the location of a customer within an approximate time range. For this use case, a secondary index helps optimize this query template.

SELECT starttime, CHAR(ipaddress), sourceport, destinationport, totalpackets
FROM ip_data
WHERE starttime BETWEEN '<start time>' AND '<end time>' AND ipaddress = '<ip address>'
LIMIT 10;

The crucial column is ipaddress, as it is the crux of the exact matching used in this query. If data storage is not a problem, this query is ideal for an inverted index, which supports exact matching for the IPV4 data type of the ipaddress column. Create an inverted index on this column using this SQL statement.

CREATE INDEX destination_ipaddress ON ip_data (ipaddress) USING INVERTED;

Optimize a Range Query Using an Inverted Index

Similar to the exact match query, this query also includes a range filter on the numeric value in the totalpackets column.

SELECT starttime, CHAR(ipaddress), sourceport, destinationport, totalpackets
FROM ip_data
WHERE starttime BETWEEN '<start time>' AND '<end time>' AND ipaddress = '<ip address>' AND totalpackets BETWEEN 200 AND 400
LIMIT 10;

An inverted index is the best choice for this use case. This index supports range filtering for all fixed-length types, including the BIGINT values in the totalpackets column. Create an inverted index on this column using this SQL statement.

CREATE INDEX totalpackets ON ip_data (totalpackets) USING INVERTED;

Optimize an Exact String Match Using a Hash Index

In this case, filter all records where customers accessed the exact web address ocient.com.

SELECT starttime, CHAR(ipaddress), sourceport, destinationport, httprequesturl
FROM ip_data
WHERE starttime BETWEEN '<start time>' AND '<end time>' AND httprequesturl = 'ocient.com'
LIMIT 10;

If you expect to filter the httprequesturl column only with exact matches, the hash index is usually your best choice. The hash index can greatly speed up exact matching on variable-length data types, and it requires less storage than an NGRAM index except in rare edge cases.

CREATE INDEX httprequesturl ON ip_data (httprequesturl) USING HASH;

Optimize a Partial String Match Using an NGRAM Index

For this use case, filter all records where customers accessed a partial or exact web address. The LIKE clause in this query template filters all records for customers who accessed the facebook.com URL.

SELECT starttime, CHAR(ipaddress), sourceport, destinationport, httprequesturl
FROM ip_data
WHERE starttime BETWEEN '<start time>' AND '<end time>' AND httprequesturl LIKE '%facebook.com%'
LIMIT 10;

An NGRAM index is the best choice for this use case. This index supports partial matching for variable-length data types, including the character string data in the httprequesturl column. Create an NGRAM index on this column using this SQL statement.

CREATE INDEX httprequesturl ON ip_data (httprequesturl) USING NGRAM(3);

In this example, the NGRAM index divides the specified column into substrings of three characters, which is the default. This means that partial strings such as "fac", "ace", "boo", and other variants are captured in the filter for this query. You can specify a broader index by setting a higher substring size, such as NGRAM(4) or NGRAM(5); using one of these values reduces the data storage requirements.

Optimize a Match Query Using a Spatial Index

For this use case, filter all records where customers are located in a specific area at a specific time. This query represents the target area by the bounding box POLYGON((1 2, 1 3, 1 4, 1 2)).

SELECT starttime, CHAR(ipaddress), sourceport, destinationport, httprequesturl, location
FROM ip_data
WHERE starttime BETWEEN '<start time>' AND '<end time>' AND ST_CONTAINS(location, POLYGON((1 2, 1 3, 1 4, 1 2)))
LIMIT 10;

A spatial index is the best choice for this use case because it supports a variety of functions for filtering geospatial data in bounding box areas. Create a spatial index on the location column by using this SQL statement.

CREATE INDEX location ON ip_data (location) USING SPATIAL;

Related Links

Ingest Data with Legacy LAT Reference docid ul5nilgyd7y wjccg59d
Query Ocient docid\ irpwvlqbnvw azqozjr0a
Database, Tables, Views, and Indexes docid\ a4jvhkvg31tapexr9zpcq

Related Videos

At the Whiteboard with Ocient: Indexing at Hyperscale