Global Dictionary Compression
Global Dictionary Compression (GDC) is a feature of the database that compresses variable length column data using a dictionary encoder. Instead of storing the variable length data directly on disk, GDC will seamlessly substitute an integer that corresponds to each unique string.
- Reduced disk usage - A string like "GDC is really cool!" takes up 19 raw bytes. With GDC, it will take at most 4 bytes. This benefit is multiplied by the number of rows in the table and when applied to arrays can lead to dramatic reduction in required storage.
- Fixed-width - Variable length column data can be more difficult for performing selective I/O. It is much faster for to read the 100th element in a list of integers by offsetting 100 times the fixed width of the integer. With variable length types, a full traversal is required, counting how many elements are passed until the 100th is found. This can improve performance in some queries.
- Faster Joins - When joining on a GDC column, equality comparisons can be performed on the GDC integers instead of the variable length data. This can improve performance in some queries.
- Allows Variable-Length Cluster Keys - As of v19, GDC is the only method by which a variable length column can be used as a cluster key index.
- Increased complexity – GDC does add some configuration complexity. It requires a user to know the rough cardinality of their data to size the GDC number of bytes, and adds system configuration options.
- Load complexity – GDC compression adds some overhead to loading data into the Ocient system.
- Not suitable for high-cardinality data – If the variable length column has higher cardinality than this, it should not be stored using GDC.
To create a GDC column, add the COMPRESSION GDC(int) parameter after the column type specifier when creating a column, where int is one of 1, 2, or 4. The integer specifies the number of bytes to be used for the integer keys stored to disk. This also corresponds to the maximum number of unique keys to be stored in a particular column.
GDC(int) | Maximum Number of Unique Values |
---|---|
GDC(1) | 256 |
GDC(2) | 65,536 |
GDC(4) | 4,294,967,296 |
There is a soft limit of 1,000,000 keys even on 4 byte integer GDC columns. Contact Ocient support to evaluate your criteria for changing the limit and to understand the impact of raising limits. See Column Limit for details.
To enable GDC on a column, the keyword is applied in the CREATE TABLE or ALTER TABLE ADD COLUMN statement. An example of GDC on a VARCHAR column:
{column_name} VARCHAR(255) COMPRESSION GDC(2)
A user can also specify that a column should share GDC space with another column–possibly even a column of a different table. This would be useful if the same data is used in multiple columns, and it is often joined in queries. Use COMPRESSION GDC EXISTING schema.table.column_name as shown here:
{column_name} VARCHAR(255) GDC EXISTING {schema.table.column_name}
For arrays of variable length data, GDC will operate on the individual elements of the array. Specify compression after the overall array type:
{column_name} VARCHAR(255)[] COMPRESSION GDC(2)
Elements of tuple columns can be compressed with GDC. Specify compression on the specific type to be compressed:
{column_name} tuple<int, VARCHAR(255) COMPRESSION GDC(2)>
You can use the system catalog to inspect the number of keys used by different columns and the maximum count on each.
Example Query:
When a table with GDC is partially truncated, GDC key mappings for removed rows are not removed. This can result in stale mappings. The only way to remove unwanted mappings is to drop the column and recreate it.
To the end user, a table with GDC columns looks like any other table. However, When creating a table with GDC columns, the table configuration is different in system catalog tables. Instead of a table named schema.tablename, GDC tables leverage a built-in view.
This view and several pieces of metadata will be created in the system catalog:
- A view named schema.tablename is added in sys.views. This view is the user’s representation of the GDC table and will automatically convert the GDC keys to the loaded variable length data. This is the table that the user interacts with for querying data, making alterations, and granting/revoking access.
- For each GDC column in the table, a table named syslookup.schema_tablename_columnname is added in sys.tables. These tables store the mappings from strings to integers for each column. These tables should be interacted with only rarely.
- A table called sysgdc.schema_tablename is added in sys.tables. This is the table that is stored to disk, including all the non-GDC columns and the integers for each GDC column. This table should be interacted with only rarely.
GDC stores key mappings in a Raft consensus log that is maintained by all nodes operating with the Admin role. These mappings are created as the data is loaded, and used in queries to fetch the correct variable length data for the stored integers. Mappings can also be created by certain queries if the key does not already exist.
Because the Admin roles are responsible for maintaining the GDC key mappings, there must be a consensus of Admin roles available on the system in order to load new data that includes GDC columns.