Join Optimization in Ocient
Before starting any join operation, the Ocient System automatically determines the optimal plan for combining tables. The system determines the best order for joining tables to reduce processing and storage demands while increasing performance. This process includes:- Optimizing I/O with Early Filtering — The execution engine uses various techniques to dynamically reduce input and output (I/O) for join operations by filtering unneeded values.
- Determining join order — The optimizer automatically reorders tables for efficiency, usually from smallest to largest in terms of row count, regardless of how you order tables in the SQL statement.
- Selecting the join strategy — The system can override user-specified join types with more performant alternatives. For example, the system can convert an
OUTER JOINto anINNER JOINwhen the result remains semantically equivalent.
JOIN queries or configure table definitions to designate a specific table order. The optimizer ensures efficient execution plans even if you do not specify optimal table or join conditions.
Join performance does not generally benefit from other table configurations for query optimization, such as , Clustering Keys, secondary indexes, or compression.
How Ocient Joins Tables
The Ocient System executes allJOIN operations within SQL statements as either hash or product joins:
Hash Joins
- The system uses hash joins automatically whenever your
JOINpredicate includes at least one equality comparison (e.g.,ON a.column_1 = b.column_2). - The system splits and shuffles data so the hash join executes in parallel. This shuffling enables great scalability and performance, even with large tables.
- If your
JOINpredicate does not include an equality condition, the system uses a product join (a cartesian join with a filter). For example, theJOINpredicateON a.column_1 > b.column_2uses a product join because it is not an equality condition. - In a product join, the system broadcasts the smaller table of the join to all compute nodes, and then compares every row according to your condition.
- Product joins perform more slowly when operating on large tables, so it is generally better to use an equality comparison in your
JOINSQL statements when possible.
Join Storage Considerations
The performance of SQL join operations is limited by the available system memory, specifically the size of the huge pages configured on the server. During the execution of hash joins, the system typically loads the smaller table of the join into memory. Spill to Disk If the memory required for the join operation exceeds what is physically available, the system spills intermediate data to disk, utilizing swap or temporary files. This reliance on disk I/O can significantly degrade query performance compared to in-memory processing, but it ensures that the system can complete the query without failure. Temporary Storage Requirements When intermediate results or hash tables exceed available memory, the join operation adjusts to using temporary disk space approximately equal to the combined sizes of the participating tables (both the build and probe sides). Resource Contention Because joins are resource-intensive, heavy join workloads can affect overall system throughput and potentially impact the performance of concurrent queries, particularly in systems without strict workload management configurations.Join Best Practices
These best practices help you design join queries that balance performance, scalability, and maintainability. While the Ocient optimizer handles most join operations automatically, applying these techniques can further reduce overhead and improve query efficiency. Many of these tips are common for OLAP database administration and reflect proven approaches across analytical systems.Moderate Denormalization
In columnar analytical databases, moderate denormalization is a recommended best practice. This denormalization means selectively embedding low-cardinality, frequently used dimension attributes into a fact table to reduce the number of joins. For example, duplicating a product category column in a sales fact table can remove the need for a separate join to the product dimension, while still leaving high-cardinality tables normalized. Small dimension tables (less than 1 million rows) can also benefit from Global Dictionary Compression (GDC). With GDC, the system can share dictionaries across tables, improving both performance and storage efficiency.Query Design Considerations
Beyond general join strategies, small design choices in your SQL statements can impact performance significantly. These practices help ensure that the optimizer can make efficient decisions and avoid unnecessary overhead during execution. Use Window Aggregate Functions Instead of Self-Joins In many cases, window aggregate functions can replace self-joins, especially for time period comparisons. Window Aggregate Example (Most Performant) In this example, theLAG function retrieves the revenue from the previous month within the same result set, allowing you to calculate month-over-month changes without requiring a self-join.
SQL
monthly_sales table twice, once as curr and again as prev. This approach produces the same result as the window function but requires more overhead.
SQL
SQL

