Ensemble models improve predictive performance by combining the strengths of multiple individual machine learning models into a single, robust system. Instead of relying on one algorithm, ensemble techniques train a collection of base models and aggregate their predictions to reduce errors, lower variance, and prevent overfitting.
Bagging
Model Type: BAGGING
A bagging model trains multiple base models in parallel on random subsets of the training data and aggregates their predictions. Use the bagging model to reduce variance and prevent overfitting.
The bagging model utilizes child models embedded as options in JSON format (see baseModels).
The bagging model trains all child models independently in parallel. When aggregating all the child model predictions, the bagging model uses majority voting for classification and averaging for regression.
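The two aggregation rules can be sketched in a few lines of Python. This is an illustration of majority voting and averaging in general, not the database's implementation; the function names are hypothetical, and how the bagging model breaks ties between classes is not stated in this text.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(child_predictions):
    """Return the label predicted by the most child models (majority vote)."""
    # most_common(1) yields [(label, votes)] for the top label.
    # Ties resolve to the label seen first; this tie rule is an assumption.
    return Counter(child_predictions).most_common(1)[0][0]


def aggregate_regression(child_predictions):
    """Return the average of the child model predictions."""
    return mean(child_predictions)


# Hypothetical predictions from three child models for one input row.
votes = ["spam", "ham", "spam"]
estimates = [10.0, 12.0, 14.0]

print(aggregate_classification(votes))
print(aggregate_regression(estimates))
```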
A bagging model can use any model type that supports classification or regression.
Model Options
Required
baseModels — A JSON array defining the child models to train. Each object in the array specifies a model type, a count (how many of this model to create), and any options specific to the individual child model.
baseModels JSON Arguments
| Argument | Data Type | Description |
|---|---|---|
| type | string | The model type (e.g., "DECISION TREE", "MULTIPLE LINEAR REGRESSION"). The selected model type should match the selected taskType (e.g., select a classification model if your taskType is CLASSIFICATION). |
| count | integer | The number of instances of this model type to train. |
| options | JSON object | JSON object containing options specific to that child model type, exactly as if you were creating that model directly. For a list of supported options, see the descriptions of model types. |
taskType — Specifies the task type. Accepted values are 'CLASSIFICATION' and 'REGRESSION'.
Optional
maxThreads — An optional integer value that controls the maximum number of child models that the bagging model trains in parallel. This option does not affect the internal threading of each child. If supported, the maxThreads value of each child model controls the internal threading. The default value is 16.
bootstrap — An optional Boolean value. If you set this option to true, the model uses bootstrap sampling with replacement, so each child trains on a random sample where rows can be repeated. Bootstrap sampling requires you to set the noSnapshot option to false. If you set this option to false, the model uses sampling without replacement (each row appears at most one time per child). The default value is false.
rowsPerChild — An optional integer specifying the exact number of rows to use for each child model. 0 means use all available rows (only valid if you do not set the fractionSelected option). The default value is 0.
fractionSelected — An optional double value between 0.0 and 1.0 that specifies the proportion of rows to use for each child model. This option is unusable if you set the rowsPerChild option to greater than 0. The default value is 1.0 (use all rows).
inputsPerChild — An optional integer that specifies the number of features (columns) to use for each child model. The default value is CEIL(total_features / 3).
metrics — An optional Boolean value. If you set this option to true, the model calculates and stores quality metrics on the training data after training completes. The default value is false.
For classification, the quality metrics include:
- Accuracy: The percentage of training rows where the model predicted the correct class.
- Area Under the Curve (AUC): A score from 0 to 1 that measures how well the model distinguishes between two different categories.
For regression, the quality metrics include:
- Root Mean Squared Error (RMSE): The square root of the average squared difference between the predicted and actual values.
- Adjusted R²: The coefficient of determination, adjusted for the number of predictors (features) in the model. This metric indicates how well the independent variables explain the variance in the dependent variable.
featureArray — An optional Boolean value. If you set this option to true, the model expects only one array-type column as input, rather than multiple columns of training data. Each array row in the input column must be the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
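The row and feature sampling options described above (bootstrap, rowsPerChild, fractionSelected, and inputsPerChild) can be pictured with a short Python sketch. The function names are hypothetical, and two details are assumptions not confirmed by this text: that a fractional row count is truncated, and that features are chosen at random.

```python
import math
import random


def rows_for_child(n_rows, bootstrap=False, rows_per_child=0,
                   fraction_selected=1.0, rng=None):
    """Pick the row indexes (0-based) that one child model trains on."""
    if rng is None:
        rng = random.Random()
    if rows_per_child > 0:
        k = rows_per_child                   # exact row count requested
    else:
        k = int(n_rows * fraction_selected)  # truncation is an assumption
    if bootstrap:
        # With replacement: the same row can be drawn more than once.
        return [rng.randrange(n_rows) for _ in range(k)]
    # Without replacement: each row appears at most one time.
    return rng.sample(range(n_rows), min(k, n_rows))


def features_for_child(total_features, inputs_per_child=None, rng=None):
    """Pick 1-based feature indexes; the default count is CEIL(total / 3)."""
    if rng is None:
        rng = random.Random()
    if inputs_per_child is None:
        inputs_per_child = math.ceil(total_features / 3)
    # Random selection is an assumption made for this sketch.
    return sorted(rng.sample(range(1, total_features + 1), inputs_per_child))


rng = random.Random(0)
print(len(rows_for_child(100, fraction_selected=0.5, rng=rng)))
print(features_for_child(10, rng=rng))
```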
Execute the Model
This example creates a bagging model that, in effect, functions as a random forest by training 50 decision trees for a classification task.
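As a minimal sketch of the option values such a model would carry, the following Python builds the baseModels JSON array for 50 decision trees. Python is used only to show the JSON shape. The empty options object and the variable names are assumptions, and the SQL statement that passes these values to the database is outside the scope of this sketch.

```python
import json

# One entry in baseModels: 50 instances of the DECISION TREE model type.
# The empty "options" object is a placeholder; child options are model-specific.
base_models = [
    {"type": "DECISION TREE", "count": 50, "options": {}},
]

# Option names come from the reference above; values are illustrative.
bagging_options = {
    "taskType": "CLASSIFICATION",
    "baseModels": json.dumps(base_models),
}

print(bagging_options["baseModels"])
```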
Boosting
Model Type: BOOSTING
A boosting model is an ensemble technique that trains multiple base models sequentially, where each new model attempts to correct the errors of the combined previous models. Boosting reduces bias and errors in supervised learning tasks.
The boosting model utilizes child models specified as options in JSON format (see the baseModels argument). The boosting model trains its child models one after another, with each child contributing to the final prediction, weighted by a set learning rate.
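The sequential training loop can be sketched in Python with a one-split regression stump standing in for each child model. This is a generic illustration of boosting with squared error, not the database's algorithm; starting from the mean of the targets and all function names are assumptions made for the sketch.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All feature values are identical: fall back to a constant.
        constant = sum(targets) / len(targets)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, count, learning_rate):
    """Train `count` stumps one after another on the remaining errors."""
    baseline = sum(ys) / len(ys)  # assumed starting prediction
    stumps = []
    predictions = [baseline] * len(ys)
    for _ in range(count):
        # Each new child fits the errors of the combined previous children.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: baseline + learning_rate * sum(s(x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
few = fit_boosting(xs, ys, count=1, learning_rate=0.1)
many = fit_boosting(xs, ys, count=100, learning_rate=0.1)
print(mean_squared_error(few, xs, ys), mean_squared_error(many, xs, ys))
```

With a small learning rate, each child corrects only a fraction of the remaining error, which is why a lower rate generally needs a higher count to reach the same training error.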
A boosting model can use any model type that supports regression.
Model Options
Required
baseModels — A JSON array describing the child models to train. Each object in the array specifies a model type, the number of instances of this model to create, and other options that depend on the model type.
baseModels JSON Arguments
| Argument | Data Type | Description |
|---|---|---|
| type | string | The OcientML model type (e.g., "DECISION TREE", "MULTIPLE LINEAR REGRESSION"). The selected model type should match the selected taskType (e.g., select a classification model if your taskType is CLASSIFICATION). |
| count | integer | The number of instances of this model type to train sequentially. |
| options | JSON object | JSON object containing options specific to that child model type, similar to creating that model directly. For a list of supported options of a child model, see the individual descriptions of model types. |
taskType — Specifies the task type. Accepted values are 'CLASSIFICATION' and 'REGRESSION'.
learningRate — A decimal value between 0.0 and 1.0 that tunes how much the model learns from each successive child. Lower values can lead to better generalization, but they require more trees (a higher count value in the baseModels JSON field).
Optional
lossFunction — A string value that determines how the model calculates prediction errors and what type of problem it solves. Accepted values are:
- 'squared_error' — Configures the model for regression tasks. Calculates errors as the squared difference between predicted and actual values. Use this value for predicting continuous numeric values (e.g., prices, temperatures, quantities). The target column must contain numeric values. The lossFunction option defaults to squared_error if you set the taskType option to REGRESSION.
- 'log_loss' — Configures the model for classification tasks. This model uses logistic loss to calculate prediction errors for probability-based predictions. The lossFunction option defaults to log_loss if you set the taskType option to CLASSIFICATION.
fractionSelected — A double value specifying the proportion of rows (0.0 < n <= 1.0) to use for each child model. The default value is 1.0 (use all rows).
inputsPerChild — An integer value specifying the number of features (columns) to use for each child model. When you set this option, the algorithm deterministically cycles through feature subsets. If you do not set this option, the model uses all available features for each child.
maxThreads — An optional integer value that controls the maximum number of child models that the model trains in parallel. This option does not affect the internal threading of each child. If supported, the maxThreads value of each child model controls the internal threading. The default value is 16.
metrics — An optional Boolean value. If you set this option to true, the model calculates and stores quality metrics on the training data after training completes. The default value is false.
For classification, the quality metrics include:
- Accuracy: The percentage of training rows where the model predicted the correct class.
- Area Under the Curve (AUC): A score from 0 to 1 that measures how well the model distinguishes between two different categories.
For regression, the quality metrics include:
- Root Mean Squared Error (RMSE): The square root of the average squared difference between the predicted and actual values.
- Adjusted R²: The coefficient of determination, adjusted for the number of predictors (features) in the model. This metric indicates how well the independent variables explain the variance in the dependent variable.
skipDropTable — An optional Boolean value. If you set this option to false, the database deletes intermediate tables created during training. If you set this option to true, the database prevents deletion of intermediate tables (useful for debugging). The default value is false.
continuousFeatures — An optional comma-separated list of feature indexes (1-based) that are continuous numeric variables (i.e., not categorical or discrete variables).
featureArray — An optional Boolean value. If you set this option to true, the model expects only one array-type column as input, rather than multiple columns of training data. Each array row in the input column must be the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
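For reference, the two lossFunction values named above correspond to standard per-row formulas, sketched here in Python. The binary 0/1 label encoding and the clipping constant are assumptions of this sketch, not details taken from the database.

```python
import math


def squared_error(y_true, y_pred):
    """Squared difference between the actual and predicted value."""
    return (y_true - y_pred) ** 2


def log_loss(y_true, probability):
    """Logistic loss for a label in {0, 1} and a predicted probability."""
    eps = 1e-15  # clip to avoid log(0); the constant is an assumption
    p = min(max(probability, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


print(squared_error(3.0, 1.0))
print(log_loss(1, 0.9), log_loss(1, 0.1))
```

A confident wrong probability is penalized far more heavily by logistic loss than a confident correct one, which is what makes it suitable for probability-based predictions.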
Execute the Model
This example creates a boosting model (similar to Gradient Boosted Trees but using generic boosting logic) that trains 100 decision trees sequentially for a regression task.
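A minimal sketch of the option values for such a model, written in Python only to show the JSON shape of baseModels. The learningRate value of 0.1, the empty child options object, and the variable names are assumptions chosen for illustration.

```python
import json

# 100 decision trees trained one after another.
# The empty "options" object is a placeholder for child-specific options.
boosting_base_models = [
    {"type": "DECISION TREE", "count": 100, "options": {}},
]

boosting_options = {
    "taskType": "REGRESSION",
    "learningRate": 0.1,  # hypothetical value between 0.0 and 1.0
    "baseModels": json.dumps(boosting_base_models),
}

print(boosting_options["baseModels"])
```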
Stacking
Model Type: STACKING
The stacking model combines multiple base models into a single, higher‑accuracy ensemble. A stacking model can use any other machine learning model that supports regression, classification, or clustering.
A stacking model can produce better accuracy than a single model by incorporating the strengths of different models and feature subsets.
Stacking trains two model levels:
- Level‑zero models: one or more base models that train on the original features.
- Level‑one model (meta‑model): a single model that trains on the predictions of the level‑zero models (plus any preserved features), and learns how to weight and combine them.
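The two levels can be illustrated with a small Python sketch. Two level-zero models each fit a line on a different feature, and a level-one model learns a single blend weight over their predictions. All names and data are hypothetical, and the sketch trains the level-one model on predictions for the same rows the level-zero models saw; whether the database uses held-out predictions is not stated in this text.

```python
def fit_line(xs, ys):
    """Level-zero model: least-squares line on a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_blend_weight(preds_a, preds_b, ys):
    """Level-one model: weight w minimizing error of w*a + (1 - w)*b."""
    num = sum((y - b) * (a - b) for a, b, y in zip(preds_a, preds_b, ys))
    den = sum((a - b) ** 2 for a, b in zip(preds_a, preds_b))
    return num / den if den else 0.5


def sum_squared_error(preds, ys):
    return sum((y - p) ** 2 for p, y in zip(preds, ys))


# Hypothetical rows: (feature 1, feature 2, label).
rows = [(1, 5, 7), (2, 3, 7), (3, 6, 12), (4, 2, 10), (5, 4, 14)]
x1 = [r[0] for r in rows]
x2 = [r[1] for r in rows]
ys = [r[2] for r in rows]

# Level zero: each base model trains on its own feature subset.
model_a = fit_line(x1, ys)
model_b = fit_line(x2, ys)
preds_a = [model_a(x) for x in x1]
preds_b = [model_b(x) for x in x2]

# Level one: learn how to weight and combine the base predictions.
w = fit_blend_weight(preds_a, preds_b, ys)
stacked = [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]

print(sum_squared_error(preds_a, ys), sum_squared_error(preds_b, ys),
      sum_squared_error(stacked, ys))
```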
A stacking model cannot use the Vector Autoregression, Feedforward Neural Network, or Association Rules models as level-zero or level-one models.
Model Options
Required
The stacking model utilizes other models embedded as options in JSON format (see levelZeroModels and levelOneModel). Both are required.
levelZeroModels — An array of JSON objects representing one or more base models. Internally, the system maps these models using the same strings as other machine learning model types. Any stacked models have the same features, requirements, and options as they would without stacking.
levelOneModel — A JSON object that defines the single meta‑model that sits on top of all level‑zero models in a stacking ensemble. The object specifies the model type, name, and any model‑specific options for this level‑one model, which is trained on the outputs of the level‑zero models (and any preserved input columns) to produce the final prediction.
JSON strings for both levelZeroModels and levelOneModel share these arguments unless noted otherwise in the description.
| Argument | Data Type | Description |
|---|---|---|
| type | string | The machine learning model type to use for this child (for example, "RANDOM FOREST", "LOGISTIC REGRESSION", "KMEANS"). Must be a valid OcientML model type that supports regression, classification, or clustering. |
| name | string | Optional. A name for the child model. If you omit this argument, the system auto-generates a name by attaching a suffix to the parent model name. |
| options | JSON object | JSON object containing options specific to the chosen model type, exactly as if you were creating that model directly. For a list of options, see the descriptions of the individual model types. |
| ignoreColumns | string | Optional. Used only for levelZeroModels. The value is a comma-separated list of 1-based column indices to exclude from the training input of this level-zero model. If you omit this argument, the model sees all available features, labels, and extra columns. |
| extraCallArguments | string | Advanced optional argument. An array of argument lists. Each element is an array of strings representing extra literal arguments appended when the system executes this child model in the level-one training SQL. Level-zero child models can use any number of argument lists. Level-one child models can use only one argument list. |
Optional
maxThreads — An optional integer value that controls the maximum number of level‑0 child models that the stacking model trains in parallel. This option does not affect the internal threading of each child. If supported, the maxThreads value of each child model controls the internal threading. A higher maxThreads value can reduce total training time for many level‑0 models, but increases concurrent resource usage on the cluster. The default value is 16.
noSnapshot — An optional Boolean value that controls whether the stacking trainer first materializes the training query into an intermediate snapshot table. When you set the noSnapshot option to false, the model creates a temporary snapshot table from the input SQL and trains all level‑0 and level‑1 models against that fixed data set, ensuring consistent data even if the source changes. When you set the noSnapshot option to true, the trainer reads the input query directly without creating a snapshot (which can be faster but requires the underlying data not to change).
If the query is a common table expression (i.e., it starts with a WITH clause), then stacking automatically forces the noSnapshot option value to false and logs a warning.
The default value is false.
featureArray — An optional Boolean value. If you set this option to true, the model expects only one array-type column as input, rather than multiple columns of training data. Each array row in the input column must be the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
hasLabelColumn — Specifies whether the training data includes a label column (necessary for supervised tasks like classification or regression). If you set this option to true, the input query must return features followed by a label column. If you set this option to false, the model assumes the input contains only features (typically for unsupervised tasks, such as clustering or dimension reduction models). The default value is true.
extraColumnCount — A non-negative integer representing the number of columns present after the label column (e.g., for weights). Level-zero models do not use these columns as standard features, but you can pass these columns to the level-one model using the extraCallArguments argument logic. The default value is 0.
preservedColumnsForLevelOne — A list of integers representing column indices (1-based) from the input data. The stacking model passes these specified columns to the level-one model. This option is useful if the level-one model needs to access raw features to improve the base model predictions.
If you do not specify this option, the default value is an empty list (i.e., no original features are preserved).
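The column layout implied by hasLabelColumn, extraColumnCount, and preservedColumnsForLevelOne can be pictured with a short Python sketch that splits one input row into its parts. The function and parameter names are hypothetical, and the sketch assumes preserved column indices refer to feature columns.

```python
def split_columns(row, has_label_column=True, extra_column_count=0,
                  preserved_columns=()):
    """Split one input row into features, label, extras, and preserved values."""
    tail = (1 if has_label_column else 0) + extra_column_count
    n_features = len(row) - tail
    features = list(row[:n_features])
    # The label, when present, directly follows the features.
    label = row[n_features] if has_label_column else None
    # Extra columns (e.g., weights) come after the label column.
    extras = list(row[len(row) - extra_column_count:]) if extra_column_count else []
    # Preserved columns use 1-based indices into the input row.
    preserved = [row[i - 1] for i in preserved_columns]
    return features, label, extras, preserved


# Hypothetical row: three features, a label, and one weight column.
row = (1.0, 2.0, 3.0, "yes", 0.5)
print(split_columns(row, extra_column_count=1, preserved_columns=(2,)))
```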
Execute the Model
This example demonstrates how a stacking model can embed level-zero and level-one child models. The JSON options string includes two level-zero models (random forest and gradient boosted trees) and produces the final prediction using the level-one model (logistic regression). Each model specifies its own options for how to run.
In the example, the random forest model (rf_base) uses 200 decision trees. This model explicitly ignores columns 3 and 4, training only on columns 1 and 2. In contrast, the gradient boosted trees model (gbt_base) uses 300 decision trees and all available columns.
The level-one model uses logistic regression to combine the outputs of the two base models. The model takes the predictions from rf_base and gbt_base as its inputs and produces the final classification result.
When you run the stacking model, it first executes the base models in parallel (up to the maxThreads value). The resulting predictions then go into the logistic regression meta-model to generate the final score.
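The JSON structure described in this example can be sketched in Python as follows. The names rf_base and gbt_base and the type strings "RANDOM FOREST" and "LOGISTIC REGRESSION" come from the text above. The type string "GRADIENT BOOSTED TREES" and the name lr_meta are assumptions, and the options objects are left empty because the option keys that set the tree counts (200 and 300) are not given in this text.

```python
import json

level_zero_models = [
    {
        "type": "RANDOM FOREST",
        "name": "rf_base",
        "ignoreColumns": "3,4",  # train only on columns 1 and 2
        "options": {},           # tree-count option key not shown here
    },
    {
        "type": "GRADIENT BOOSTED TREES",  # assumed type string
        "name": "gbt_base",
        "options": {},                     # uses all available columns
    },
]

level_one_model = {
    "type": "LOGISTIC REGRESSION",
    "name": "lr_meta",  # hypothetical name
    "options": {},
}

stacking_options = {
    "levelZeroModels": json.dumps(level_zero_models),
    "levelOneModel": json.dumps(level_one_model),
}

print(stacking_options["levelZeroModels"])
print(stacking_options["levelOneModel"])
```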

