# Ensemble Models
Ensemble models improve predictive performance by combining the strengths of multiple individual machine learning models into a single, robust system. Instead of relying on one algorithm, ensemble techniques train a collection of base models and aggregate their predictions to reduce errors, lower variance, and prevent overfitting.

## Bagging

Model type: `bagging`

A bagging model trains multiple base models in parallel on random subsets of the training data and aggregates their predictions. Use the bagging model to reduce variance and prevent overfitting.

The bagging model utilizes child models embedded as options in JSON format (see `basemodels`). The bagging model trains all child models independently in parallel. When aggregating the child model predictions, the bagging model uses majority voting for classification and averaging for regression. A bagging model can use any model type that supports classification or regression.

### Model Options

**Required**

- `basemodels` — A JSON array defining the child models to train. Each object in the array specifies a model type, a count (how many of this model to create), and any options specific to the individual child model.

  **`basemodels` JSON Arguments**

  | Argument | Data Type | Description |
  | --- | --- | --- |
  | `type` | string | The model type (e.g., `"decision_tree"`, `"multiple_linear_regression"`). The selected model type should comply with the selected `tasktype` (i.e., select a classification model if your `tasktype` is `classification`). |
  | `count` | integer | The number of instances of this model type to train. |
  | `options` | JSON object | A JSON object containing options specific to that child model type, exactly as if you were creating that model directly. For a list of supported options, see the descriptions of model types. |

- `tasktype` — Specifies the task type, with values `'classification'` or `'regression'`.

**Optional**

- `maxthreads` — An optional integer value that controls the maximum number of child models that the bagging model trains in parallel. This option does not affect the internal threading of each child. If supported, the `maxthreads` value of each child model controls the internal threading. The default value is `16`.
- `bootstrap` — An optional Boolean value. If you set this option to `true`, the model uses bootstrap sampling with replacement: each child trains on a random sample where rows can be repeated. This setting requires you to set the `nosnapshot` option to `false`. If you set this option to `false`, the model uses sampling without replacement (each row appears at most one time per child). The default value is `false`.
- `rowsperchild` — An optional integer specifying the exact number of rows to use for each child model. `0` means use all available rows (only valid if you do not set the `fractionselected` option). The default value is `0`.
- `fractionselected` — An optional double value between `0.0` and `1.0` that specifies the proportion of rows to use for each child model. This option is unusable if you set the `rowsperchild` option to a value greater than `0`. The default value is `1.0` (use all rows).
- `inputsperchild` — An optional integer that specifies the number of features (columns) to use for each child model. The default value is `ceil(total features / 3)`.
- `metrics` — An optional Boolean value. If you set this option to `true`, the model calculates and stores quality metrics on the training data after training completes. The default value is `false`.

  For classification, the quality metrics include:
  - Accuracy — The percentage of training rows where the model predicted the correct class.
  - Area under the curve (AUC) — A score from 0 to 1 that measures how well the model distinguishes between two different categories.

  For regression, the quality metrics include:
  - Root mean squared error (RMSE) — The square root of the average squared difference between the predicted and actual values.
  - Adjusted R² — The coefficient of determination, adjusted for the number of predictors (features) in the model. This metric indicates how well the independent variables explain the variance in the dependent variable.

- `featurearray` — an optional
Boolean value. If you set this option to `true`, the model expects only one array-type column as input, rather than multiple columns of training data. Each array row in the input column must be the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.

### Execute the Model

This example creates a bagging model that, in effect, functions as a random forest by training 50 decision trees for a classification task.

```sql
CREATE MLMODEL my_schema.bagging_classifier TYPE bagging OPTIONS (
    'tasktype' = 'classification',
    'basemodels' = '[ { "type": "decision_tree", "count": 50,
                        "options": { "maxdepth": 8, "splitmetric": "gini_impurity" } } ]',
    'bootstrap' = 'true',
    'fractionselected' = '0.8',
    'maxthreads' = '16',
    'metrics' = 'true'
) AS SELECT feature1, feature2, feature3, label FROM my_schema.training_data;
```

After the training completes, execute the bagging model:

```sql
SELECT my_schema.bagging_classifier(feature1, feature2, feature3) FROM my_schema.scoring_data;
```

## Boosting

Model type: `boosting`

A boosting model is an ensemble technique that trains multiple base models sequentially, where each new model attempts to correct the errors of the combined previous models. Boosting reduces bias and errors in supervised learning tasks.

The boosting model utilizes child models specified as options in JSON format (see the `basemodels` argument). The boosting model trains its child models one after another, with each child contributing to the final prediction, weighted by a set learning rate. A boosting model can use any model type that supports regression.

### Model Options

**Required**

- `basemodels` — A JSON array describing the child models to train. Each object in the array specifies a model type, the number of instances of this model to create, and other options that depend on the model type.

  **`basemodels` JSON Arguments**

  | Argument | Data Type | Description |
  | --- | --- | --- |
  | `type` | string | The OcientML model type (e.g., `"decision_tree"`, `"multiple_linear_regression"`). The selected model type should comply with the selected `tasktype` (i.e., select a classification model if your `tasktype` is `classification`). |
  | `count` | integer | The number of instances of this model type to train sequentially. |
  | `options` | JSON object | A JSON object containing options specific to that child model type, similar to creating that model directly. For a list of supported options of a child model, see the individual descriptions of model types. |

- `tasktype` — Specifies the task type, with values `'classification'` or `'regression'`.
- `learningrate` — A decimal value between `0.0` and `1.0` that tunes how much the model learns from each successive child. Lower values can lead to better generalization, but they require more trees (a higher `count` value in the `basemodels` JSON field).

**Optional**

- `lossfunction` — A string value that determines how the model calculates prediction errors and what type of problem it solves. Accepted values are:
  - `'squared_error'` — Configures the model for regression tasks and calculates errors as the squared difference between predicted and actual values. Use this value for predicting continuous numeric values (e.g., prices, temperatures, quantities). The target column must contain numeric values. The `lossfunction` option defaults to `squared_error` if you set the `tasktype` option to `regression`.
  - `'log_loss'` — Configures the model for classification tasks. This model uses logistic loss to calculate prediction errors for probability-based predictions. The `lossfunction` option defaults to `log_loss` if you set the `tasktype` option to `classification`.
- `fractionselected` — A double value specifying the proportion of rows (0.0 < n <= 1.0) to use for each child model. The default value is `1.0` (use all rows).
- `inputsperchild` — An integer value specifying the number of features (columns) to use for each child model. When you set this option, the algorithm deterministically cycles through feature subsets. If you do not set this option, the model uses all available features for each child.
- `maxthreads` —
An optional integer value that controls the maximum number of child models that the model trains in parallel. This option does not affect the internal threading of each child. If supported, the `maxthreads` value of each child model controls the internal threading. The default value is `16`.
- `metrics` — An optional Boolean value. If you set this option to `true`, the model calculates and stores quality metrics on the training data after training completes. The default value is `false`.

  For classification, the quality metrics include:
  - Accuracy — The percentage of training rows where the model predicted the correct class.
  - Area under the curve (AUC) — A score from 0 to 1 that measures how well the model distinguishes between two different categories.

  For regression, the quality metrics include:
  - Root mean squared error (RMSE) — The square root of the average squared difference between the predicted and actual values.
  - Adjusted R² — The coefficient of determination, adjusted for the number of predictors (features) in the model. This metric indicates how well the independent variables explain the variance in the dependent variable.

- `skipdroptable` — An optional Boolean value. If you set this option to `false`, the database deletes intermediate tables created during training. If you set this option to `true`, the database prevents deletion of intermediate tables (useful for debugging). The default value is `false`.
- `continuousfeatures` — An optional comma-separated list of feature indexes (1-based) that are continuous numeric variables (i.e., not categorical or discrete variables).
- `featurearray` — An optional Boolean value. If you set this option to `true`, the model expects only one array-type column as input, rather than multiple columns of training data. Each array row in the input column must be the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.

### Execute the Model

This example creates a boosting model (similar to gradient-boosted trees, but using generic boosting logic) that trains 100 decision trees sequentially for a regression task.

```sql
CREATE MLMODEL my_schema.boosting_regressor TYPE boosting OPTIONS (
    'tasktype' = 'regression',
    'basemodels' = '[ { "type": "decision_tree", "count": 100,
                        "options": { "maxdepth": 4 } } ]',
    'learningrate' = '0.1',
    'lossfunction' = 'squared_error',
    'metrics' = 'true'
) AS SELECT feature1, feature2, feature3, label FROM my_schema.training_data;
```

After training completes, execute the boosting model:

```sql
SELECT my_schema.boosting_regressor(feature1, feature2, feature3) FROM my_schema.scoring_data;
```

## Stacking

Model type: `stacking`

The stacking model combines multiple base models into a single, higher-accuracy ensemble. A stacking model can use any other machine learning model that supports regression, classification, or clustering, and can produce better accuracy than a single model by incorporating the strengths of different models and feature subsets.

Stacking trains two model levels:

- **Level-zero models** — One or more base models that train on the original features.
- **Level-one model (meta-model)** — A single model that trains on the predictions of the level-zero models (plus any preserved features) and learns how to weight and combine them.

A stacking model cannot use the vector autoregression, feedforward neural network, or association rules models as level-zero or level-one models.

### Model Options

**Required**

The stacking model utilizes other models embedded as options in JSON format (see `levelzeromodels` and `levelonemodel`). Both are required.

- `levelzeromodels` — An array of JSON objects representing one or more base models. Internally, the system maps these models using the same strings as other machine learning model types. Any stacked models have the same features, requirements, and options as they would without stacking.
- `levelonemodel` — A JSON object that defines the single meta-model that sits on top of all level-zero models in a stacking ensemble. The object
specifies the model type, name, and any model-specific options for this level-one model, which is trained on the outputs of the level-zero models (and any preserved input columns) to produce the final prediction.

JSON strings for both `levelzeromodels` and `levelonemodel` share these arguments, unless noted otherwise in the description.

| Argument | Data Type | Description |
| --- | --- | --- |
| `type` | string | The machine learning model type to use for this child (for example, `"random_forest"`, `"logistic_regression"`, `"kmeans"`). Must be a valid OcientML model type that supports regression, classification, or clustering. |
| `name` | string | Optional. A name for the child model. If you omit this argument, the system auto-generates a name by attaching a suffix to the parent model name. |
| `options` | JSON object | A JSON object containing options specific to the chosen model type, exactly as if you were creating that model directly. For a list of options, see the individual descriptions of model types. |
| `ignorecolumns` | string | Optional. Used only for `levelzeromodels`. The value is a comma-separated list of 1-based column indices to exclude from this level-zero model's training input. If you omit this argument, the model sees all available features, labels, and extra columns. |
| `extracallarguments` | string | Advanced, optional argument. An array of argument lists: each element is an array of strings representing extra literal arguments appended when the system executes this child model in the level-one training SQL. Level-zero child models can use any number of argument lists; level-one child models can use only one argument list. |

**Optional**

- `maxthreads` — An optional integer value that controls the maximum number of level-zero child models that the stacking model trains in parallel. This option does not affect the internal threading of each child. If supported, the `maxthreads` value of each child model controls the internal threading. A higher `maxthreads` value can reduce total training time for many level-zero models, but increases concurrent resource usage on the cluster. The default value is `16`.
- `nosnapshot` — An optional Boolean value that controls whether the stacking trainer first materializes the training query into an intermediate snapshot table. When you set the `nosnapshot` option to `false`, the model creates a temporary snapshot table from the input SQL and trains all level-zero and level-one models against that fixed data set, ensuring consistent data even if the source changes. When you set the `nosnapshot` option to `true`, the trainer reads the input query directly without creating a snapshot, which can be faster but requires the underlying data not to change. If the query is a common table expression (i.e., it starts with a `WITH` clause), then stacking automatically forces the `nosnapshot` option value to `false` and logs a warning. The default value is `false`.
- `featurearray` — An optional Boolean value. If you set this option to `true`, the model expects only one array-type column as input, rather than multiple columns of training data. Each array row in the input column must be the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.
- `haslabelcolumn` — Specifies whether the training data includes a label column (necessary for supervised tasks like classification or regression). If you set this option to `true`, the input query must return features followed by a label column. If you set this option to `false`, the model assumes the input contains only features (typically for unsupervised tasks, such as clustering or dimension reduction models). The default value is `true`.
- `extracolumncount` — A non-negative integer representing the number of columns present after the label column (e.g., for weights). Level-zero models do not use these columns as standard features, but you can pass these columns to the level-one model using the `extracallarguments` argument logic. The default value is `0`.
- `preservedcolumnsforlevelone` — A list of integers
representing column indices (1-based) from the input data. The stacking model passes these specified columns to the level-one model. This option is useful if the level-one model needs to access raw features to improve on the base model predictions. If you do not specify this option, the default value is an empty list (i.e., no original features are preserved).

### Execute the Model

This example demonstrates how a stacking model can embed level-zero and level-one child models. The JSON options string includes two level-zero models (random forest and gradient-boosted trees) and produces the final prediction using the level-one model (logistic regression). Each model specifies its own options for how to run.

In the example, the random forest model (`rf_base`) uses 200 decision trees. This model explicitly ignores columns 3 and 4, training only on columns 1 and 2. In contrast, the gradient-boosted trees model (`gbt_base`) uses 300 decision trees and all available columns.

The level-one model uses logistic regression to combine the outputs of the two base models. It takes the predictions from `rf_base` and `gbt_base` as its inputs and produces the final classification result.

When you run the stacking model, it first executes the base models in parallel (up to the `maxthreads` value). The resulting predictions then go into the logistic regression meta-model to generate the final score.

```sql
CREATE MLMODEL my_schema.stacking_classifier TYPE stacking OPTIONS (
    -- Level-0 (base) models
    'levelzeromodels' = '[
        { "type": "random_forest", "name": "rf_base",
          "options": { "numtrees": 200, "maxdepth": 12 },
          "ignorecolumns": "3,4" },
        { "type": "gradient_boosted_trees", "name": "gbt_base",
          "options": { "numchildren": 300, "learningrate": 0.05 } } ]',
    -- Level-1 (meta) model
    'levelonemodel' = '{ "type": "logistic_regression", "name": "stacking_meta",
                         "options": { "maxiterations": 200, "regularization": 0.1 } }',
    -- Top-level stacking options
    'maxthreads' = '16',
    'haslabelcolumn' = 'true'
) AS SELECT feature1, feature2, feature3, feature4, label FROM my_schema.training_data;
```

After training completes, execute the stacking model:

```sql
SELECT my_schema.stacking_classifier(feature1, feature2, feature3, feature4) FROM my_schema.scoring_data;
```

## Related Links

- Regression Models
- Classification Models
- Clustering and Dimension Reduction Models
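For intuition about what the three ensemble types above do with their child-model predictions, the following is a minimal, database-independent sketch in plain Python. It is not OcientML code, and all function names are illustrative: it shows only the aggregation step of each technique — majority voting or averaging for bagging, learning-rate-weighted sequential contributions for boosting, and a meta-model consuming level-zero outputs for stacking.

```python
from collections import Counter
from statistics import mean

# --- Bagging: children train independently; predictions are aggregated. ---

def bagging_classify(child_predictions):
    """Majority vote over the class labels predicted by each child."""
    return Counter(child_predictions).most_common(1)[0][0]

def bagging_regress(child_predictions):
    """Average of the numeric predictions of each child."""
    return mean(child_predictions)

# --- Boosting: children contribute sequentially, scaled by a learning rate. ---

def boosting_predict(initial_prediction, child_outputs, learning_rate):
    """Each child's output (a correction learned from the running ensemble's
    errors) is added to the prediction, weighted by the learning rate."""
    prediction = initial_prediction
    for child_output in child_outputs:
        prediction += learning_rate * child_output
    return prediction

# --- Stacking: level-zero outputs become the level-one model's features. ---

def stacking_predict(level_zero_models, level_one_model, row):
    """level_zero_models: callables mapping a feature row to a prediction;
    level_one_model: a callable consuming the list of those predictions."""
    meta_features = [model(row) for model in level_zero_models]
    return level_one_model(meta_features)

if __name__ == "__main__":
    # Bagging: 3 of 5 children vote "spam", so the ensemble predicts "spam".
    print(bagging_classify(["spam", "ham", "spam", "spam", "ham"]))  # spam
    print(bagging_regress([10.0, 12.0, 14.0]))                       # 12.0

    # Boosting: start at 0.0 and apply two corrections of 10.0 at rate 0.1.
    print(boosting_predict(0.0, [10.0, 10.0], 0.1))                  # 2.0

    # Stacking: two toy level-zero "models" and an averaging level-one model.
    level_zero = [lambda row: row[0] + row[1], lambda row: row[0] * row[1]]
    print(stacking_predict(level_zero, mean, [2.0, 3.0]))            # 5.5
```

The sketch mirrors the option semantics described above: a lower `learningrate` shrinks each child's contribution in `boosting_predict` (hence the need for a higher `count`), and `stacking_predict` shows why level-zero predictions, not raw features, are the default input to the level-one model.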