# Ensemble Models
Ensemble models improve predictive performance by combining the strengths of multiple individual machine learning models into a single, robust system. Instead of relying on one algorithm, ensemble techniques train a collection of base models and aggregate their predictions to reduce errors, lower variance, and prevent overfitting.

## Bagging

Model type: `bagging`

A bagging model trains multiple base models in parallel on random subsets of the training data and aggregates their predictions. Use the bagging model to reduce variance and prevent overfitting.

The bagging model utilizes child models embedded as options in JSON format (see `basemodels`). The bagging model trains all child models independently in parallel. When aggregating the child model predictions, the bagging model uses majority voting for classification and averaging for regression. A bagging model can use any model type that supports classification or regression.

### Model Options

**Required**

- `basemodels` — A JSON array defining the child models to train. Each object in the array specifies a model type, a count (how many of this model to create), and any options specific to the individual child model.

  **`basemodels` JSON Arguments**

  | Argument | Data Type | Description |
  | --- | --- | --- |
  | `type` | string | The model type (e.g., `"decision_tree"`, `"multiple_linear_regression"`). The selected model type should comply with the selected `tasktype` (i.e., select a classification model if your `tasktype` is `classification`). |
  | `count` | integer | The number of instances of this model type to train. |
  | `options` | JSON object | A JSON object containing options specific to that child model type, exactly as if you were creating that model directly. For a list of supported options, see the descriptions of model types. |

- `tasktype` — Specifies the task type, with values `'classification'` or `'regression'`.

**Optional**

- `maxthreads` — An optional integer value that controls the maximum number of child models that the bagging model trains in parallel. This option does not affect the internal threading of each child. If supported, the `maxthreads` value of each child model controls the internal threading. The default value is `16`.
- `bootstrap` — An optional Boolean value. If you set this option to `true`, the model uses bootstrap sampling with replacement: each child trains on a random sample where rows can be repeated. This setting requires you to set the `nosnapshot` option to `false`. If you set this option to `false`, the model uses sampling without replacement (each row appears at most one time per child). The default value is `false`.
- `rowsperchild` — An optional integer specifying the exact number of rows to use for each child model. `0` means use all available rows (only valid if you do not set the `fractionselected` option). The default value is `0`.
- `fractionselected` — An optional double value between `0.0` and `1.0` that specifies the proportion of rows to use for each child model. This option is unusable if you set the `rowsperchild` option to a value greater than `0`. The default value is `1.0` (use all rows).
- `inputsperchild` — An optional integer that specifies the number of features (columns) to use for each child model. The default value is `ceil(total features / 3)`.
- `metrics` — An optional Boolean value. If you set this option to `true`, the model calculates and stores quality metrics on the training data after training completes. The default value is `false`.

  For classification, the quality metrics include:
  - Accuracy — The percentage of training rows where the model predicted the correct class.
  - Area under the curve (AUC) — A score from 0 to 1 that measures how well the model distinguishes between two different categories.

  For regression, the quality metrics include:
  - Root mean squared error (RMSE) — The square root of the average squared difference between the predicted and actual values.
  - Adjusted R² — The coefficient of determination, adjusted for the number of predictors (features) in the model. This metric indicates how well the independent variables explain the variance in the dependent variable.

- `featurearray` — an optional
Boolean value. If you set this option to `true`, the model expects only one array-type column as input, rather than multiple columns of training data. Each array row in the input column must be the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.

### Execute the Model

This example creates a bagging model that, in effect, functions as a random forest by training 50 decision trees for a classification task.

```sql
CREATE MLMODEL my_schema.bagging_classifier TYPE bagging OPTIONS (
    'tasktype' = 'classification',
    'basemodels' = '[ { "type": "decision_tree", "count": 50,
                        "options": { "maxdepth": 8, "splitmetric": "gini_impurity" } } ]',
    'bootstrap' = 'true',
    'fractionselected' = '0.8',
    'maxthreads' = '16',
    'metrics' = 'true'
) AS SELECT feature1, feature2, feature3, label FROM my_schema.training_data;
```

After the training completes, execute the bagging model:

```sql
SELECT my_schema.bagging_classifier(feature1, feature2, feature3) FROM my_schema.scoring_data;
```

## Boosting

Model type: `boosting`

A boosting model is an ensemble technique that trains multiple base models sequentially, where each new model attempts to correct the errors of the combined previous models. Boosting reduces bias and errors in supervised learning tasks.

The boosting model utilizes child models specified as options in JSON format (see the `basemodels` argument). The boosting model trains its child models one after another, with each child contributing to the final prediction, weighted by a set learning rate. A boosting model can use any model type that supports regression.

### Model Options

**Required**

- `basemodels` — A JSON array describing the child models to train. Each object in the array specifies a model type, the number of instances of this model to create, and other options that depend on the model type.

  **`basemodels` JSON Arguments**

  | Argument | Data Type | Description |
  | --- | --- | --- |
  | `type` | string | The OcientML model type (e.g., `"decision_tree"`, `"multiple_linear_regression"`). The selected model type should comply with the selected `tasktype` (i.e., select a classification model if your `tasktype` is `classification`). |
  | `count` | integer | The number of instances of this model type to train sequentially. |
  | `options` | JSON object | A JSON object containing options specific to that child model type, similar to creating that model directly. For a list of supported options of a child model, see the individual descriptions of model types. |

- `tasktype` — Specifies the task type, with values `'classification'` or `'regression'`.
- `learningrate` — A decimal value between `0.0` and `1.0` that tunes how much the model learns from each successive child. Lower values can lead to better generalization, but they require more trees (a higher `count` value in the `basemodels` JSON field).

**Optional**

- `lossfunction` — A string value that determines how the model calculates prediction errors and what type of problem it solves. Accepted values are:
  - `'squared_error'` — Configures the model for regression tasks and calculates errors as the squared difference between predicted and actual values. Use this value for predicting continuous numeric values (e.g., prices, temperatures, quantities). The target column must contain numeric values. The `lossfunction` option defaults to `squared_error` if you set the `tasktype` option to `regression`.
  - `'log_loss'` — Configures the model for classification tasks. This model uses logistic loss to calculate prediction errors for probability-based predictions. The `lossfunction` option defaults to `log_loss` if you set the `tasktype` option to `classification`.
- `fractionselected` — A double value specifying the proportion of rows (0.0 < n <= 1.0) to use for each child model. The default value is `1.0` (use all rows).
- `inputsperchild` — An integer value specifying the number of features (columns) to use for each child model. When you set this option, the algorithm deterministically cycles through feature subsets. If you do not set this option, the model uses all available features for each child.
- `maxthreads` —
An optional integer value that controls the maximum number of child models that the model trains in parallel. This option does not affect the internal threading of each child. If supported, the `maxthreads` value of each child model controls the internal threading. The default value is `16`.
- `metrics` — An optional Boolean value. If you set this option to `true`, the model calculates and stores quality metrics on the training data after training completes. The default value is `false`.

  For classification, the quality metrics include:
  - Accuracy — The percentage of training rows where the model predicted the correct class.
  - Area under the curve (AUC) — A score from 0 to 1 that measures how well the model distinguishes between two different categories.

  For regression, the quality metrics include:
  - Root mean squared error (RMSE) — The square root of the average squared difference between the predicted and actual values.
  - Adjusted R² — The coefficient of determination, adjusted for the number of predictors (features) in the model. This metric indicates how well the independent variables explain the variance in the dependent variable.

- `skipdroptable` — An optional Boolean value. If you set this option to `false`, the database deletes intermediate tables created during training. If you set this option to `true`, the database prevents deletion of intermediate tables (useful for debugging). The default value is `false`.
- `continuousfeatures` — An optional comma-separated list of feature indexes (1-based) that are continuous numeric variables (i.e., not categorical or discrete variables).
- `featurearray` — An optional Boolean value. If you set this option to `true`, the model expects only one array-type column as input, rather than multiple columns of training data. Each array row in the input column must be the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.

### Execute the Model

This example creates a boosting model (similar to gradient-boosted trees, but using generic boosting logic) that trains 100 decision trees sequentially for a regression task.

```sql
CREATE MLMODEL my_schema.boosting_regressor TYPE boosting OPTIONS (
    'tasktype' = 'regression',
    'basemodels' = '[ { "type": "decision_tree", "count": 100,
                        "options": { "maxdepth": 4 } } ]',
    'learningrate' = '0.1',
    'lossfunction' = 'squared_error',
    'metrics' = 'true'
) AS SELECT feature1, feature2, feature3, label FROM my_schema.training_data;
```

After training completes, execute the boosting model:

```sql
SELECT my_schema.boosting_regressor(feature1, feature2, feature3) FROM my_schema.scoring_data;
```

## Stacking

Model type: `stacking`

The stacking model combines multiple base models into a single, higher-accuracy ensemble. A stacking model can use any other machine learning model that supports regression, classification, or clustering, and can produce better accuracy than a single model by incorporating the strengths of different models and feature subsets.

Stacking trains two model levels:

- **Level-zero models** — One or more base models that train on the original features.
- **Level-one model (meta-model)** — A single model that trains on the predictions of the level-zero models (plus any preserved features) and learns how to weight and combine them.

A stacking model cannot use the vector autoregression, feedforward neural network, or association rules models as level-zero or level-one models.

### Model Options

**Required**

The stacking model utilizes other models embedded as options in JSON format (see `levelzeromodels` and `levelonemodel`). Both are required.

- `levelzeromodels` — An array of JSON objects representing one or more base models. Internally, the system maps these models using the same strings as other machine learning model types. Any stacked models have the same features, requirements, and options as they would without stacking.
- `levelonemodel` — A JSON object that defines the single meta-model that sits on top of all level-zero models in a stacking ensemble. The object
specifies the model type, name, and any model-specific options for this level-one model, which is trained on the outputs of the level-zero models (and any preserved input columns) to produce the final prediction.

JSON strings for both `levelzeromodels` and `levelonemodel` share these arguments, unless noted otherwise in the description.

| Argument | Data Type | Description |
| --- | --- | --- |
| `type` | string | The machine learning model type to use for this child (for example, `"random_forest"`, `"logistic_regression"`, `"kmeans"`). Must be a valid OcientML model type that supports regression, classification, or clustering. |
| `name` | string | Optional. A name for the child model. If you omit this argument, the system auto-generates a name by attaching a suffix to the parent model name. |
| `options` | JSON object | A JSON object containing options specific to the chosen model type, exactly as if you were creating that model directly. For a list of options, see the individual descriptions of model types. |
| `ignorecolumns` | string | Optional. Used only for `levelzeromodels`. The value is a comma-separated list of 1-based column indices to exclude from this level-zero model's training input. If you omit this argument, the model sees all available features, labels, and extra columns. |
| `extracallarguments` | string | Advanced, optional argument. An array of argument lists: each element is an array of strings representing extra literal arguments appended when the system executes this child model in the level-one training SQL. Level-zero child models can use any number of argument lists; level-one child models can use only one argument list. |

**Optional**

- `maxthreads` — An optional integer value that controls the maximum number of level-zero child models that the stacking model trains in parallel. This option does not affect the internal threading of each child. If supported, the `maxthreads` value of each child model controls the internal threading. A higher `maxthreads` value can reduce total training time for many level-zero models, but increases concurrent resource usage on the cluster. The default value is `16`.
- `nosnapshot` — An optional Boolean value that controls whether the stacking trainer first materializes the training query into an intermediate snapshot table. When you set the `nosnapshot` option to `false`, the model creates a temporary snapshot table from the input SQL and trains all level-zero and level-one models against that fixed data set, ensuring consistent data even if the source changes. When you set the `nosnapshot` option to `true`, the trainer reads the input query directly without creating a snapshot, which can be faster but requires the underlying data not to change. If the query is a common table expression (i.e., it starts with a `WITH` clause), then stacking automatically forces the `nosnapshot` option value to `false` and logs a warning. The default value is `false`.
- `featurearray` — An optional Boolean value. If you set this option to `true`, the model expects only one array-type column as input, rather than multiple columns of training data. Each array row in the input column must be the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.
- `haslabelcolumn` — Specifies whether the training data includes a label column (necessary for supervised tasks like classification or regression). If you set this option to `true`, the input query must return features followed by a label column. If you set this option to `false`, the model assumes the input contains only features (typically for unsupervised tasks, such as clustering or dimension reduction models). The default value is `true`.
- `extracolumncount` — A non-negative integer representing the number of columns present after the label column (e.g., for weights). Level-zero models do not use these columns as standard features, but you can pass these columns to the level-one model using the `extracallarguments` argument logic. The default value is `0`.
- `preservedcolumnsforlevelone` — A list of integers
representing column indices (1-based) from the input data. The stacking model passes these specified columns to the level-one model. This option is useful if the level-one model needs to access raw features to improve on the base model predictions. If you do not specify this option, the default value is an empty list (i.e., no original features are preserved).

### Execute the Model

This example demonstrates how a stacking model can embed level-zero and level-one child models. The JSON options string includes two level-zero models (random forest and gradient-boosted trees) and produces the final prediction using the level-one model (logistic regression). Each model specifies its own options for how to run.

In the example, the random forest model (`rf_base`) uses 200 decision trees. This model explicitly ignores columns 3 and 4, training only on columns 1 and 2. In contrast, the gradient-boosted trees model (`gbt_base`) uses 300 decision trees and all available columns.

The level-one model uses logistic regression to combine the outputs of the two base models. It takes the predictions from `rf_base` and `gbt_base` as its inputs and produces the final classification result.

When you run the stacking model, it first executes the base models in parallel (up to the `maxthreads` value). The resulting predictions then go into the logistic regression meta-model to generate the final score.

```sql
CREATE MLMODEL my_schema.stacking_classifier TYPE stacking OPTIONS (
    -- Level-0 (base) models
    'levelzeromodels' = '[
        { "type": "random_forest", "name": "rf_base",
          "options": { "numtrees": 200, "maxdepth": 12 },
          "ignorecolumns": "3,4" },
        { "type": "gradient_boosted_trees", "name": "gbt_base",
          "options": { "numchildren": 300, "learningrate": 0.05 } } ]',
    -- Level-1 (meta) model
    'levelonemodel' = '{ "type": "logistic_regression", "name": "stacking_meta",
                         "options": { "maxiterations": 200, "regularization": 0.1 } }',
    -- Top-level stacking options
    'maxthreads' = '16',
    'haslabelcolumn' = 'true'
) AS SELECT feature1, feature2, feature3, feature4, label FROM my_schema.training_data;
```

After training completes, execute the stacking model:

```sql
SELECT my_schema.stacking_classifier(feature1, feature2, feature3, feature4) FROM my_schema.scoring_data;
```

## Related Links

- Regression Models
- Classification Models
- Clustering and Dimension Reduction Models
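For intuition about what the three ensemble types above do with their child-model predictions, the following is a minimal, database-independent sketch in plain Python. It is not OcientML code, and all function names are illustrative: it shows only the aggregation step of each technique — majority voting or averaging for bagging, learning-rate-weighted sequential contributions for boosting, and a meta-model consuming level-zero outputs for stacking.

```python
from collections import Counter
from statistics import mean

# --- Bagging: children train independently; predictions are aggregated. ---

def bagging_classify(child_predictions):
    """Majority vote over the class labels predicted by each child."""
    return Counter(child_predictions).most_common(1)[0][0]

def bagging_regress(child_predictions):
    """Average of the numeric predictions of each child."""
    return mean(child_predictions)

# --- Boosting: children contribute sequentially, scaled by a learning rate. ---

def boosting_predict(initial_prediction, child_outputs, learning_rate):
    """Each child's output (a correction learned from the running ensemble's
    errors) is added to the prediction, weighted by the learning rate."""
    prediction = initial_prediction
    for child_output in child_outputs:
        prediction += learning_rate * child_output
    return prediction

# --- Stacking: level-zero outputs become the level-one model's features. ---

def stacking_predict(level_zero_models, level_one_model, row):
    """level_zero_models: callables mapping a feature row to a prediction;
    level_one_model: a callable consuming the list of those predictions."""
    meta_features = [model(row) for model in level_zero_models]
    return level_one_model(meta_features)

if __name__ == "__main__":
    # Bagging: 3 of 5 children vote "spam", so the ensemble predicts "spam".
    print(bagging_classify(["spam", "ham", "spam", "spam", "ham"]))  # spam
    print(bagging_regress([10.0, 12.0, 14.0]))                       # 12.0

    # Boosting: start at 0.0 and apply two corrections of 10.0 at rate 0.1.
    print(boosting_predict(0.0, [10.0, 10.0], 0.1))                  # 2.0

    # Stacking: two toy level-zero "models" and an averaging level-one model.
    level_zero = [lambda row: row[0] + row[1], lambda row: row[0] * row[1]]
    print(stacking_predict(level_zero, mean, [2.0, 3.0]))            # 5.5
```

The sketch mirrors the option semantics described above: a lower `learningrate` shrinks each child's contribution in `boosting_predict` (hence the need for a higher `count`), and `stacking_predict` shows why level-zero predictions, not raw features, are the default input to the level-one model.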