The database supports classification models that involve understanding and grouping large data sets into preset categories or subpopulations. With the help of pre-classified training data sets, machine learning classification models leverage various algorithms to classify future data sets into respective and relevant categories. To create the model, use the CREATE MLMODEL syntax. For details, see CREATE MLMODEL.
Model option names are case-sensitive.
K-Nearest Neighbors Classification
Model Type: K NEAREST NEIGHBORS
K-nearest neighbors (KNN) is a classification algorithm, where the first N - 1 inputs are the features, which must be numeric. The last input column is a label, which can be any data type.
There is no training step for KNN. Instead, when you create the model, the model saves a copy of all input data to a table, so that when the model is executed in a later SQL statement, a snapshot of the data the model is supposed to use is available. You can override both the weight function and the distance function.
The system validates the weight and distance options at prediction time. If the values are invalid, the model throws an error at that point.
Model Options
Required
k — This option must be a positive integer that specifies how many closest points to use for classifying a new point.
Optional
distance — If you specify this option, the value must be a function in SQL syntax for calculating the distance between a point used for classification and points in the training data set. This function should use the variables x1, x2, … for the 1st, 2nd, … features in the training data set, and p1, p2, … for the features in the point for classification. If you do not specify this option, the option defaults to the Euclidean distance.
metrics — If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. The default value is false.
normalize — If you set this option to true, the model automatically computes the mean and standard deviation of each feature and uses them to normalize the data during training. The default value is true.
weight — If you specify this option, the value must be a function in SQL syntax for calculating the weight of a neighbor. The function should use the variable d for distance. By default, the weight is set to 1.0/(d+0.1), which avoids division by zero on exact matches while still allowing other neighbors to have some influence.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
Execute the Model
Create a KNN model with 8 closest points for classification and the distance function power(x1 - p1, 2) + power(x2 - p2, 2) + power(x3 - p3, 2).
SQL
After you create the model, you can find details about it in the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model, the model executes with N - 1 features as input and returns a label. The model chooses the label from the class with the highest score. The model scores classes by summing the weights from the nearest k points in the training data.
SQL
Details specific to this model type are in the sys.k_nearest_neighbors_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in the System Catalog.
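The scoring rule described above (sum the neighbor weights per class over the k nearest points, then pick the class with the highest score) can be sketched outside the database. This Python sketch is an illustration only, not the database's implementation. It assumes plain Euclidean distance, skips normalization, and uses the default weight 1.0/(d+0.1). The function name knn_classify and the sample rows are hypothetical.

```python
import math
from collections import defaultdict

def knn_classify(training_rows, point, k):
    """Return the label whose k nearest neighbors have the largest summed weight.

    training_rows: list of (features, label) pairs; features is a list of numbers.
    point: list of numbers to classify.
    """
    # Euclidean distance from the point to every training row.
    scored = []
    for features, label in training_rows:
        d = math.sqrt(sum((x - p) ** 2 for x, p in zip(features, point)))
        scored.append((d, label))

    # Keep the k closest rows and sum the default weight 1.0 / (d + 0.1) per label.
    scores = defaultdict(float)
    for d, label in sorted(scored, key=lambda pair: pair[0])[:k]:
        scores[label] += 1.0 / (d + 0.1)

    # The predicted label is the class with the highest total weight.
    return max(scores, key=scores.get)

# Hypothetical training data: two numeric features and a string label.
rows = [
    ([0.0, 0.0], "a"),
    ([0.0, 1.0], "a"),
    ([5.0, 5.0], "b"),
    ([6.0, 5.0], "b"),
    ([5.0, 6.0], "b"),
]
print(knn_classify(rows, [0.5, 0.5], k=3))
```

With k set to 3, the point (0.5, 0.5) has two close neighbors labeled a and one distant neighbor labeled b, so the summed weight for a dominates.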
Naive Bayes Classification
Model Type: NAIVE BAYES
Naive Bayes is a classification algorithm. The input is N - 1 feature columns, and the last column is a label column. All columns can be any data type. The label column must be discrete. The feature columns can be discrete or continuous. When you use continuous feature columns, you must specify which columns are continuous (see options).
Naive Bayes works by assuming that all features are equally important in the classification and that there is no correlation between features. With those assumptions, the algorithm computes all frequency information and saves it in three tables that you create using SQL SELECT statements.
Model Options
Optional
metrics — If you set this option to true, the model calculates the percentage of samples correctly classified by the model and saves this information in a catalog table. This option defaults to false.
continuousFeatures — If you specify this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start at 1.
normalize — If you set this option to true, the model automatically computes the mean and standard deviation of each feature and uses them to normalize the data during training. The default value is true.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
Execute the Model
Create a Naive Bayes model with feature indexes 1,3.
SQL
After you create the model, you can find details about it in the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model, you specify N - 1 feature input arguments, and the model returns the most likely class. The returned class is based on computing the class with the highest probability, given prior knowledge of the feature values. In other words, the class y has the highest value of P(y | x1, x2, …, xn).
SQL
Details specific to this model type are in the sys.naive_bayes_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
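The prediction rule above, choosing the class y with the highest P(y | x1, x2, …, xn), can be illustrated with frequency counts. This Python sketch is a simplified illustration under stated assumptions, not the database's implementation: it handles only discrete features, applies no smoothing, and keeps counts in memory instead of tables. The function names and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows):
    """Count label frequencies and per-feature value frequencies within each label.

    rows: tuples whose last element is the label and the rest are features.
    """
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (feature index, label) -> Counter of values
    for *features, label in rows:
        label_counts[label] += 1
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
    return label_counts, value_counts

def predict_naive_bayes(model, features):
    """Return the label y that maximizes P(y) * product of P(x_i | y)."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -1.0
    for label, count in label_counts.items():
        score = count / total  # prior P(y)
        for i, value in enumerate(features):
            # Conditional frequency P(x_i | y); features are treated as independent.
            score *= value_counts[(i, label)][value] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: two discrete features followed by the label.
rows = [
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("rainy", "mild", "yes"),
    ("rainy", "cool", "yes"),
    ("sunny", "cool", "yes"),
]
model = train_naive_bayes(rows)
print(predict_naive_bayes(model, ("sunny", "hot")))
print(predict_naive_bayes(model, ("rainy", "cool")))
```

Because P(y | x) is proportional to P(y) times the product of the P(x_i | y) terms, comparing these products across labels is enough to find the most likely class.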
Decision Tree
Model Type: DECISION TREE
The decision tree is a classification model. The first N - 1 input columns are features and can be any data type. All non-numeric features must be discrete and contain no more than the configured distinctCountLimit number of unique values. This limit is in place to prevent the internal model representation from growing too large. Numeric features are discrete by default and have the same limitation on the number of unique values, but they can be marked as continuous with the continuousFeatures option. For continuous features, the model builds the decision tree by dividing the values into two ranges instead of using discrete, unique values. The last input column is the label and can be any data type.
When you create the model, you specify all features first, and then specify the label as the last column in the result set.
You can use secondary indexes on discrete feature columns to greatly speed up the training of a decision tree model.
The database also supports a similar tree model for regression. For details, see Regression Tree.
Model Options
Optional
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1. In the default state, the model considers no features as continuous.
distinctCountLimit — If you set this option, the value must be a positive integer. This value sets the limit for how many distinct values a non-continuous feature and the label can contain. This option defaults to 256.
doPrune — If you set this option to true, the model uses Pessimistic Error Pruning (PEP) to prune the tree after training. This option defaults to false.
featureArrayElements — If you set this option, the value must be a comma-separated list of the feature indexes. If you set the featureArray option to true, this list determines the elements of the arrays to use for training. By default, the model uses all elements.
maxCellsToFetch — If you set this option, the value must be a positive integer. Controls the chunking behavior when fetching feature values during model training. The limit represents the maximum number of data cells (calculated as number of columns × number of rows) that can be fetched in a single operation, not a byte limit. When the expected data size exceeds this threshold, the algorithm switches to database-based processing using SQL queries instead of in-memory processing. This value defaults to 33,554,432 cells (calculated as 32 * 1024 * 1024).
maxDepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of the decision tree (the maximum number of features to split on). The default is unspecified, which means there is no maximum depth.
maxRows — If you set this option, the value must be a positive integer. This option limits the number of rows used for model training by creating a snapshot table with only the specified number of rows from the input query. This option cannot be used with noSnapshot -> true (attempting to set both results in an invalid argument error during model creation). When this option is unspecified, the model trains using all rows from the input query.
maxThreads — If you set this option, the value must be a positive integer. This value indicates the maximum number of parallel threads to use while the model trains. The default value is 2.
metrics — If you set this option to true, the model also calculates the percentage of samples correctly classified by the model and saves this information in a catalog table. This option defaults to false.
noSnapshot — If you set this option to true, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training. This option defaults to false. In this case, the database creates and uses the intermediate table. Setting this option to true is useful when the training set is fixed. If the training set is a table with modifications, set this option to false as the decision tree trainer uses different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, then set this option to false because there is no guarantee that running that query twice generates the same 100 rows each time.
numSplits — If you set this option, the value must be an integer greater than 1. This value sets the maximum number of binary branches a continuous feature can consider. The default value is 32.
ROCNumSamples — If you set the option, you must also set the metrics option. This positive integer indicates the number of samples for the model to use for the area under the ROC curve. The default value is 10.
skipLimitCheck — If you set this option to true, the model skips cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one that is specified by the distinctCountLimit option. This option defaults to false.
weighted — If you set this option, the model considers weights for labels. If you set this option value to true, you must specify an additional column as a double in the training data for label weights. Rows with the same labels must have the same weights. If you set this value to auto, the model calculates weights automatically by weighting each label according to the ratio of the count of the most frequent label to the count of the specified label. As a result, the most frequent label has a weight of 1.0, and the other label weights are higher. This option defaults to false, which means all labels have equal weight.
splitMetric — Controls which function is used to evaluate the quality of a split during tree construction. Supported options are:
gini_impurity (default) — Measures impurity based on class distributions.
weighted_f1 — Uses class-frequency–weighted F1 score to guide splits.
enableResplits — A Boolean value that determines if the tree can reuse the same continuous feature multiple times along a single branch (e.g., split on x1 < 7 and later x1 < 3). This can capture more complex, range-specific relationships. If unspecified, this value defaults to true, meaning continuous features remain available for additional splits after use, allowing the tree to create more complex decision boundaries. When set to false, the model marks continuous features as exhausted after their first use, and they cannot be used again in subsequent splits in the same tree.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
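The auto setting of the weighted option described above reduces to a small calculation: each label's weight is the count of the most frequent label divided by the count of that label. A minimal Python sketch of that arithmetic follows. The function name auto_label_weights and the sample labels are hypothetical, and the sketch is not the database's implementation.

```python
from collections import Counter

def auto_label_weights(labels):
    """Weight each label by (count of most frequent label) / (count of this label)."""
    counts = Counter(labels)
    most_frequent = max(counts.values())
    return {label: most_frequent / count for label, count in counts.items()}

# Hypothetical label column: 6 rows of "a", 3 rows of "b", 2 rows of "c".
weights = auto_label_weights(["a"] * 6 + ["b"] * 3 + ["c"] * 2)
print(weights)
```

The most frequent label gets a weight of 1.0, and rarer labels get proportionally higher weights.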
Execute the Model
Create a decision tree model with feature indexes 1,3.
SQL
After you create the model, you can find details about it in the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model, you must specify the N - 1 features as parameters. The model returns the expected label.
SQL
Details specific to this model type are in the sys.decision_tree_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
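To make the gini_impurity split metric and the two-range split of a continuous feature more concrete, the following Python sketch scores candidate thresholds on one feature. It uses the standard textbook definition of Gini impurity and a size-weighted average over the two ranges. Whether the database computes the metric in exactly this way is an assumption, and the function names and sample data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity after splitting into value < threshold and the rest."""
    left = [label for v, label in zip(values, labels) if v < threshold]
    right = [label for v, label in zip(values, labels) if v >= threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Hypothetical continuous feature and label column.
values = [1, 2, 3, 8, 9, 10]
labels = ["a", "a", "a", "b", "b", "b"]

print(gini(labels))                         # impurity before any split
print(split_impurity(values, labels, 5))    # threshold that separates the classes
print(split_impurity(values, labels, 2.5))  # threshold that leaves one range mixed
```

A lower value indicates a better split, so a tree builder that follows this metric would prefer the threshold 5 over 2.5 for this data.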
Random Forest
Model Type: RANDOM FOREST
Random forest is a classification model consisting of multiple decision trees. The result consists of the most common label among the tree results and an array of pairs of labels and their frequencies, sorted in descending order by frequency.
The model breaks ties in various ways depending on the type of label. For strings, the model uses reverse lexicographic order (the reverse of the usual alphabetic order), so C comes before A, for example. For Booleans, the model chooses true before false. For numeric types, the model chooses the largest number first.
When you call the model to make classification predictions, you can optionally use soft voting by adding an extra Boolean argument to the statement, e.g., house(x, y, true). This extra argument must be a Boolean literal, either true (soft voting) or false (hard voting). If you do not specify this value, the default value is false (hard voting).
In soft voting, each decision tree reports a list of possible results and their confidence factors. The random forest model adds the confidence factors from the decision trees and normalizes them to add up to 1.0. The model sorts the results by descending confidence, then by descending values.
In hard voting, the model does not account for confidence factors. Each decision tree makes its own class prediction as a vote. The model selects the prediction with the most votes from the trees.
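The difference between the two voting modes can be sketched in a few lines of Python. This is an illustration of the description above, not the database's implementation. It assumes each tree reports a dictionary of labels and confidence factors and that a tree's own vote is its highest-confidence label. The function names and the sample tree outputs are hypothetical.

```python
from collections import Counter, defaultdict

def hard_vote(tree_outputs):
    """Each tree votes for its top label; return (label, votes) sorted by descending votes."""
    votes = Counter(
        max(out, key=lambda label: (out[label], label)) for out in tree_outputs
    )
    # Ties are broken by descending label value.
    return sorted(votes.items(), key=lambda item: (item[1], item[0]), reverse=True)

def soft_vote(tree_outputs):
    """Add confidence factors across trees and normalize them to sum to 1.0."""
    totals = defaultdict(float)
    for out in tree_outputs:
        for label, confidence in out.items():
            totals[label] += confidence
    norm = sum(totals.values())
    ranked = [(label, confidence / norm) for label, confidence in totals.items()]
    # Sort by descending confidence, then by descending label value.
    return sorted(ranked, key=lambda item: (item[1], item[0]), reverse=True)

# Hypothetical outputs from three child decision trees.
trees = [
    {"A": 0.6, "B": 0.4},
    {"A": 0.55, "B": 0.45},
    {"A": 0.1, "B": 0.9},
]
print(hard_vote(trees))
print(soft_vote(trees))
```

In this example, two of the three trees narrowly prefer A, so hard voting selects A, while the one tree that strongly prefers B tips the summed confidence toward B under soft voting.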
Model Options
Required
numChildren — Number of child decision trees.
Optional
bootstrap — If you set this option to true, the model uses bootstrap sampling with replacement, meaning each tree in the random forest is trained on a random subset of the data (either the rowsPerChild or fractionSelected value sets the exact number of rows), and the same row can appear multiple times in the sample for a tree. If you set this option to false, the model samples without replacement, meaning each row can appear at most once per tree. The default value is false.
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1.
distinctCountLimit — If you set this option, the value must be a positive integer. This value limits how many distinct values a non-continuous feature and the label can contain. The default value is 256.
featureSubsetStrategy — If you set this option, the model passes this option directly to the child decision trees to specify how many features each tree should consider at each split from the still-available features. When this value is higher, the model has higher accuracy and lower variance, but it takes longer to train.
You can specify this option either as an integer (e.g., 4, meaning consider up to four features at each split) or as one of three string values:
all — Each tree checks every feature.
sqrt — Each tree checks up to the square root of the total number of features.
one-third — Each tree checks up to one-third of the total number of features.
The default value is all.
fractionSelected — The proportion of rows the model uses to train each decision tree. The value is a double that must be in the interval (0, 1]. You cannot set this option if you also set the rowsPerChild option to a positive value. The default behavior is that the model uses all available rows.
inputsPerChild — Number of features used to create each child decision tree. The default value is the number of features you specify for the forest divided by 3 and rounded up.
maxCellsToFetch — If you set this option, the value must be a positive integer. Controls the chunking behavior when fetching feature values during model training. The limit represents the maximum number of data cells (calculated as number of columns × number of rows) that can be fetched in a single operation, not a byte limit. When the expected data size exceeds this threshold, the algorithm switches to database-based processing using SQL queries instead of in-memory processing. This value defaults to 33,554,432 cells (calculated as 32 * 1024 * 1024).
maxChildThreads — An integer representing the maximum number of threads each child decision tree can use. If you do not specify this option, each child decision tree uses at most one thread, the default for decision trees.
maxDepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of the decision tree. The default value is 3.
maxThreads — The maximum number of parallel threads to use while the model trains decision trees. This value must be a positive integer. The default value is 16.
metrics — If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the model for the random forest and saves this information in a catalog table. This option is always set to false for the child trees.
noSnapshot — If you set this option to true, the data source must not change. In this case, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training a random forest. Child decision trees always have this option set to true, so the database does not create a separate intermediate table for each decision tree. The default value is false. Setting this option to true is useful when the training set is fixed. If the training set is a table with modifications, set this option to false as the decision tree trainer uses different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, then set this option to false because there is no guarantee that running that query twice generates the same 100 rows each time.
requiredFeatures — A comma-separated list of integers as strings representing specific features where the first feature has the value 1. The model uses these features in every decision tree in the forest. The default behavior is that the decision tree in the forest can train on any feature that is in the list.
rowsPerChild — If you set this option to a positive integer, the number represents the number of rows (from a random sample) to use for each decision tree. If you set this option to 0, each child uses all available rows. The default value is 0. You cannot set this option to a positive value if you also set the fractionSelected option.
skipLimitCheck — If you set this option to true, the model skips cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one that is specified by the distinctCountLimit option. The default value is false.
weighted — If you set this option, the model considers weights for labels. If you set this option value to true, you must specify an additional column as a double in the training data for label weights. Rows with the same labels must have the same weights. If you set this value to auto, the model calculates weights automatically by weighting each label according to the ratio of the count of the most frequent label to the count of the specified label. As a result, the most frequent label has a weight of 1.0, and the other label weights are higher. This option defaults to false, which means all labels have equal weight.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
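The effect of the bootstrap option on how rows are drawn for each child tree can be illustrated with Python's standard random module. This sketch only demonstrates sampling with and without replacement. How the database actually draws its samples is not described here, and the function name sample_rows, its parameters, and the sample data are hypothetical.

```python
import random

def sample_rows(rows, rows_per_child, bootstrap, seed=None):
    """Draw the training rows for one child tree."""
    rng = random.Random(seed)
    if bootstrap:
        # With replacement: the same row can be drawn more than once.
        return rng.choices(rows, k=rows_per_child)
    # Without replacement: each row appears at most once.
    return rng.sample(rows, k=rows_per_child)

# Hypothetical data set of 10 rows, identified by number.
rows = list(range(10))
without_replacement = sample_rows(rows, 5, bootstrap=False, seed=1)
with_replacement = sample_rows(rows, 20, bootstrap=True, seed=1)
print(without_replacement)
print(with_replacement)
```

Sampling with replacement can request more rows than the data set contains, which is why the second call can return 20 rows from a 10-row data set.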
Execute the Model
Create a random forest model with four child decision trees and two features randomly chosen by the model for each tree. The training_table table contains the training data set. Collect metrics for the model execution by setting the metrics option to true.
SQL
After you create the model, you can find details about it in the sys.machine_learning_models, sys.machine_learning_model_options, and sys.random_forest_models system catalog tables.
Execute this model using three columns, a, b, and c, from the large_table table that contains the whole data set.
SQL
SQL
To select an element of the result, use the brackets [] with an index of 1, [1], for example, test_model(a,b,c)[1].
After you execute a model, you can find the results in the output of the model function execution.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Logistic Regression
Model Type: LOGISTIC REGRESSION
This model fits a logistic curve to the data across any number of classes greater than one.
The first N - 1 inputs are features and must be numeric. Features can be one-hot encoded. The last input column is the class or label. You must have at least one non-NULL label in the result set for model creation. The model best fits the logistic curve using a negative log likelihood loss function. The model uses an algorithm that is a combination of particle swarm optimization, line search, and genetic algorithms to find the best-fit parameters.
Model Options
Optional
metrics — If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in the sys.logistic_regression_models system catalog table. This option defaults to false.
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. This option defaults to true.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
numEpochs — An INTEGER value representing the maximum number of epochs, or full passes through the entire data set during training. This value must be positive. If you do not specify this value, the default is 20.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
Execute the Model
Create a logistic regression model. Collect metrics for the model execution by setting the metrics option to true.
SQL
After you create the model, you can find details about it in the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute this model after training, you must specify the features as input and the label as the output. The label can be any data type.
SQL
Details specific to this model type are in the sys.logistic_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in the System Catalog.
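The negative log likelihood loss mentioned above can be written out for the simplest case. This Python sketch assumes a two-class problem with labels 0 and 1 and a plain sigmoid, and it only evaluates the loss for given parameters. It does not reproduce the database's optimizer or its handling of more than two classes. The function names, parameters, and sample rows are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic curve that maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def negative_log_likelihood(weights, bias, rows):
    """Sum of -log(probability assigned to the true label) over all rows.

    rows: list of (features, label) pairs with label 0 or 1.
    """
    total = 0.0
    for features, label in rows:
        p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total

# Hypothetical training data: one numeric feature and a binary label.
rows = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]

good_fit = negative_log_likelihood([2.0], -4.0, rows)
bad_fit = negative_log_likelihood([-2.0], 4.0, rows)
print(good_fit, bad_fit)
```

Parameters that place the curve's midpoint between the two classes give a lower loss than parameters that point the curve the wrong way, which is the quantity a fitting procedure tries to minimize.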
Support Vector Machine
Model Type: SUPPORT VECTOR MACHINE
Support Vector Machine (SVM) essentially finds a hypersurface (the hypersurface is a curve in 2-dimensional space) that correctly splits the data into any number of classes greater than one and maximizes the margin around the hypersurface. By default, SVM finds a hyperplane to split the data (the hyperplane is a straight line in 2-dimensional space). SVM uses a hinge loss function to balance the two objectives of finding a hyperplane with a wide margin while minimizing the number of incorrectly classified points.
The first N - 1 input columns are the features and must be numeric. The last column is the label and can be any arbitrary type. You must have at least one non-NULL label in the result set for model creation.
Model Options
Optional
metrics — If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.
regularizationCoefficient — If you specify this option, the value must be a valid floating-point number. This option controls the balance between finding a wide margin and minimizing incorrectly classified points in the loss function. When this value is larger (and positive), it makes having a wide margin around the hypersurface more important relative to the incorrectly classified points. Because of how the database implements SVM, the values for this parameter are likely different from the values used in other common SVM implementations. This option defaults to 1.0 / 1000000.0.
functionN — By default, SVM uses a linear kernel. If you use a different kernel, you must provide a list of functions that are summed together, just like with linear combination regression. You must specify the first function using a key named 'function1'. Subsequent functions must use keys with names that use subsequent values of N. You must specify functions in SQL syntax and use the variables x1, x2, …, xn to refer to the 1st, 2nd, and nth independent variables, respectively. You can specify the default linear kernel as: 'function1' → 'x1', 'function2' → 'x2', and so on. The model always adds a constant term equivalent to 'functionN' → '1.0' that you do not need to specify explicitly.
normalize — If you set this option to true, the model automatically computes the mean and standard deviation of each feature and uses them to normalize the data during training. The default value is true.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
numEpochs — An INTEGER value representing the maximum number of epochs, or full passes through the entire data set during training. This value must be positive. If you do not specify this value, the default is 200.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
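The way the normalize and functionN options combine can be sketched outside the database. The Python below is a minimal illustration under stated assumptions, not the database's implementation: it z-scores each feature, evaluates a list of kernel functions, and sums them with the implicit constant term. The helper names, the weights, the use of the population standard deviation, and the sign-based labeling are all hypothetical choices made for the example.

```python
import math

def zscore_params(rows):
    """Return the per-feature mean and (population) standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return means, stds

def normalize(row, means, stds):
    """Z-score one row; a constant feature (std == 0) maps to 0.0."""
    return [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]

def decision_value(x, functions, weights):
    """Weighted sum of the kernel functions plus the implicit constant term.

    weights has len(functions) + 1 entries; the last one multiplies
    the constant term 1.0 that the model adds automatically.
    """
    terms = [f(x) for f in functions] + [1.0]
    return sum(w * t for w, t in zip(weights, terms))

# Default linear kernel: 'function1' -> 'x1', 'function2' -> 'x2'
linear_kernel = [lambda x: x[0], lambda x: x[1]]

rows = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]  # toy training features
means, stds = zscore_params(rows)

weights = [0.5, -0.25, 0.1]  # hypothetical trained weights
x = normalize([5.0, 50.0], means, stds)
score = decision_value(x, linear_kernel, weights)
label = 1 if score >= 0 else -1  # assumed sign-based class assignment
print(score, label)
```

A non-linear kernel would swap in other functions of the same variables, for example `lambda x: x[0] * x[1]` for a term written as 'x1 * x2' in SQL syntax.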
Execute the Model
Create a support vector machine model with the CREATE MLMODEL syntax. After you create the model, its details are available in the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model, pass the N - 1 features as parameters. The model returns the expected label.
SVM-specific details are available in the sys.support_vector_machine_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in the System Catalog.
Gradient Boosted Trees
Model Type: GRADIENT BOOSTED TREES
Gradient Boosted Trees (GBT) is an ensemble machine learning algorithm that builds a sequence of decision trees where each new tree is trained to correct the prediction errors of the previous trees. Unlike the Random Forest model, which creates trees independently in parallel, GBT uses a sequential approach based on gradient descent optimization, where each tree learns to predict the residual errors (gradients) of the current ensemble. This iterative error-correction process allows the model to capture complex, non-linear patterns by progressively refining predictions through multiple weak learners.
The algorithm supports both regression and classification tasks. The lossFunction option selects which task the model performs. For regression with squared error loss, trees directly predict residual errors. For classification with logistic loss, the model maintains raw scores and transforms them into probabilities through sigmoid or softmax functions.
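The sequential error-correction idea can be illustrated with a small Python sketch. This is a toy version for squared error loss on a single feature, using one-split decision stumps as the weak learners; it is not the database's tree builder. The function names and the sample data are hypothetical, and `num_children` and `learning_rate` only mirror the roles of the numChildren and learningRate options.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    # Dropping the largest value keeps the right-hand side non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gbt(xs, ys, num_children, learning_rate):
    """Build the trees one after another, each fit to the current residuals."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(ys)
    trees = []
    for _ in range(num_children):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Each tree contributes only a learning_rate-sized correction.
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)

def mse(model, xs, ys):
    """Mean squared error of the model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]                    # toy feature values
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]        # toy targets

for n in (0, 1, 20):
    model = fit_gbt(xs, ys, num_children=n, learning_rate=0.1)
    print(n, mse(model, xs, ys))
```

On this toy data, the training error shrinks as more trees are added, which is the effect of pairing a larger tree count with a small learning rate.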
Model Options
Required
numChildren — An INTEGER value representing the total number of trees to build sequentially. Each tree learns to correct the errors of the previous trees.
learningRate — A DECIMAL value between 0.0 and 1.0 that tunes how much the model learns from each successive child.
Optional
lossFunction — A string value that determines how the model calculates prediction errors and what type of problem it solves. Accepted values include:
'squared_error' — Configures the model for regression tasks. Calculates errors as the squared difference between predicted and actual values. Use this for predicting continuous numeric values (e.g., prices, temperatures, quantities). The target column must contain numeric values. This is the default value.
'log_loss' — Configures the model for classification tasks. This model uses logistic loss to calculate prediction errors for probability-based predictions.
fractionSelected — A DECIMAL value greater than 0.0 and less than or equal to 1.0 that specifies what fraction of the training rows to randomly select for each boosting iteration. When you specify a value, the algorithm uses CEIL(fractionSelected * totalRows) rows per iteration. If you do not specify this value, the model uses all rows in their original order.
inputsPerChild — An INTEGER greater than or equal to 1 that specifies the number of input features each boosting tree should use. This value cannot exceed the number of input features available in the data set. When you specify this value, the algorithm deterministically cycles through pre-enumerated feature subsets to ensure each tree uses exactly this many features. When you do not specify this value, the model uses all available features for each tree.
maxDepth — A positive INTEGER value that represents the maximum allowable depth of the decision trees. The default value is 3.
maxThreads — A positive INTEGER value that sets the maximum number of parallel threads to use for training each child decision tree. Parallel threads do not affect the sequential method of training each tree. The default value is 16.
enableResplits — A Boolean value that determines if the tree can reuse the same continuous feature multiple times along a single branch (e.g., split on x1 < 7 and later x1 < 3). This action can capture more complex, range-specific relationships. If you do not specify this value, the default value is true, meaning continuous features remain available for additional splits after use, which allows the tree to create more complex decision boundaries. When you set this value to false, the model marks continuous features as exhausted after their first use, and the model cannot use these features again in subsequent splits in the same tree.
resplitDepth — An INTEGER value that sets the maximum depth at which tree nodes can be re-split during optimization. Controls how deep the algorithm searches for better split points. If you do not specify this value, the default value is 6.
resplitThreshold — A DECIMAL value that sets the minimum improvement threshold required to trigger a re-split operation. Lower values (e.g., 0.01) allow more aggressive re-splitting but can increase training time. Higher values (e.g., 1.0) require larger improvements to trigger re-splits. If you do not specify this value, the default value is 0.1.
maxCellsToFetch — An INTEGER value that determines the memory threshold to switch from training with system memory to training with SQL queries in the database. In-memory training is generally faster, but is limited by the available SQL Node memory. If the size of a training data subset exceeds this value, then the system performs training operations using SQL queries. The default value is 33,554,432 (calculated as 32 * 1024 * 1024).
metrics — A Boolean value. If you set this value to true, the system calculates and stores final model metrics (R²/RMSE for regression or Accuracy/LogLoss for classification) on the training data.
featureArray — A Boolean value. If you set this value to true, the model expects all input features to be in a single ARRAY column instead of separate columns.
continuousFeatures — A set of INTEGER values that specify the input feature columns that the model should treat as continuous numeric variables for optimal tree splitting. The value must be a comma-separated list of feature column indexes (starting with 1) that correspond to numeric columns in your SELECT SQL statement, excluding the target column. For example, 'continuousFeatures' -> '1,2' would treat the first and second columns of your SELECT statement as continuous. If you do not specify this value, the model treats all features as categorical, which can impair results for numeric columns like prices, measurements, or scores.
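Several of these options reduce to simple arithmetic, which the following Python sketch works through. It is illustrative only: the function names are hypothetical, and measuring the size of a training subset as rows multiplied by columns is an assumption suggested by the name maxCellsToFetch, not something the option description states.

```python
import math

MAX_CELLS_DEFAULT = 32 * 1024 * 1024  # 33,554,432, the documented default

def rows_per_iteration(total_rows, fraction_selected=None):
    """Rows used per boosting iteration: CEIL(fractionSelected * totalRows)."""
    if fraction_selected is None:
        return total_rows  # option not set: all rows are used
    if not 0.0 < fraction_selected <= 1.0:
        raise ValueError("fractionSelected must be in (0.0, 1.0]")
    return math.ceil(fraction_selected * total_rows)

def parse_continuous_features(value, num_features):
    """Turn a value such as '1,2' into zero-based feature positions."""
    indexes = [int(part) for part in value.split(",")]
    for index in indexes:
        if not 1 <= index <= num_features:
            raise ValueError(f"feature index {index} is out of range")
    return [index - 1 for index in indexes]

def trains_in_memory(rows, columns, max_cells=MAX_CELLS_DEFAULT):
    """Assumed rule: stay in memory while rows * columns fits the threshold."""
    return rows * columns <= max_cells

print(rows_per_iteration(1001, 0.3))            # sampled rows per iteration
print(parse_continuous_features("1,2", 4))      # first two features
print(trains_in_memory(1_000_000, 5))           # 5 million cells
print(trains_in_memory(10_000_000, 5))          # 50 million cells
```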
Execute the Model
This example creates a Gradient Boosted Trees classifier to predict whether a customer stops using a service. The model trains on four input features (tenure_months, monthly_charges, total_charges, support_tickets) to predict the target variable (churn_label), filtering out any records with missing labels to ensure clean training.
The model uses these options:
'numChildren' -> '100' — Builds a sequence of 100 weak learners (regression trees). Each tree learns to correct the classification residuals (logistic loss) of all previous trees. A higher value can capture complex patterns, and you should pair it with an appropriate learning rate.
'learningRate' -> '0.1' — Scales the contribution of each new tree by 10 percent, balancing convergence speed and generalization. This value represents a small learning rate, making training more conservative and stable.
'lossFunction' -> 'log_loss' — Configures the model for classification through logistic loss. If you do not specify this value, the default value is 'squared_error' (regression), so explicitly setting 'log_loss' is mandatory for classifiers.
Execute the model in a SELECT query, passing in the feature columns as arguments.
The result contains these elements:
element [1] — The predicted class label
element [2] — An array of (class_label, probability) pairs
Model details are available in the sys.gradient_boosted_trees_models system catalog table.
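For a log_loss model, the relationship between raw scores, class probabilities, and the predicted label can be sketched in Python. This is a conceptual illustration, not the database's output format or internal code: the `classify` helper, the class names, and the raw score values are hypothetical.

```python
import math

def sigmoid(score):
    """Map one raw score to a probability (binary case)."""
    return 1.0 / (1.0 + math.exp(-score))

def softmax(scores):
    """Map a list of raw scores to probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(raw_scores):
    """Return (predicted_label, [(class_label, probability), ...]).

    raw_scores maps each class label to the raw score accumulated
    by the ensemble for that class.
    """
    labels = list(raw_scores)
    probabilities = softmax([raw_scores[label] for label in labels])
    pairs = list(zip(labels, probabilities))
    predicted = max(pairs, key=lambda pair: pair[1])[0]
    return predicted, pairs

label, pairs = classify({"churn": 1.2, "stay": -0.3})  # hypothetical scores
print(label)
print(pairs)
```

With two classes, the softmax probability of the first class equals the sigmoid of the difference between the two raw scores, which is why both functions appear in descriptions of logistic loss.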

