This list of options contains model options for all models in the System.
Association Rules
Model Options
Optional
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
Bagging
Model Options
Required
baseModels — This option specifies the children of the bagging model. You must specify this value as a JSON array, where each object in the array has the three fields type, count, and options.
taskType — This option specifies the type of task, which must be either CLASSIFICATION or REGRESSION depending on the type of model for training.
Optional
ROCNumSamples — If you set this option, you must specify a positive integer that represents the number of samples for the model to use when calculating the area under the ROC curve. You must also set the metrics option to true. The default value is the number of child models.
bootstrap — If you set this option to true, the model uses bootstrap sampling with replacement, meaning each child model trains on a random subset of the data (either the rowsPerChild or fractionSelected value sets the exact number of rows), and the same row can appear multiple times for each child. If you set this option to false, the model does not use replacement, meaning each row can appear at most one time per child. The default value is false.
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1. In the default state, the model considers no features as continuous.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
fractionSelected — If you set this option, the option represents the proportion of rows the model uses to train each child model. The value is a double that must be in the interval (0, 1]. You cannot set this option if you also set the rowsPerChild option to a positive value. The default behavior is that the model uses all available rows.
inputsPerChild — If you set this option, the option represents the number of features used to create each child model. The default value is the total number of features divided by 3 and rounded up.
maxChildThreads — If you set this option, the value must be an integer representing the maximum number of threads each child model can use. If a child accepts a maxThreads option, the model passes this value to the child.
maxThreads — If you set this option, the option represents the maximum number of parallel threads to use while the model trains. This value must be a positive integer. The default value is 16.
metrics — If you set this option to true, the system calculates certain metrics depending on the value of the taskType option. If you set the taskType option to CLASSIFICATION, the metrics are the percentage of correctly classified rows and the area under the ROC curve. If you set the value to REGRESSION, the metrics are the root mean square error and the adjusted R-squared. The default value is false.
noSnapshot — If you set this option to true, the data source must not change. In this case, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training a random forest. Child decision trees always have this option set to true, so the database does not create a separate intermediate table for each decision tree. The default value is false. Setting this option to true can speed up training when the training set is fixed.
requiredFeatures — If you set this option, the value must be a comma-separated list of integers representing features starting at index 1. The bagging model passes these features down to every child. The default value is an empty list, meaning there is no required feature.
rowsPerChild — If you set this option to a positive integer, the value represents the number of rows (from a random sample) to use for each child model. If you set this option to 0, each child uses all available rows. The default value is 0. You cannot set this option to a positive value if you also set the fractionSelected option.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
weighted — If you set this option, the system passes the value directly to child models that support it. The behavior depends on the model type of the target child.
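As a rough illustration of how the row-sampling options interact, the following Python sketch resolves the per-child row count from rowsPerChild and fractionSelected, draws rows with or without replacement according to bootstrap, and computes the documented default for inputsPerChild. The function names, the rounding of the fraction, and the clamping to the data set size are assumptions made for illustration only. The sketch does not show how the database implements sampling.

```python
import math
import random


def rows_for_child(total_rows, rows_per_child=0, fraction_selected=None):
    """Resolve how many rows each child trains on (illustrative only)."""
    if rows_per_child > 0 and fraction_selected is not None:
        # The two options are documented as mutually exclusive.
        raise ValueError("set either rowsPerChild or fractionSelected, not both")
    if rows_per_child > 0:
        # Assumption: a request larger than the data set is clamped.
        return min(rows_per_child, total_rows)
    if fraction_selected is not None:
        if not 0 < fraction_selected <= 1:
            raise ValueError("fractionSelected must be in the interval (0, 1]")
        # Assumption: round to the nearest whole row, with at least one row.
        return max(1, round(total_rows * fraction_selected))
    return total_rows  # documented default: all available rows


def sample_rows(row_ids, n, bootstrap, rng):
    """Pick n row IDs for one child model."""
    if bootstrap:
        # With replacement: the same row can appear several times.
        return [rng.choice(row_ids) for _ in range(n)]
    # Without replacement: each row appears at most one time.
    return rng.sample(row_ids, n)


def default_inputs_per_child(num_features):
    """Documented default: total features divided by 3, rounded up."""
    return math.ceil(num_features / 3)
```

For example, with 10 features the default inputsPerChild value works out to 4, and a fractionSelected value of 0.25 over 100 rows gives 25 rows per child under the rounding assumption above.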
Boosting
Model Options
Required
baseModels — This option specifies the children of the boosting model. You must specify this value as a JSON array, where each object in the array has the three fields type, count, and options.
learningRate — This option must be a decimal value between 0.0 and 1.0 that tunes how much the model learns from each successive child.
taskType — This option specifies the type of task, which must be either CLASSIFICATION or REGRESSION depending on the type of model for training.
Optional
ROCNumSamples — If you set this option, you must specify a positive integer that represents the number of samples for the model to use when calculating the area under the ROC curve. You must also set the metrics option to true. The default value is 10.
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1. In the default state, the model considers no features as continuous.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
fractionSelected — If you set this option, the option represents the proportion of rows the model uses to train each child model. The value is a double that must be in the interval (0, 1]. You cannot set this option if you also set the rowsPerChild option to a positive value. The default behavior is that the model uses all available rows.
inputsPerChild — If you set this option, the value must be an integer type that is greater than or equal to 1, which specifies the number of input features each boosting child should use. This value cannot exceed the number of input features available in the data set. When you specify this value, the algorithm deterministically cycles through pre-enumerated feature subsets to ensure each child uses exactly the specified number of features. When you do not specify this value, the model uses all available features for each child.
lossFunction — If you set this option, the value represents the loss function used by the model. Accepted values are: 'squared_error' and 'log_loss'. When you set this value to 'squared_error', the model calculates errors as the squared difference between predicted and actual values. The target column must contain numeric values. This is the default value when the taskType option is set to REGRESSION. When you set this value to 'log_loss', the model calculates errors using logistic loss. This is the default value when the taskType option is CLASSIFICATION.
maxThreads — If you set this option, the option represents the maximum number of parallel threads to use while the model trains. This value must be a positive integer. The default value is 16.
metrics — If you set this option to true, the system calculates certain metrics depending on the value of the taskType option. If the taskType option is set to CLASSIFICATION, the metrics are the percentage of correctly classified rows and the area under the ROC curve. If the taskType option is set to REGRESSION, the metrics are the root mean square error and the adjusted R-squared. The default value is false.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
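The inputsPerChild description says that the algorithm deterministically cycles through pre-enumerated feature subsets. One way to picture that behavior is the Python sketch below, which enumerates every subset of the requested size and assigns subsets to children in round-robin order. The enumeration order (lexicographic, through itertools.combinations) and the function names are assumptions. The source does not state which order the database uses.

```python
from itertools import combinations


def enumerate_feature_subsets(num_features, inputs_per_child):
    """List every feature subset of the requested size, using 1-based indexes."""
    if not 1 <= inputs_per_child <= num_features:
        raise ValueError("inputsPerChild must be between 1 and the feature count")
    # Assumption: subsets are enumerated in lexicographic order.
    return list(combinations(range(1, num_features + 1), inputs_per_child))


def features_for_child(child_index, subsets):
    """Cycle through the enumerated subsets so the assignment is deterministic."""
    return subsets[child_index % len(subsets)]
```

With 4 features and an inputsPerChild value of 2, this sketch enumerates 6 subsets, so the seventh child reuses the subset of the first child.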
Decision Tree
Model Options
Optional
ROCNumSamples — If you set this option, you must specify a positive integer that represents the number of samples for the model to use when calculating the area under the ROC curve. You must also set the metrics option to true. The default value is 10.
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1. In the default state, the model considers no features as continuous.
distinctCountLimit — If you set this option, the value must be a positive integer type. This value sets the limit for how many distinct values a non-continuous feature and label can contain. The default value is 256.
doPrune — If you set this option to true, the model uses Pessimistic Error Pruning (PEP) to prune the tree after training. The default value is false.
enableResplits — If you set this option, the value must be a boolean type that determines if the tree can reuse the same continuous feature multiple times along a single branch (e.g., split on x1 < 7 and later x1 < 3). This action can capture more complex, range-specific relationships. The default value is true, meaning that continuous features remain available for additional splits after use, thereby allowing the tree to create more complex decision boundaries. If you set this option to false, the model marks continuous features as exhausted after their first use, and the model cannot use them again in subsequent splits in the same tree.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
featureSubsetStrategy — If you set this option, the option specifies how many features the decision tree should consider at each split from the still-available features. When this value is higher, the model has a higher accuracy and lower variance, but takes longer to train. You can specify this option either as an integer (e.g., 4, meaning consider up to four features at each split) or one of these values: all (check every feature), sqrt (check up to the square root of the number of total features), and one-third (check up to one-third of the number of total features). The default value is all.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
maxCellsToFetch — If you set this option, the value must be a positive integer. This option controls the chunking behavior when the model fetches feature values during training. The limit is the maximum number of data cells (calculated as number of columns × number of rows) that the model can fetch in a single operation, not a byte limit. When the expected data size exceeds this threshold, the algorithm switches from in-memory processing to database-based processing using SQL queries. The default value is 33,554,432 cells (calculated as 32 × 1024 × 1024).
maxDepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of the decision tree (the maximum number of features to split on). The default is unspecified, which means there is no maximum depth.
maxRows — If you set this option, the value must be a positive integer. This option limits the number of rows used for model training by creating a snapshot table with only the specified number of rows from the input query. You cannot use this option when the noSnapshot option is set to true. Setting both options causes an invalid argument error during model creation. When this option is unspecified, the model trains using all rows from the input query.
maxThreads — If you set this option, the value must be a positive integer. This value indicates the maximum number of parallel threads to use while the model trains. The default value is 2.
metrics — If you set this option to true, the model also calculates the percentage of samples correctly classified by the model and saves this information in a catalog table. This option defaults to false.
noSnapshot — If you set this option to true, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training. The default value is false. In this case, the database creates and uses the intermediate table. Setting this option to true is useful when the training set is fixed. If the training set is a table that changes during training, set this option to false. Otherwise, the decision tree trainer can use different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, set this option to false because there is no guarantee that running that query twice generates the same 100 rows each time.
numSplits — If you set this option, the value must be an integer greater than 1. This value sets the maximum number of binary branches a continuous feature can consider. The default value is 32.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
skipLimitCheck — If you set this option to true, the model skips cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one specified by the distinctCountLimit option. The default value is false.
splitMetric — If you set this option, the option controls which function the model uses to evaluate the quality of a split during tree construction. Supported options are: gini_impurity (measures impurity based on class distributions), macro_f1 (uses the macro-averaged F1 score to guide splits), micro_f1 (uses the micro-averaged F1 score to guide splits), and weighted_f1 (uses the class-frequency-weighted F1 score to guide splits). The default value is gini_impurity.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
suppressJIT — If you set this option to true, the model suppresses just-in-time code generation.
weighted — If you set this option, the model considers weights for labels. If you set this option value to true, you must specify an additional column as a double in the training data for label weights. Rows with the same labels must have the same weights. If you set this value to auto, the model calculates weights automatically by weighting each label according to the ratio of the count of the most frequent label to the count of the specified label. As a result, the most frequent label has the weight 1.0 and the other label weights are higher. The default value is false, which means all labels have equal weight.
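The auto setting of the weighted option can be pictured with a short Python sketch. It follows the rule stated above: each label weight is the count of the most frequent label divided by the count of that label. The function name and the use of an in-memory list of labels are assumptions for illustration. The database computes these weights internally.

```python
from collections import Counter


def auto_label_weights(labels):
    """Weight each label by (count of most frequent label) / (count of label)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("at least one label is required")
    most_frequent = max(counts.values())
    # The most frequent label gets 1.0; rarer labels get larger weights.
    return {label: most_frequent / count for label, count in counts.items()}
```

For a training set with 6 rows labeled a, 3 rows labeled b, and 2 rows labeled c, this rule gives the weights 1.0, 2.0, and 3.0, respectively.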
Feedforward Network
Model Options
Required
hiddenLayerSize — You must set this option to a positive integer type that specifies the number of nodes in each hidden layer.
hiddenLayers — You must set this option to a positive integer type that specifies how many hidden layers to use.
lossFunction — This option specifies the loss function that all hidden layer nodes and all output layer nodes use. This function can be one of several predefined loss functions or a user-defined loss function. The predefined loss functions are squared_error (regression), vector_squared_error (vector-valued regression), log_loss (binary classification with target values of 0 and 1), logits_loss (binary classification with target values of 0 and 1), hinge_loss (binary classification with target values of -1 and 1), and cross_entropy_loss (multi-class classification). If the value for this required option is none of these strings, the model assumes a user-defined loss function. The user-defined loss function specifies the per-sample loss, and the actual loss function is the sum of this function applied to all samples. In a user-defined loss function, use the variable y to refer to the dependent variable in the training data and the variable f to refer to the computed estimate for the specified sample.
outputs — You must set this option to a positive integer that specifies the number of outputs.
Optional
activationFunction — If you set this option, the accepted values are linear, relu (rectified linear unit), leakyrelu (leaky rectified linear unit), tanh (hyperbolic tangent function), or sigmoid (fast sigmoid approximation). The default value is relu. This option affects all layers except the output layer.
adamBeta1 — If you set this option, the option represents the value of β₁ in the Adam optimization algorithm. For higher values of this option, training is less noisy but takes longer to converge. The default value is 0.9.
adamBeta2 — If you set this option, the option represents the value of β₂ in the Adam optimization algorithm. For higher values of this option, training is less noisy but takes longer to converge. The default value is 0.99.
adamEpsilon — If you set this option, the option represents the value of ε in the Adam optimization algorithm. For higher values of this option, training is more numerically stable but takes longer to converge. The default value is 1e-7.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
finiteDifferenceH — If you set this option, the value must be a double representing the step size (h) for approximating gradients using the finite difference method. The model uses this value only if analytical gradients are not active. This value should generally be a small positive value, typically from 0.0001 (1e-4) to 0.0000001 (1e-7). The default value is 0.00001 (1e-5).
gradientClipThreshold — If you set this option, the value must be a double that represents the gradient norm threshold for clipping. When the overall gradient norm exceeds this threshold, the system scales all gradient components uniformly to preserve direction. This operation prevents issues with exploding gradients in unstable loss landscapes. Set this value to 0 or a negative value to disable gradient clipping. The default value is 1000000 (1e6).
learningRate — If you set this option, the value must be a double type representing the base learning rate for the Adam (Adaptive Moment Estimation) machine learning optimizer. Adam adapts this rate individually for each parameter during training. A common starting point for Adam is 0.001 (1e-3). Valid values must be positive and are generally in the range of 0.00001 (1e-5) to 0.01 (1e-2). A higher learning rate can speed up training, but can cause the optimizer to overshoot and miss optimal solutions. Conversely, a lower learning rate ensures more stable and precise convergence but can make training much slower. If you do not specify this option, the system automatically selects a learning rate and adjusts it during training using the 1Cycle learning schedule. Specifying a learning rate disables automatic adjustment and instead uses a fixed learning rate value.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
maxInitParamValue — If you set this option, the value must be a floating-point number. This value sets the maximum for initial parameter values in the optimization algorithm. The default value is 1.
metrics — If you set this option to true, the model calculates the average value of the loss function.
minInitParamValue — If you set this option, the value must be a floating-point number. This value sets the minimum for initial parameter values in the optimization algorithm. The default value is -1.
normalize — If you set this option to true, the model applies z-score normalization to the inputs, stores the means and standard deviations, and automatically applies them at inference. The default value is true.
numEpochs — If you set this option, the value must be a positive integer type representing the maximum number of epochs, or full passes, during training through the entire data set. If you do not specify this option, the default maximum is 200, but training typically stops earlier due to automatic early stopping when the model has converged.
outputActivationFunction — If you set this option, the accepted values are linear, relu (rectified linear unit), leakyrelu (leaky rectified linear unit), tanh (hyperbolic tangent function), or sigmoid (fast sigmoid approximation). Different activation functions have different output ranges, so the chosen activation function should match the range of the dependent variable in your data. For example, if the dependent variable can take any value, choose linear. If the dependent variable is always positive, choose relu. If your outputs range from -1 to 1 or you perform hinge loss classification, tanh is a good option because the hyperbolic tangent function has the same range. If your outputs range from 0 to 1 or you perform log loss classification, sigmoid is a better choice for the same reason. The default value is linear. This option sets the activation function only for the output layer.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
randomSeed — If you set this option, the value must be a positive integer type representing the seed for the random number generator the system uses for weight initialization. Setting this option makes model training deterministic (given the same data and options). If you do not specify this option or set it to 0, the system uses a non-deterministic random seed.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
useSoftmax — If you set this option to true, the model applies a softmax function to the output of the output layer before computing the loss function. The default value is true if you set the lossFunction to cross_entropy_loss, and false otherwise.
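The gradientClipThreshold description amounts to clipping by the overall gradient norm. The Python sketch below shows that idea under two assumptions: the norm is the Euclidean (L2) norm over all gradient components, and the gradient is a flat list of floats. The function name is hypothetical, and the sketch is not the database implementation.

```python
import math


def clip_gradient(gradient, threshold=1e6):
    """Scale the gradient uniformly when its norm exceeds the threshold."""
    if threshold <= 0:
        # A value of 0 or a negative value disables clipping.
        return list(gradient)
    # Assumption: the overall gradient norm is the L2 norm.
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)
    # Uniform scaling keeps the direction and reduces only the magnitude.
    scale = threshold / norm
    return [g * scale for g in gradient]
```

For example, the gradient [3.0, 4.0] has a norm of 5. With a threshold of 1.0, the sketch scales it to approximately [0.6, 0.8], which points in the same direction with a norm of 1.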
Gaussian Mixture Model
Model Options
Required
numDistributions — This option must be a positive integer type that specifies the number of clusters of Gaussian distributions for the model to make.
Optional
epsilon — If you specify this option, the value must be a valid positive floating-point number. When the maximum distance that the entire best model moves in its n-dimensional space is less than this value, the algorithm terminates. The default value is 0.00000001 (1e-8).
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
maxIterations — If you set this option, the option represents the maximum number of optimization iterations to train the model. For higher values of this option, the model is likelier to converge to the expected epsilon, but it might take longer to train. The default value is 100.
normalize — If you set this option to true, the model automatically computes the mean and standard deviation of each feature and uses them to normalize the data during training. The default value is true.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
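The normalize option describes per-feature normalization with the mean and standard deviation. A minimal Python sketch of that idea follows. It assumes the population standard deviation and maps a constant feature (standard deviation of 0) to 0.0. Both choices, along with the function names, are assumptions, because the source does not specify them.

```python
import statistics


def fit_zscore(rows):
    """Compute the mean and standard deviation of each feature column."""
    columns = list(zip(*rows))
    means = [statistics.fmean(column) for column in columns]
    # Assumption: population standard deviation (divide by n).
    stds = [statistics.pstdev(column) for column in columns]
    return means, stds


def apply_zscore(row, means, stds):
    """Normalize one row with previously computed statistics."""
    # Assumption: a constant feature (std of 0) maps to 0.0.
    return [
        (value - mean) / std if std > 0 else 0.0
        for value, mean, std in zip(row, means, stds)
    ]
```

Storing the statistics returned by the first function and reusing them later mirrors the idea of normalizing new data the same way as the training data.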
Gradient Boosted Trees
Model Options
Required
learningRate — This option must be a decimal value between 0.0 and 1.0 that tunes how much the model learns from each successive child.
numChildren — This option must be an integer value that represents the total number of trees to build sequentially. Each tree learns to correct the errors of the previous trees.
Optional
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1. In the default state, the model considers no features as continuous.
enableResplits — If you set this option, the value must be a boolean type that determines if the tree can reuse the same continuous feature multiple times along a single branch (e.g., split on x1 < 7 and later x1 < 3). This action can capture more complex, range-specific relationships. The default value is true, meaning that continuous features remain available for additional splits after use, thereby allowing the tree to create more complex decision boundaries. If you set this option to false, the model marks continuous features as exhausted after their first use, and the model cannot use them again in subsequent splits in the same tree.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
fractionSelected — If you set this option, the option represents the proportion of rows the model uses to train each child model. The value is a double that must be in the interval (0, 1]. You cannot set this option if you also set the rowsPerChild option to a positive value. The default behavior is that the model uses all available rows.
inputsPerChild — If you set this option, the value must be an integer type greater than or equal to 1 that specifies the number of input features each boosting tree should use. This value cannot exceed the number of input features available in the data set. When you specify this value, the algorithm deterministically cycles through pre-enumerated feature subsets to ensure each tree uses exactly the specified number of features. The default behavior is that the model uses all available features for each tree.
lossFunction — If you set this option, the option represents the loss function used and determines the type of task the model does. Accepted values are: 'squared_error' and 'log_loss'. When you set this value to 'squared_error', the model calculates errors as the squared difference between predicted and actual values. The target column must contain numeric values. This is the default value for regression tasks. When you set this value to 'log_loss', the model calculates errors using logistic loss. This is the default value for classification tasks.
maxCellsToFetch — If you set this option, the value must be an integer type that determines the memory threshold to switch from training with system memory to training with SQL queries in the database. In-memory training is generally faster, but is limited by the available SQL Node memory. If the size of a training data subset exceeds this value, then the system performs training operations using SQL queries. The default value is 33,554,432 (calculated as 32 × 1024 × 1024).
maxDepth — If you set this option, the value must be a positive integer type that represents the maximum allowable depth of the child trees. The default value is 3.
maxThreads — If you set this option, the value must be a positive integer type that sets the maximum number of parallel threads to use for training each child decision tree. Parallel threads do not affect the sequential method of training each tree. The default value is 16.
metrics — If you set this option to true, the system calculates and stores final model metrics (R²/RMSE for regression or Accuracy/LogLoss for classification) on the training data. The default value is false.
numSplits — If you set this option, the value must be an integer greater than 1. This value sets the maximum number of binary branches a continuous feature can consider. The default value is 32.
resplitDepth — If you set this option, the value must be an integer type that sets the maximum depth at which tree nodes can be re-split during optimization. This option controls how deep the algorithm searches for better split points. The default value is 6.
resplitThreshold — If you set this option, the value must be a decimal type that sets the minimum improvement threshold required to trigger a re-split operation. Lower values allow more aggressive re-splitting but can increase training time. Higher values require larger improvements to trigger re-splits. The default value is 0.1.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
Kmeans
Model Options
Required
k — This option must be a positive integer type that specifies how many clusters to make.
Optional
epsilon — If you specify this option, the value must be a valid positive floating point value. When the maximum distance that a centroid moves from one iteration of the algorithm to the next is less than this value, the algorithm terminates. The default value is 0.00000001 (1e-8).
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
lloydRounds — If you set this option, the option represents the maximum number of iterations of the Lloyd algorithm to train the model after guessing the centroids. For higher values of this option, the model is more likely to be accurate but takes longer to train. The default value is 20.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
normalize — If you set this option to true, the model normalizes the data before the start of training. The default value is true.
oversampling — If you set this option, the option represents the number of candidate guesses for the model to choose in the parallel-round phase of k-means||. For higher values of this option, the model is more likely to be accurate but takes longer to train. The default value is k.
parallelRounds — If you set this option, the option represents the minimum number of parallel rounds for which the k-means|| algorithm runs. For higher values of this option, the model is more likely to be accurate but takes longer to train. The default value is 8.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
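The interaction between the epsilon and lloydRounds options can be sketched in a few lines of Python. This is a minimal illustration of the Lloyd refinement phase only, not the database's implementation: the function name lloyd, its parameter names, and the handling of empty clusters are assumptions made for the example, and the k-means|| seeding phase is not shown.

```python
import math

def lloyd(points, centroids, lloyd_rounds=20, epsilon=1e-8):
    """Refine initial centroids; return (centroids, rounds_used).

    Hypothetical helper: mirrors the documented stopping rules only.
    """
    centroids = [tuple(c) for c in centroids]
    rounds_used = 0
    for _ in range(lloyd_rounds):          # lloydRounds caps the iterations
        rounds_used += 1
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(
                    sum(m[j] for m in members) / len(members)
                    for j in range(len(old))))
            else:
                new_centroids.append(old)  # assumption: keep empty clusters in place
        # epsilon rule: stop when the largest centroid movement is below epsilon.
        max_shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if max_shift < epsilon:
            break
    return centroids, rounds_used
```

With four points forming two well-separated pairs, the loop in this sketch stops after the second round, because no centroid moves between the first and second assignment.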
K Nearest Neighbors
Model Options
Required
k — This option must be a positive integer type that specifies how many closest points to use for classifying a new point.
Optional
distance — If you set this option, the value must be a function in SQL syntax for calculating the distance between a point used for classification and points in the training data set. This function should use the variables x1, x2, … for the 1st, 2nd, … features in the training data set, and p1, p2, … for the features in the point for classification. The default value is the Euclidean distance function.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
normalize — If you set this option to true, the model automatically computes the mean and standard deviation of each feature and uses them to normalize the data during training. The default value is true.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
weight — If you specify this option, the value must be a function in SQL syntax for calculating the weight of a neighbor. The function should use the variable d for distance. By default, the weight is set to 1.0 / (d + 0.1), which avoids division by zero when a neighbor matches the input exactly and still allows the other neighbors to have some influence.
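The effect of the default weight function 1.0 / (d + 0.1) can be illustrated with a small Python sketch. The names default_weight and weighted_vote are hypothetical, and summing the weights per label to pick a winner is an assumption about how the weights are combined, not a statement of the database's internal logic.

```python
from collections import defaultdict

def default_weight(d):
    """Documented default weight: finite even when the distance d is 0."""
    return 1.0 / (d + 0.1)

def weighted_vote(neighbors, weight=default_weight):
    """Pick a label from (distance, label) pairs for the k closest points.

    Assumption for illustration: the label with the largest total weight wins.
    """
    totals = defaultdict(float)
    for d, label in neighbors:
        totals[label] += weight(d)
    return max(totals, key=totals.get)
```

For example, an exact match at distance 0 gets a weight of about 10, so under this sketch it outvotes two neighbors at distance 1 (about 0.91 each), whereas an unweighted majority vote would choose the other label.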
Linear Combination Regression
Model Options
Required
functionN — You must specify the first function using a key named 'function1'. Subsequent functions must use keys with names that use subsequent values of N. You must specify functions in SQL syntax and should use the variables x1, x2, ..., xn to refer to the 1st, 2nd, and nth independent variables, respectively. For example, 'function1' -> 'sin(x1 * x2 + x3)', 'function2' -> 'cos(x1 * x3)'.
Optional
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
gamma — If you set this option, the value must be a matrix. This value represents a Tikhonov gamma matrix used for regularization. For details, see Tikhonov regularization.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (R-squared), the adjusted coefficient of determination, and the root mean squared error (RMSE). The default value is false.
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. The default value is true.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
threshold — This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from the coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to the coefficients. For any coefficients between the negative and positive threshold values, the model sets the coefficients to zero.
weighted — If you set this option to true, the model performs weighted least squares regression, where each sample has an associated weight or importance. When weighted, there is an extra numeric column after the dependent variable that represents the weight of the sample. The default value is false.
yIntercept — If you set this option, the value must be numeric. The system forces the model to use the specified y-intercept (i.e., the model value when all inputs are zero) instead of fitting one.
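The soft thresholding that the threshold option describes can be written as a short Python function. This is a sketch of the rule as stated in the option text; the function name soft_threshold is hypothetical and the code is not taken from the database.

```python
def soft_threshold(coefficients, threshold):
    """Shrink each coefficient toward zero by threshold; zero out small ones."""
    result = []
    for c in coefficients:
        if c > threshold:
            result.append(c - threshold)   # above +threshold: subtract it
        elif c < -threshold:
            result.append(c + threshold)   # below -threshold: add it
        else:
            result.append(0.0)             # within [-threshold, +threshold]: zero
    return result
```

With a threshold of 0.5, the coefficients 3.0, -2.0, 0.25, and -0.5 become 2.5, -1.5, 0.0, and 0.0. The same rule applies to the threshold option of the other regression models in this list.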
Linear Discriminant Analysis
Model Options
Optional
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
normalize — If you set this option to true, the model automatically computes the mean and standard deviation of each feature and uses them to normalize the data during training. The default value is true.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
Logistic Regression
Model Options
Optional
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
metrics — If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in the sys.logistic_regression_models system catalog table. The default value is false.
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. The default value is true.
numEpochs — If you set this option, the value must be a positive integer type representing the maximum number of IRLS iterations during training. The default value is 20.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
Multiple Linear Regression
Model Options
Optional
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
gamma — If you set this option, the option must be a matrix. The value represents a Tikhonov gamma matrix used for regularization. For details, see Tikhonov regularization. The model uses this option for ridge regression.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (R-squared), the adjusted coefficient of determination, and the root mean squared error (RMSE). The default value is false.
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. The default value is true.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
threshold — If you set this option, the option enables soft thresholding. The value must be a positive number. After the model calculates the coefficients, if any coefficients exceed the threshold value, the model subtracts the threshold value from those coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to the coefficients. For any coefficients that are between the negative and positive threshold values, the model sets the coefficients to zero.
weighted — If you set this option to true, the model performs weighted least squares regression, where each sample has a weight or importance associated with it. In this case, the table contains an additional numeric column after the dependent variable, which contains the weight for the sample. The default value is false.
yIntercept — If you set this option, the value must be numeric. The system forces the model to use the specified y-intercept (i.e., the model value when all inputs are zero) instead of fitting one.
Naive Bayes
Model Options
Optional
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1. In the default state, the model considers no features as continuous.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
metrics — If you set this option to true, the model calculates the percentage of samples correctly classified by the model and saves this information in a system catalog table. The default value is false.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
Nonlinear Regression
Model Options
Required
function — Specify the function to fit to the data in SQL syntax. Use a1, a2, … to refer to the parameters for optimization. Use x1, x2, … to refer to the input features. The model allows only scalar expressions that can be represented internally as postfix expressions, so some SQL functions are not allowed. Most notably, the model does not allow some functions that are rewritten as CASE statements (like least() and greatest()). If your function is not allowed, the model displays an error message.
numParameters — Specify this option as a positive integer. This value specifies the number of different parameters to optimize, i.e., how many different aN variables there are in the user-specified function.
Optional
adamBeta1 — If you set this option, the option represents the value of β₁ in the Adam optimization algorithm. For higher values of this option, training is less noisy but takes longer to converge. The default value is 0.9.
adamBeta2 — If you set this option, the option represents the value of β₂ in the Adam optimization algorithm. For higher values of this option, training is less noisy but takes longer to converge. The default value is 0.99.
adamEpsilon — If you set this option, the option represents the value of ε in the Adam optimization algorithm. For higher values of this option, training is more numerically stable but takes longer to converge. The default value is 1e-7.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
finiteDifferenceH — If you set this option, the value must be a double type representing the step size (h) for approximating gradients using the finite difference method. The model uses this value only if analytical gradients are not active. This value should generally be a small positive number, typically from 0.0001 (1e-4) to 0.0000001 (1e-7). The default value is 0.00001 (1e-5).
gradientClipThreshold — If you set this option, the value must be a double that represents the gradient norm threshold for clipping. When the overall gradient norm exceeds this threshold, the system scales all gradient components uniformly to preserve direction. This operation prevents issues with exploding gradients in unstable loss landscapes. Set this value to 0 or a negative value to disable gradient clipping. The default value is 1000000 (1e6).
lassoCoefficient — If you specify this option, the value must be a double data type. This option is the lasso coefficient for the loss function. If you do not specify this option, the loss function omits the lasso term, which is equivalent to setting this option to 0.0.
learningRate — If you set this option, the value must be a double type representing the base learning rate for the Adam (Adaptive Moment Estimation) machine learning optimizer. Adam adapts this rate individually for each parameter during training. A common starting point for Adam is 0.001 (1e-3). Valid values must be positive and are generally in the range of 0.00001 (1e-5) to 0.01 (1e-2). A higher learning rate can speed up training, but can cause the optimizer to overshoot and miss optimal solutions. Conversely, a lower learning rate ensures more stable and precise convergence but can make training much slower. If you do not specify this option, the system automatically selects a learning rate and adjusts it during training using the 1Cycle learning schedule. Specifying a learning rate disables automatic adjustment and instead uses a fixed learning rate value.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
lossFunction — If you set this option, the option indicates to the nonlinear optimizer the loss function to use on a per-sample basis. The actual loss function is then the sum of this function applied to all samples. The function should use the variable y to refer to the dependent variable in the training data and the variable f to refer to the computed estimate for the specified sample. The default is the least squares function, which you can specify as (f-y)*(f-y).
maxInitParamValue — If you specify this option, the value must be a floating-point number. This option sets the maximum for initial parameter values in the optimization algorithm. The default value is 1.
metrics — If you set this option to true, the model calculates the coefficient of determination (R-squared), the adjusted R-squared, and the root mean squared error (RMSE). However, the model calculates these quality metrics using the least squares loss function, and not the user-specified loss function, because these metrics only make sense for least squares. The default value is false.
minInitParamValue — If you specify this option, the value must be a floating-point number. This option sets the minimum for initial parameter values in the optimization algorithm. The default value is -1.
numEpochs — If you set this option, the value must be a positive integer type representing the maximum number of epochs, or full passes, during training through the entire data set. If you do not specify this option, the default maximum value is 200 for Adam optimization or 100 for Levenberg-Marquardt, but training typically stops earlier due to automatic early stopping when the model has converged.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
randomSeed — If you set this option, the value must be a positive integer type representing the seed for the random number generator the system uses for weight initialization. Setting this option makes model training deterministic (given the same data and options). If you do not specify this option or set it to 0, the system uses a non-deterministic random seed.
ridgeCoefficient — If you specify this option, the value must be a double data type. This option is the ridge coefficient for the loss function. If you do not specify this option, the loss function omits the ridge term, which is equivalent to setting this option to 0.0.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
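The gradientClipThreshold and finiteDifferenceH options can be illustrated with two small Python helpers. Both are sketches under stated assumptions rather than the database's code: the function names are hypothetical, the clipping example assumes the Euclidean (L2) norm for the "overall gradient norm", and the gradient approximation assumes a forward difference, because the option text does not say which finite difference variant is used.

```python
import math

def clip_gradient(gradient, threshold=1e6):
    """Scale the whole gradient so its norm does not exceed threshold.

    A threshold of 0 or less disables clipping, as the option text states.
    Assumption: the norm is the Euclidean (L2) norm.
    """
    if threshold <= 0:
        return list(gradient)
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= threshold:
        return list(gradient)
    scale = threshold / norm               # uniform scale preserves direction
    return [g * scale for g in gradient]

def finite_difference_gradient(loss, params, h=1e-5):
    """Approximate d(loss)/d(params) with step size h.

    Assumption: forward difference, (loss(a + h) - loss(a)) / h.
    """
    base = loss(params)
    grad = []
    for i in range(len(params)):
        shifted = list(params)
        shifted[i] += h
        grad.append((loss(shifted) - base) / h)
    return grad
```

In this sketch, clipping the gradient (3, 4) with a threshold of 1 yields approximately (0.6, 0.8): the norm drops from 5 to 1 while the direction stays the same.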
Polynomial Regression
Model Options
Required
order — This option is the degree of the polynomial and must be set to a positive integer.
Optional
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
gamma — If you specify this option, the value must be a matrix. The value represents a Tikhonov gamma matrix that is used for regularization. For details, see Tikhonov regularization.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (R-squared), the adjusted coefficient of determination, and the root mean squared error (RMSE). The default value is false.
negativePowers — If you set this option to true, the model includes independent variables raised to negative powers. The resulting expressions are named Laurent polynomials. The model generates all possible terms such that, within each product, the sum of the absolute values of the powers is less than or equal to the order. For example, with two independent variables and the order set to 2, the model is: y = a1*x1^2 + a2*x1^-2 + a3*x2^2 + a4*x2^-2 + a5*x1*x2 + a6*x1^-1*x2 + a7*x1*x2^-1 + a8*x1^-1*x2^-1 + a9*x1 + a10*x1^-1 + a11*x2 + a12*x2^-1 + b. The default value is false.
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. The default value is true.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
threshold — This option enables soft thresholding. If you specify this option, the value must be a positive number. After the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from those coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to those coefficients. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
weighted — If you set this option to true, the model performs weighted least squares regression, where each sample has an associated weight. When weighted, there is an extra numeric column after the dependent variable that has the weight for the sample. The default value is false.
yIntercept — If you set this option, the value must be numeric. The system forces the model to use the specified y-intercept (i.e., the model value when all inputs are zero) instead of fitting one.
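The term-generation rule for the negativePowers option can be checked with a short Python enumeration. This sketch only lists exponent combinations that satisfy the documented rule (the sum of the absolute values of the powers in a product is at most the order); the function name laurent_exponents is hypothetical, and the ordering of terms is not meant to match the database's coefficient numbering.

```python
from itertools import product

def laurent_exponents(num_variables, order):
    """Return exponent tuples (p1, ..., pn) with 0 < |p1| + ... + |pn| <= order.

    The all-zero tuple is excluded because it corresponds to the constant term.
    """
    powers = range(-order, order + 1)
    return [
        exps
        for exps in product(powers, repeat=num_variables)
        if 0 < sum(abs(p) for p in exps) <= order
    ]
```

For two independent variables and an order of 2, this enumeration produces 12 exponent pairs, which matches the 12 non-constant terms in the example model for the negativePowers option.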
Principal Component Analysis
Model Options
Optional
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
Random Forest
Model Options
Required
numChildren — Number of child decision trees.
Optional
ROCNumSamples — If you set this option, you must specify a positive integer that represents the number of samples for the model to use when calculating the area under the ROC curve. You must also set the metrics option to true. The default value is the number of child decision trees.
bootstrap — If you set this option to true, the model uses bootstrap sampling with replacement, meaning the model trains each tree in the random forest on a random subset of the data (either the rowsPerChild or fractionSelected option sets the exact number of rows), and the same row can appear multiple times in each tree. If you set this option to false, the model does not use replacement, meaning each row can appear at most once per tree. The default value is false.
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1. In the default state, the model considers no features as continuous.
distinctCountLimit — If you set this option, the value must be a positive integer. This value limits how many distinct values a non-continuous feature and the label can contain. The default value is 256.
doPrune — If you set this option to true, the model uses Pessimistic Error Pruning (PEP) to prune the tree after training. The default value is false.
enableResplits — If you set this option, the value must be a boolean type that determines if the tree can reuse the same continuous feature multiple times along a single branch (e.g., split on x1 < 7 and later x1 < 3). This action can capture more complex, range-specific relationships. The default value is true, meaning that continuous features remain available for additional splits after use, allowing the tree to create more complex decision boundaries. When you set this option to false, the model marks continuous features as exhausted after their first use, and the model cannot use them again in subsequent splits in the same tree. The model passes this option directly to the child decision trees.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
featureSubsetStrategy — If you set this option, the model passes this option directly to the child decision trees. The option specifies how many features the child decision trees should consider at each split from the still-available features. When this value is higher, the model has higher accuracy and lower variance but takes longer to train. You can specify this option either as an integer (e.g., 4, meaning consider up to four features at each split) or one of these values: all (check every feature), sqrt (check up to the square root of the number of total features), and one-third (check up to one-third of the number of total features). The default value is all.
fractionSelected — If you set this option, the option represents the proportion of rows the model uses to train each child model. The value is a double that must be in the interval (0, 1]. You cannot set this option if you also set the rowsPerChild option to a positive value. The default behavior is that the model uses all available rows.
inputsPerChild — If you set this option, the option specifies the number of features for the creation of each child decision tree. The default value is the number of features you specify for the forest divided by 3 and rounded up.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
maxCellsToFetch — If you set this option, the value must be a positive integer, and the model passes this option directly to the child decision trees. This option controls the chunking behavior when fetching feature values during model training. The limit represents the maximum number of data cells (calculated as the number of columns × number of rows) that the system can fetch in a single operation, not a byte limit. When the expected data size exceeds this threshold, the algorithm switches to database-based processing using SQL queries instead of in-memory processing. The default value is 33,554,432 cells (calculated as 32 × 1024 × 1024).
maxChildThreads — If you set this option, the value must be an integer type representing the maximum number of threads each child decision tree can use. The default value is 1.
maxDepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of the decision tree. The default value is 3.
maxThreads — If you set this option, the option specifies the maximum number of parallel threads to use while the model trains decision trees. This value must be a positive integer. The default value is 16.
metrics — If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the model for the random forest and saves this information in a system catalog table. The default value is false.
noSnapshot — If you set this option to true, the data source must not change. In this case, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training a random forest. Child decision trees always have this option set to true, so the database does not create a separate intermediate table for each decision tree. The default value is false. Setting this option to true is useful when the training set is fixed. If the training set is a table with modifications, set this option to false as the decision tree trainer uses different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, then set this option to false because there is no guarantee that running that query twice generates the same 100 rows each time.
numSplits — If you set this option, the value must be an integer type greater than 1. This value sets the maximum number of binary branches a continuous feature can consider. The default value is 32.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
requiredFeatures — If you set this option, the option must be a comma-separated list of integers as strings representing specific features where the first feature has the value 1. The model uses these features in every decision tree in the forest. The default behavior is that the decision tree in the forest can train on any feature in the list.
rowsPerChild — If you set this option to a positive integer, the number represents the number of rows (from a random sample) to use for each decision tree. If you set this option to 0, each child uses all available rows. The default value is 0. You cannot set this option to a positive value if you also set the fractionSelected option.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
skipLimitCheck — If you set this option to true, the model skips cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one that is specified by the distinctCountLimit option. This option defaults to false.
splitMetric — If you set this option, the option controls which function the model uses to evaluate the quality of a split during tree construction. Supported options are: gini_impurity (measures impurity based on class distributions), macro_f1 (uses macro-averaged F1 score to guide splits), micro_f1 (uses micro-averaged F1 score to guide splits), and weighted_f1 (uses class-frequency-weighted F1 score to guide splits). The default value is gini_impurity.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
weighted — If you set this option, the model considers weights for labels. If you set this option value to true, you must specify an additional column as a double in the training data for label weights. Rows with the same labels must have the same weights. If you set this value to auto, the model calculates weights automatically by weighting each label according to the ratio of the count of the most frequent label to the count of the specified label. As a result, the most frequent label has a weight of 1.0, and the other label weights are higher. This option defaults to false, which means all labels have equal weight.
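The automatic weighting described for the weighted option (the auto value) can be illustrated with a short Python sketch. It applies only the stated ratio, the count of the most frequent label divided by the count of each label; the function name and sample labels are hypothetical.

```python
from collections import Counter


def auto_label_weights(labels):
    """Weight each label by (count of most frequent label) / (count of this label)."""
    counts = Counter(labels)
    most_frequent = max(counts.values())
    return {label: most_frequent / n for label, n in counts.items()}


# Hypothetical label column: 6 rows of "a", 3 rows of "b", 2 rows of "c".
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
weights = auto_label_weights(labels)
print(weights)  # {'a': 1.0, 'b': 2.0, 'c': 3.0}
```

The most frequent label receives a weight of 1.0, and rarer labels receive proportionally larger weights.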
Regression Tree
Model Options
Optional
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1. In the default state, the model considers no features as continuous.
distinctCountLimit — If you set this option, the value must be a positive integer. This value sets the limit for the number of distinct values a non-continuous feature and the label can contain. This option defaults to 256.
enableResplits — If you set this option, the value must be a boolean type that determines if the tree can reuse the same continuous feature multiple times along a single branch (e.g., split on x1 < 7 and later x1 < 3). This action can capture more complex, range-specific relationships. The default value is true, meaning that continuous features remain available for additional splits after use, which allows the tree to create more complex decision boundaries. When you set this option to false, the model marks continuous features as exhausted after their first use, and the model cannot use them again in subsequent splits in the same tree.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
featureSubsetStrategy — If you set this option, the option specifies how many features the regression tree should consider at each split from the still-available features. When this value is higher, the model has higher accuracy and lower variance but takes longer to train. You can specify this option either as an integer (e.g., 4, meaning consider up to four features at each split) or one of these values: all (check every feature), sqrt (check up to the square root of the number of total features), and one-third (check up to one-third of the number of total features). The default value is all.
maxCellsToFetch — If you set this option, the value must be a positive integer. This option controls the chunking behavior when fetching feature values during model training. The limit represents the maximum number of data cells (calculated as the number of columns × number of rows) that the system can fetch in a single operation, not a byte limit. When the expected data size exceeds this threshold, the algorithm switches to database-based processing using SQL queries instead of in-memory processing. The default value is 33,554,432 cells (calculated as 32 × 1024 × 1024).
maxDepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of the decision tree (the maximum number of features to split on). The default is unspecified, which means there is no maximum depth.
maxRows — If you set this option, the value must be a positive integer. This option limits the number of rows used for model training by creating a snapshot table with only the specified number of rows from the input query. This option cannot be used with noSnapshot -> true (attempting to set both results in an invalid argument error during model creation). When this option is unspecified, the model trains using all rows from the input query.
maxThreads — If you set this option, the value must be a positive integer. This value indicates the maximum number of parallel threads to use while the model trains. The default value is 2.
metrics — If you set this option to true, the model also calculates the percentage of samples correctly classified by the model and saves this information in a system catalog table. The default value is false.
noSnapshot — If you set this option to true, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training. This option defaults to false. In this case, the database creates and uses the intermediate table. Setting this option to true is useful when the training set is fixed. If the training set is a table with modifications, set this option to false, as the decision tree trainer uses different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, then set this option to false because there is no guarantee that executing that query twice generates the same 100 rows each time.
numSplits — If you set this option, the value must be an integer greater than 1. This value sets the maximum number of binary branches a continuous feature can consider. The default value is 32.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
resplitDepth — If you set this option, the value must be an integer type that sets the maximum depth at which tree nodes can be re-split during optimization. This option controls how deep the algorithm searches for better split points. The default value is 6.
resplitThreshold — If you set this option, the value must be a decimal type that sets the minimum improvement threshold required to trigger a re-split operation. Lower values (e.g., 0.01) allow more aggressive re-splitting but can increase training time. Higher values (e.g., 1.0) require larger improvements to trigger re-splits. The default value is 0.1.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
skipLimitCheck — If you set this option to true, the model skips cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one that you specify using the distinctCountLimit option. The default value is false.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
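The cell arithmetic behind the maxCellsToFetch option can be checked with a short Python sketch. It only reproduces the columns × rows comparison described above; the function name and the returned mode labels are hypothetical, and the database makes its own estimate of the expected data size.

```python
DEFAULT_MAX_CELLS_TO_FETCH = 32 * 1024 * 1024  # 33,554,432 cells


def fetch_mode(num_columns, num_rows, max_cells=DEFAULT_MAX_CELLS_TO_FETCH):
    """Return which processing path the cell count implies (illustrative labels)."""
    cells = num_columns * num_rows
    # Exceeding the limit means switching from in-memory to SQL-based processing.
    return "database" if cells > max_cells else "in-memory"


print(DEFAULT_MAX_CELLS_TO_FETCH)     # 33554432
print(fetch_mode(20, 1_000_000))      # in-memory (20,000,000 cells)
print(fetch_mode(20, 2_000_000))      # database (40,000,000 cells)
```

Because the limit counts cells rather than bytes, a wide table reaches it with fewer rows than a narrow one.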
Simple Linear Regression
Model Options
Optional
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (R-squared) and the root mean squared error (RMSE). The default value is false.
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. The default value is true.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
threshold — This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any are greater than the threshold value, the threshold value is subtracted from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
yIntercept — If you set this option, then the option must be a numeric value. The system forces the specific y-intercept (i.e., the model value when x is zero).
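To show why the normalize option leaves the persisted model in the original units, the following Python sketch fits a line on a standardized input and then unscales the parameters. It is a minimal sketch under stated assumptions: only the input feature is scaled, the population standard deviation is used, and all function names and data values are hypothetical. The database implementation may differ.

```python
import math
from statistics import fmean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_line_normalized(xs, ys):
    """Fit on z-scored x, then convert the parameters back to original units."""
    mu, sigma = fmean(xs), pstdev(xs)
    zs = [(x - mu) / sigma for x in xs]
    slope_z, intercept_z = fit_line(zs, ys)
    # y = slope_z * (x - mu) / sigma + intercept_z, rearranged in terms of x.
    slope = slope_z / sigma
    intercept = intercept_z - slope_z * mu / sigma
    return slope, intercept


xs = [1000.0, 2000.0, 3000.0, 4000.0]
ys = [3.1, 4.9, 7.2, 8.8]
direct = fit_line(xs, ys)
unscaled = fit_line_normalized(xs, ys)
# Both routes should describe the same line, up to floating-point error.
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(direct, unscaled)))
```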
Stacking
Model Options
Required
levelOneModel — This option specifies the level-1 child models of the stacking model. You must specify this value as a JSON array, where each object in the array has the five fields type (required), name, options, ignoreColumn, and extraCallArguments.
levelZeroModels — This option specifies the level-0 child models of the stacking model. You must specify this value as a JSON array, where each object in the array has the five fields type (required), name, options, ignoreColumn, and extraCallArguments.
Optional
extraColumnCount — If you set this option, the value must be an integer type that specifies how many non-feature columns there are in the input data. The default value is 0.
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
hasLabelColumn — If you set this option, the value must be a boolean type that specifies whether the input data includes a label column. The default value is true.
maxThreads — If you set this option, the option specifies the maximum number of parallel threads to use while the model trains. This value must be a positive integer. The default value is 16.
noSnapshot — If you set this option to true, the data source must not change. In this case, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training a random forest. Child decision trees always have this option set to true, so the database does not create a separate intermediate table for each decision tree. The default value is false. Setting this option to true can speed up training when the training set is fixed.
preservedColumnsForLevelOne — If you set this option, this option specifies the columns from the original training data to pass as an input column to the level-1 model, in addition to the level-0 outputs. This value should be a comma-separated list of integers starting at 1. The default behavior is to preserve none of the columns.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
Support Vector Machine
Model Options
Optional
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
functionN — By default, SVM uses a linear kernel. If you use a different kernel, you must provide a list of functions that are summed together, just like with linear combination regression. You must specify the first function using a key named 'function1'. Subsequent functions must use keys with names that use subsequent values of N. You must specify functions in SQL syntax and use the variables x1, x2, ..., xn to refer to the 1st, 2nd, and nth independent variables, respectively. You can specify the default linear kernel as: 'function1' -> 'x1', 'function2' -> 'x2', and so on. The model always adds a constant term equivalent to 'functionN' -> '1.0' that you do not need to specify explicitly.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
metrics — If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.
normalize — If you set this option to true, the model automatically computes the mean and standard deviation of each feature and uses them to normalize the data during training. The default value is true.
numEpochs — If you set this option, the value must be a positive integer type representing the maximum number of IRLS iterations during training. The default value is 20.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
regularizationCoefficient — If you set this option, the value must be a valid floating-point number. Use this option to control the balance of finding a wide margin and minimizing incorrectly classified points in the loss function. A larger (and positive) value makes having a wide margin around the hypersurface more important relative to the incorrectly classified points. Because of how the system implements SVM, the values for this option are likely different than values used in other common SVM implementations. The default value is 1.0 / 1000000.0.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
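The functionN naming scheme can be generated programmatically when a model has many independent variables. The Python sketch below builds the keys for the default linear kernel and then appends one extra term. The helper names are hypothetical, and the rendered 'key' -> 'value' text is an assumption based on the notation used on this page, so verify the exact option syntax against the model creation reference before using it.

```python
def linear_kernel_functions(num_features):
    """Return function1..functionN keys that spell out the default linear kernel."""
    if num_features < 1:
        raise ValueError("num_features must be at least 1")
    return {f"function{i}": f"x{i}" for i in range(1, num_features + 1)}


def render_options(options):
    """Render a mapping as 'key' -> 'value' pairs (assumed notation)."""
    return ", ".join(f"'{key}' -> '{value}'" for key, value in options.items())


options = linear_kernel_functions(2)
# Add an interaction term as the next function; the constant term is implicit.
options["function3"] = "x1 * x2"
rendered = render_options(options)
print(rendered)
# 'function1' -> 'x1', 'function2' -> 'x2', 'function3' -> 'x1 * x2'
```

The constant term is left out deliberately, because the model always adds it.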
Vector Autoregression
Model Options
Required
numLags — Specify this option as a positive integer for the number of lags in the model.
numVariables — Specify this option as a positive integer for the number of variables in the model.
Optional
featureArray — If you set this option to true, the model expects only one array-type input column instead of multiple columns of training data. Each array row in the input column must be the same size. The default value is false.
featureArrayElements — If you set this option, the featureArray option must be set to true. The value must be a comma-separated list of integers representing indexes of the input array to use starting at index 1. The system uses all indexes of the input array by default.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
metrics — If you set this option to true, the function collects the metric for the coefficient of determination (R-squared). The default value is false.
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. The default value is true.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
suppressArrayLengthCheck — If you set this option, the featureArray option must be set to true. The system skips checking that the array length is the same size for all rows in the input. The default value is false.
threshold — This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any of them are greater than the threshold value, the threshold value is subtracted from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
