supports regression models that investigate the relationship between independent variables and at least one dependent variable or outcome. When trained, regression models can help establish the relationship between variables by estimating how one affects the other. To create the model, use theDocumentation Index
Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
Use this file to discover all available pages before exploring further.
CREATE MLMODEL syntax. For details, see CREATE MLMODEL.
Regression Model Common Options
The regression model types have many common options that provide similar functionality. This table describes the purpose of the more common options.| Option | Notes |
|---|---|
metrics | The model supports the metrics option for all regression types. If this option is set to true, then the model also calculates the coefficient of determination (also known as the R^2 value), the root mean squared error (RMSE), and the adjusted R^2 value. If the option is not specified, then the model defaults this option to false. |
threshold | Most regression model types support the threshold option, which enables soft-thresholding. This model type is the same as lasso regression for simple linear regression. However, this model type is not quite the same as lasso regression for other model types. If you specify a non-zero threshold, the model shifts coefficients towards zero by the amount of the threshold. If this shift causes the coefficient to switch from positive to negative or vice versa, the model sets the coefficient to zero. The system performs this coefficient adjustment as a final step before the system saves the model coefficients. Any reported coefficient of determination is based on the model prior to adjustment. The adjustment always causes the coefficient of determination to be smaller. However, this approach is useful as it prevents overfitting. |
gamma | Some regression model types support the gamma option. The value of this option is the Tikhonov square matrix in the form: {{a, b, c, ...}, {d, e, f, ...}, {g, h, i, ...}}. If you do not specify this option, the model defaults the option to the 0 matrix, which equates to multiple linear regression without regularization. You can perform ridge regression by using a gamma value. This is a bit complicated for the polynomial and linear combination regression model types. The system computes these model types by using multiple linear regression after applying the functions to be linearly combined to the input arguments. In these cases, the system applies ridge regression after applying the model functions to the model input values. |
weighted | Some regression model types also support weighted least squares. The model types do not support generalized least squares, where the weight matrix can be non-diagonal. To use weighted least squares, add the option weighted and set it to true. Then, add one more column to the result set used to create the model. This new column must appear last, after the dependent variable, and it specifies the weight for that sample. |
Model option names are case-sensitive.
Simple Linear Regression
Model Type:SIMPLE LINEAR REGRESSION
Trains a new machine learning model of type <model_type> on the result set returned by the SQL SELECT statement. After the database creates the model, <model_name> becomes an executable function in SQL SELECT statements.
SQL
x). The second column is the dependent variable (referred to as y). The model finds the least squares best fit for y = ax + b.
The simple linear regression model type also supports the yIntercept option, which is the y-intercept of the resulting best-fit line. If you do not specify this option, the model does not force the y-intercept to be any particular value, and the model uses least squares to find the best value instead. If you force the y-intercept to be a particular value, the uses least squares to find the best fit with that constraint.
Model Options
Optional
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2) and the root mean squared error (RMSE). This option defaults to false.
yIntercept — If you set this option, then the option must be a numeric value. The system forces the specific y-intercept (i.e., the model value when x is zero).
threshold — This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any are greater than the threshold value, the threshold value is subtracted from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. This option defaults to true.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
Execute the Model
Create a simple linear regression model with y-intercept10. Collect metrics for the model execution by setting the metrics option to true.
SQL
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model after training, the model takes a single numeric argument that represents x and returns ax + b.
SQL
sys.simple_linear_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Multiple Linear Regression
Model Type:MULTIPLE LINEAR REGRESSION
Multiple linear regression means that there is a vector of independent variables and the dependent variable is a scalar-valued function of the vector input, and that it is linear in all vector components.
The model uses the result set as an input. The result set has N columns that must be numeric. The first N - 1 columns are the independent variables (it can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least squares best fit for y = a1 * x1 + a2 * x2 + ... + b, or in vector notation, y = ax + b, where a and x are vectors and the multiplication is a dot product.
Model Options
Optional
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to false.
threshold — This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from the coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to the coefficients. For any coefficients that are between the negative and positive threshold values, the model sets the coefficients to zero.
weighted — If you set this option to true, the model performs weighted least squares regression, where each sample has a weight or importance associated with it. In this case, there is an extra numeric column after the dependent variable that has the weight for the sample. This option defaults to false.
gamma — If you set this option, the option must be a matrix. The value represents a Tikhonov gamma matrix used for regularization. For details, see Tikhonov regularization. The model uses this option for ridge regression.
yIntercept — If you set this option, then the option must be a numeric value. The system forces the specific y-intercept (i.e., the model value when x is zero).
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. This option defaults to true.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
Execute the Model
Create a multiple linear regression model. Collect metrics for the model execution by setting themetrics option to true.
SQL
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable.
SQL
sys.multiple_linear_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Vector Autoregression
Model Type:VECTOR AUTOREGRESSION
Vector auto-regression is a model that estimates the next value of multiple variables based on some number of lags of all variables, as a group. The model tries to build the following:
Text
x1(t) means the value of x1 at time t, and x1(t-1) means the value of x1 at time t-1 (typically the previous sample time). The syntax <x1(t), x2(t)> demonstrates that the result of the model is a row vector that contains all the predictions of the model and that all predictions rely on all lags of all variables.
When you create a model, the input result set must have one more column than the number of lags. Each column must be a row vector of a size equal to the number of variables. The first column is the un-lagged values, e.g., {{x1, x2, x3}}. The second column is the first lag for all variables, e.g., {{x1_lag1, x2_lag2, x3_lag3}}.
For functions to help organize data before training a model, see Data Preparation.
Model Options
Required
numVariables — Specify this option as a positive integer for the number of variables in the model.
numLags — Specify this option as a positive integer for the number of lags in the model.
Optional
metrics — If you set this option to true, the function collects the metric for the coefficient of determination (r^2). This option defaults to false.
threshold — This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any of them are greater than the threshold value, the threshold value is subtracted from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. This option defaults to true.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
Execute the Model
Create a vector autoregression model with three variables and four lags on each one. Collect metrics for the model execution by setting themetrics option to true.
The example uses the LAG_VECTORS function to generate lagged columns of matrices for the vector autoregression model.
SQL
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model after training, the number of arguments you specify must be equal to the number of lags. Each of those arguments must be a lag vector that contains lagged values for all model variables at a specific lag period. The first argument to the model function (lag_vector_1) represents lag 1, the second argument (lag_vector_2) represents lag 2, and so on. The LAG_VECTORS function automatically generates these lag vectors with the appropriate structure.
In this example, lag 1 represents values from the previous row when ordered by time t, which means the model predicts the current values based on historical patterns from 1, 2, 3, and 4 rows back in the time-ordered sequence.
SQL
sys.vector_autoregression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Polynomial Regression
Model Type:POLYNOMIAL REGRESSION
Polynomial regression means that there is one to many independent variables and one dependent variable, and that should be modeled in terms of an nth-degree polynomial of the independent variables.
The order option must be set to a positive integer value that specifies the degree of the polynomial to use. If you specify the option negativePowers and set it to true, then the model also includes terms with negative exponents, but still with the restriction that the sum of the absolute value of the exponents in a term will be less than or equal to the value specified on the order option. Regardless of whether you use the negativePowers option, the model computes a coefficient for every possible term that meets this restriction. When you use the negativePowers option, the model contains many more terms. For example, a quadratic model over two independent variables has six terms, but when you use the negativePowers option, the model has 13 terms.
The result set you specify as input to the model has N columns, which must all be numeric. The first N-1 columns are the independent variables (it can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least squares best fit of a sum of all possible combinations of terms where the degree is less than or equal to the value of the order option. For example, with two independent variables (x1 and x2) and order set to 2, the model is y = a1*x1^2 + a2*x2^2 + a3*x1*x2 + a4*x1 + a5*x2 + b.
Model Options
Required
order — This option is the degree of the polynomial and must be set to a positive integer.
Optional
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to false.
threshold — This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any of them are greater than the threshold value, the threshold value is subtracted from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
weighted — If you set this option to true, the model performs weighted least squares regression, where each sample has an associated weight. When weighted, there is an extra numeric column after the dependent variable that has the weight for the sample. This option defaults to false.
negativePowers — If you set this option to true, the model includes independent variables raised to negative powers. These variables are called Laurent polynomials. The model generates all possible terms such that the sum of the absolute value of the power of each term in each product is less than or equal to the order. For example, with two independent variables and the order set to 2, the model is: y = a1*x1^2 + a2*x1^-2 + a3*x2^2 + a4*x2^-2 + a5*x1*x2 + a6*x1^-1*x2 + a7*x1*x2^-1 + a8*x1^-1*x2^-1 + a9*x1 + a10*x1^-1 + a11*x2 + a12*x2^-1 + b.
gamma — If you specify this option, the value must be a matrix. The value represents a Tikhonov gamma matrix that is used for regularization. For details, see Tikhonov regularization.
yIntercept — If you set this option, then the option must be a numeric value. The system forces the specific y-intercept (i.e., the model value when x is zero).
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. This option defaults to true.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
Execute the Model
Create a polynomial regression model of degree 3. Collect metrics for the model execution by setting themetrics option to true.
SQL
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable.
SQL
sys.polynomial_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Linear Combination Regression
Model Type:LINEAR COMBINATION REGRESSION
Linear combination regression is a model built on top of m independent variables and a single dependent variable. But in this case, the function used to perform least-squares regression is a linear combination of functions that you specify. The general form is y = c0 + c1 * f1(x1, x2, ...) + f2(x1, x2, ...) + ....
The model determines the number of independent variables based on the number of columns in the SQL statement on which the model is based. There is always a column for the dependent variable, and there can be a weight column (see the weighted option). So, the number of independent variables is either one or two less than the number of columns in the result of the input SQL statement. The number of user-specified functions for the model must be specified by defining function1, function2, and so on (keys in the options dictionary). As long as consecutive function key names exist, the model includes these names. The model always includes a constant term. The value strings for the functionN keys must be specified in SQL syntax and should refer to x1, x2, ... for the model input independent variables.
The result set you specify as input to the model has N columns, which must all be numeric. The first N-1 columns are independent variables (it can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least squares best fit for a model of the form y = a1 * f1(x1, x2, ... xn) + a2 * f2(x1, x2, ... xn) + ... + an * fn(x1, x2, ... nx), where f1, f2, ..., fn are functions that are provided in a required option.
The system validates the user-specified functions during prediction time. If the function definition is invalid, the model throws an error during this time.
Model Options
Required
functionN — You must specify the first function (f1) using a key named 'function1'. Subsequent functions must use keys with names that use subsequent values of N. You must specify functions in SQL syntax and should use the variables x1, x2, ..., xn to refer to the 1st, 2nd, and nth independent variables, respectively. For example,'function1' -> 'sin(x1 * x2 + x3)', 'function2' -> 'cos(x1 * x3)'.
Optional
metrics — If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to false.
threshold — This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from the coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to the coefficients. For any coefficients between the negative and positive threshold values, the model sets the coefficients to zero.
weighted — If you set this option to true, the model performs weighted least squares regression, where each sample has an associated weight or importance. When weighted, there is an extra numeric column after the dependent variable that represents the weight of the sample. This option defaults to false.
gamma — If you specify this option, it must be a matrix. The value represents a Tikhonov gamma matrix used for regularization. For details, see Tikhonov regularization.
yIntercept — If you set this option, then the option must be a numeric value. The system forces the specific y-intercept (i.e., the model value when x is zero).
normalize — If you set this option to true, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales parameters so the persisted model operates in the original units. This option defaults to true.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
Execute the Model
Create a linear combination regression with two functions that exhibit a wave-like pattern.SQL
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable.
SQL
sys.linear_combination_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Nonlinear Regression
Model Type:NONLINEAR REGRESSION
Nonlinear regression finds the best-fit parameters of an arbitrary function using an arbitrary loss function. This model type essentially provides direct access to capabilities that both logistic regression and support vector machines rely on. The first N - 1 columns are numeric independent variables, and the last column is the numeric dependent variable.
If you specify a function or loss function with an invalid value, the model throws an error from the
SELECT statements at training time.Model Options
Required
numParameters — Specify this option as a positive integer. This value specifies the number of different parameters to optimize, i.e., how many different aN variables there are in the user-specified function.
function — Specify the name of the function to fit the data in SQL syntax. Use a1, a2, … to refer to the parameters for optimization. Use x1, x2, … to refer to the input features. The model does not allow some SQL functions. The model allows only scalar expressions that can be represented internally as postfix expressions. Most notably, the model does not allow some functions that are rewritten as CASE statements (like least() and greatest()). If your function is not allowed, the model displays an error message.
This model does not perform auto-scaling.
Optional
metrics — If you set this option to true, the model calculates the coefficient of determination (r^2), the adjusted r^2, and the root mean squared error (RMSE). However, the model calculates these quality metrics using the least squares loss function, and not the user-specified loss function, because these metrics really only make sense for least squares.
lossFunction — If you specify this option, this parameter dictates the nonlinear optimizer the loss function uses on a per-sample basis. Then, the actual loss function is the sum of this function applied to all samples. The model should use the variable y to refer to the dependent variable in the training data and the variable f to refer to the computed estimate for a specified sample. The default is least squares, which could be specified as (f-y)*(f-y).
minInitParamValue — A DOUBLE value setting the lower bound for random initial parameter values. The default value is -1.
maxInitParamValue — A DOUBLE value setting the upper bound for random initial parameter values. The default value is 1.
ridgeCoefficient — If you specify this option, the value must be a double data type. This option is the ridge coefficient for the loss function. The default behavior is that the function ignores this option, effectively setting this option to 0.0.
lassoCoefficient — If you specify this option, the value must be a double data type. This option is the lasso coefficient for the loss function. The default behavior is that the function ignores this option, effectively setting this option to 0.0.
samplesPerThread — If you specify this option, the value must be a positive integer. This parameter controls the target number of samples sent to each thread. Each thread independently computes a logistic regression model, and the models are all combined at the end. The option defaults to 1 million.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
finiteDifferenceH — A DOUBLE value representing the step size (h) for approximating gradients using the finite difference method. The model uses this value only if analytical gradients are not active.
This value should generally be a small positive value, typically from 0.0001 (1e-4) to 0.0000001 (1e-7).
If you do not specify this value, the default is 0.00001 (1e-5).
gradientClipThreshold — A DOUBLE value representing the absolute value threshold for any calculated gradient component to be clipped (i.e., capped at the threshold value with its original sign). This threshold helps prevent issues with exploding gradients in unstable loss landscapes.
Set this value to 0 or a negative value to disable gradient clipping.
If you do not specify this value, the default is ie6.
learningRate — A DOUBLE value representing the base learning rate for the Adam (Adaptive Moment Estimation) machine learning optimizer. Adam adapts this rate individually for each parameter during training. A common starting point for Adam is 0.001 (1e-3). Valid values must be positive and are generally in the range of 0.00001 (1e-5) to 0.01 (1e-2).
A higher learning rate (e.g., 0.01) can speed up training but can cause the optimizer to overshoot and miss optimal solutions, while a lower learning rate (e.g., 0.00001) ensures more stable and precise convergence but can make training much slower.
If you do not specify this option, the system automatically selects a learning rate and adjust it over the course of training using the 1Cycle learning schedule. Specifying a learning rate disables automatic adjustment and instead uses a fixed learning rate value.
miniBatchSize — An INTEGER value representing the number of data samples the model processes together before updating any model parameters. This value must be positive.
A higher value can speed up computation and efficiently use hardware, but it can require more memory, slow down learning, or reduce the ability of the model to generalize. A lower value allows for more frequent updates and can help the model generalize better, but it can make training noisier and less stable.
If you do not specify this value, the default is 16.
numEpochs — A positive INTEGER for the maximum number of training epochs (full passes over the data). The default is 200 for Adam optimization or 100 for Levenberg-Marquardt. Training can typically stop earlier than this threshold because of automatic early stopping.
loadBalance — If you set this option, the database appends the USING load_balance_shuffle = <value> clause to all intermediate SQL queries the model executes during training where value is the specified option value (true or false). The default value is unspecified. In this case, the database does not add this clause.
queryInternalParallelism — If you set this option, the database appends the USING parallelism = <value> clause to all intermediate SQL queries the model executes during training where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to false, the database deletes any intermediate tables created during model training. If you set this option to true, the database prevents the deletion of any intermediate tables created during model training. The default value is false.
Execute the Model
Create a nonlinear regression that fits five parametersa1, a2, a3, a4, a5 with two independent variables x1, x2.
SQL
sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When the model executes, pass N - 1 independent variables, and the model returns the estimate of the dependent variable.
SQL
sys.nonlinear_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in the System Catalog.
Regression Tree
Model Type:REGRESSION TREE
Regression Tree is a supervised machine learning algorithm that predicts continuous numerical values by constructing a tree-like model, similar to how Decision Tree models work for classification. For details, see Decision Tree (Classification).
A Regression Tree outputs continuous target values by learning simple decision rules inferred from the input features. The algorithm recursively splits the data set based on feature values that minimize prediction error (typically measured by mean squared or absolute error). Each internal node represents a decision based on a feature value, each branch represents the outcome of that decision, and each leaf node contains a predicted numerical value, typically the mean of the target values in that region.
The Regression Tree model is particularly effective for problems involving numerical prediction, such as price forecasting, demand estimation, or any scenario where the target variable is continuous rather than categorical.
Model Options
Optional
continuousFeatures — A set of INTEGER values specifying which input feature columns should be treated as continuous numeric variables for optimal tree splitting. The value must be a comma-separated list of feature column indexes (starting with 1) that correspond to numeric columns in your SELECT statement, excluding the target column. For example, 'continuousFeatures' -> '1,2' would treat the first and second columns of your SELECT statement as continuous. If not specified, all features are treated as categorical, which can impair results for numeric columns like prices, measurements, or scores.
distinctCountLimit — If you set this option, the value must be a positive integer. This value sets the limit for the number of distinct values a non-continuous feature and the label can contain. This option defaults to 256.
maxCellsToFetch — If you set this option, the value must be a positive integer. Controls the chunking behavior when fetching feature values during model training. The limit represents the maximum number of data cells (calculated as number of columns × number of rows) that can be fetched in a single operation, not a byte limit. When the expected data size exceeds this threshold, the algorithm switches to database-based processing using SQL queries instead of in-memory processing. This value defaults to 33,554,432 cells (calculated as 32 * 1024 * 1024).
maxDepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of the decision tree (the maximum number of features to split on). The default is unspecified, which means there is no maximum depth.
maxRows — If you set this option, the value must be a positive integer. This option limits the number of rows used for model training by creating a snapshot table with only the specified number of rows from the input query. This option cannot be used with noSnapshot -> true (attempting to set both results in an invalid argument error during model creation). When this option is unspecified, the model trains using all rows from the input query.
maxThreads — If you set this option, the value must be a positive integer. This value indicates the maximum number of parallel threads to use while the model trains. The default value is 2.
metrics — If you set this option to true, the model also calculates the percentage of samples correctly classified by the model and saves this information in a catalog table. This option defaults to false.
noSnapshot — If you set this option to true, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training. This option defaults to false. In this case, the database creates and uses the intermediate table. Setting this option to true is useful when the training set is fixed. If the training set is a table with modifications, set this option to false, as the decision tree trainer uses different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, then set this option to false because there is no guarantee that executing that query twice generates the same 100 rows each time.
numSplits — If you set this option, the value must be an integer greater than 1. This value sets the maximum number of binary branches a continuous feature can consider. The default value is 32.
skipLimitCheck — If you set this option to true, the model skips cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one that is specified by the distinctCountLimit option. This option defaults to false.
enableResplits — A Boolean value that determines if the tree can reuse the same continuous feature multiple times along a single branch (e.g., split on x1 < 7 and later x1 < 3). This can capture more complex, range-specific relationships. If unspecified, this value defaults to true, meaning continuous features remain available for additional splits after use, allowing the tree to create more complex decision boundaries. When set to false, the model marks continuous features as exhausted after their first use, and they cannot be used again in subsequent splits in the same tree.
resplitDepth — An INTEGER value that sets the maximum depth at which tree nodes can be re-split during optimization. Controls how deep the algorithm searches for better split points. If unspecified, this value defaults to 6.
resplitThreshold — A DECIMAL value that sets the minimum improvement threshold required to trigger a re-split operation. Lower values (e.g. 0.01) allow more aggressive re-splitting but can increase training time. Higher values (e.g. 1.0) require larger improvements to trigger re-splits. If unspecified, this value defaults to 0.1.
featureArray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is false.
queryInternalParallelism — If you set this option, the database appends the USING PARALLELISM = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.
skipDropTable — If you set this option to true, the database prevents the deletion of any intermediate tables that it creates during model training. This option defaults to false.
Execute the Model
This example creates a regression tree model using three input features (feature1, feature2, feature3) to predict a continuous target variable. The model builds a single tree structure that performs predictions by following a series of binary decisions based on feature values.
The model uses these options:
'maxDepth' -> '5'— Limits the tree to a maximum depth of 5 levels, controlling model complexity and preventing overfitting. A depth of 5 provides a good balance between model expressiveness and generalization capability.'enableResplits' -> 'true'— Allows the algorithm to reuse the same continuous feature multiple times along a single branch path with different threshold values.
SQL
SELECT query, passing in the feature columns as arguments.
SQL
sys.regression_tree_models system catalog table.
Gradient Boosted Trees
Model Type:GRADIENT BOOSTED TREES
Gradient Boosted Trees (GBT) is an ensemble machine learning algorithm that builds a sequence of decision trees where each new tree is trained to correct the prediction errors of the previous trees. Unlike the Random Forest model, which creates trees independently in parallel, GBT uses a sequential approach based on gradient descent optimization, where each tree learns to predict the residual errors (gradients) of the current ensemble. This iterative error-correction process allows the model to capture complex, non-linear patterns by progressively refining predictions through multiple weak learners.
The algorithm supports both regression and classification tasks through different loss functions (use the lossFunction option to toggle whether the model performs regression or classification tasks). For regression with squared error loss, trees directly predict residual errors, while classification with logistic loss maintains raw scores transformed through sigmoid or softmax functions.
Model Options
Required
numChildren — An INTEGER value representing the total number of trees to build sequentially. Each tree learns to correct the errors of the previous trees.
learningRate — A DECIMAL value between 0.0 and 1.0 that tunes how much the model learns from each successive child.
Optional
lossFunction — A string value that determines how the model calculates prediction errors and what type of problem it solves. Accepted values include:
'squared_error'— Configures the model for regression tasks. Calculates errors as the squared difference between predicted and actual values. Use this for predicting continuous numeric values (e.g., prices, temperatures, quantities). The target column must contain numeric values. This is the default value.'log_loss'— Configures the model for classification tasks. This model uses logistic loss to calculate prediction errors for probability-based predictions.
fractionSelected — A DECIMAL value greater than 0.0 and less than or equal to 1.0 that specifies what fraction of the training rows to randomly select for each boosting iteration. When you specify a value, the algorithm uses CEIL(fractionSelected * totalRows) per iteration. If you do not specify this value, the model uses all rows in their original order.
inputsPerChild — An INTEGER greater than or equal to 1 that specifies the number of input features each boosting tree should use. This value cannot exceed the number of input features available in the data set. When you specify this value, the algorithm deterministically cycles through pre-enumerated feature subsets to ensure each tree uses exactly this many features. When you do not specify this value, the model uses all available features for each tree.
maxDepth — A positive INTEGER value that represents the maximum allowable depth of the decision trees. The default value is 3.
maxThreads — A positive INTEGER value that sets the maximum number of parallel threads to use for training each child decision tree. Parallel threads do not affect the sequential method of training each tree. The default value is 16.
enableResplits — A Boolean value that determines if the tree can reuse the same continuous feature multiple times along a single branch (e.g., split on x1 < 7 and later x1 < 3). This action can capture more complex, range-specific relationships. If you do not specify this value, the default value is true, meaning continuous features remain available for additional splits after use, which allows the tree to create more complex decision boundaries. When you set this value to false, the model marks continuous features as exhausted after their first use, and the model cannot use these features again in subsequent splits in the same tree.
resplitDepth — An INTEGER value that sets the maximum depth at which tree nodes can be re-split during optimization. Controls how deep the algorithm searches for better split points. If you do not specify this value, the default value is 6.
resplitThreshold — A DECIMAL value that sets the minimum improvement threshold required to trigger a re-split operation. Lower values (e.g., 0.01) allow more aggressive re-splitting but can increase training time. Higher values (e.g., 1.0) require larger improvements to trigger re-splits. If you do not specify this value, the default value is 0.1.
maxCellsToFetch — An INTEGER value that determines the memory threshold to switch from training with system memory to training with SQL queries in the database. In-memory training is generally faster, but is limited by the available SQL Node memory. If the size of a training data subset exceeds this value, then the system performs training operations using SQL queries. The default value is 33,554,432 (calculated as 32 * 1024 * 1024).
metrics — A Boolean value. If you set this value to true, the system calculates and stores final model metrics (R²/RMSE for regression or Accuracy/LogLoss for classification) on the training data.
featureArray — A Boolean value. If you set this value to true, the model expects all input features to be in a single ARRAY column instead of separate columns.
continuousFeatures — A set of INTEGER values that specify the input feature columns that the model should treat as continuous numeric variables for optimal tree splitting. The value must be a comma-separated list of feature column indexes (starting with 1) that correspond to numeric columns in your SELECT SQL statement, excluding the target column. For example, 'continuousFeatures' -> '1,2' would treat the first and second columns of your SELECT statement as continuous. If you do not specify this value, the model treats all features as categorical, which can impair results for numeric columns like prices, measurements, or scores.
Execute the Model
This example creates a GBT model to predict house prices using a sequential ensemble approach. The model trains on four input features (square_feet, bedrooms, bathrooms, age_years) to predict the target variable (price), filtering out any records with missing price data to ensure clean training.
The model uses these options:
'numChildren' -> '100'— Builds a sequence of 100 regression trees. Each tree learns to correct the prediction errors made by all previous trees in the sequence. This relatively high number allows the model to capture complex patterns but requires careful tuning of the learning rate to prevent overfitting.'learningRate' -> '0.1'— Sets a conservative learning rate where each new tree contribution is scaled by 10 percent before being added to the ensemble. This moderate rate balances training speed with model stability.'lossFunction' -> 'squared_error'— Configures the model for regression by using squared error loss, which measures prediction accuracy as the squared difference between predicted and actual prices. This specification is the standard choice for continuous numeric predictions, ensuring the model optimizes for minimizing price prediction errors.
SQL
SELECT query, passing in the feature columns as arguments.
SQL
sys.gradient_boosted_trees_models system catalog table.

