SQL Reference > Machine Learning Model Functio...

# Regression Models
{{ocientml}} supports regression models that investigate the relationship between independent variables and at least one dependent variable or outcome. When trained, regression models can help establish the relationship between variables by estimating how one affects the other.

To create a model, use the `CREATE MLMODEL` syntax. For details, see docid\ ydyzvrbfpkxymmiv3bid1.

## Regression Model Common Options

The regression model types have many common options that provide similar functionality. This table describes the purpose of the more common options.

| Option | Notes |
| --- | --- |
| `metrics` | The model supports the `metrics` option for all regression types. If this option is set to `true`, the model also calculates the coefficient of determination (also known as the R^2 value), the root mean squared error (RMSE), and the adjusted R^2 value. If the option is not specified, the model defaults this option to `false`. |
| `threshold` | Most regression model types support the `threshold` option, which enables soft thresholding. This is the same as lasso regression for simple linear regression; however, it is not quite the same as lasso regression for other model types. If you specify a non-zero threshold, the model shifts coefficients toward zero by the amount of the threshold. If this shift causes a coefficient to switch from positive to negative or vice versa, the model sets the coefficient to zero. The system performs this coefficient adjustment as a final step before it saves the model coefficients. Any reported coefficient of determination is based on the model prior to adjustment; the adjustment always causes the coefficient of determination to be smaller. However, this approach is useful because it prevents overfitting. |
| `gamma` | Some regression model types support the `gamma` option. The value of this option is the Tikhonov square matrix in the form `{{a, b, c, ...}, {d, e, f, ...}, {g, h, i, ...}}`. If you do not specify this option, the model defaults the option to the zero matrix, which equates to multiple linear regression without regularization. You can perform ridge regression by using a gamma value. This is a bit complicated for the polynomial and linear combination regression model types: the system computes these model types by using multiple linear regression after applying the functions to be linearly combined to the input arguments. In these cases, the system applies ridge regression after applying the model functions to the model input values. |
| `weighted` | Some regression model types also support weighted least squares. The model types do not support generalized least squares, where the weight matrix can be non-diagonal. To use weighted least squares, add the option `weighted` and set it to `true`. Then, add one more column to the result set used to create the model. This new column must appear last, after the dependent variable, and it specifies the weight for that sample. |

Model option names are case-sensitive.

## Simple Linear Regression Model Type

Simple linear regression trains a new machine learning model of type `<model_type>` on the result set returned by the SQL SELECT statement. After the database creates the model, `<model_name>` becomes an executable function in SQL SELECT statements.

```sql
CREATE MLMODEL model_name TYPE model_type
ON ( select_statement )
[ OPTIONS( option_list ) ]
```

The result set used as input to the model must have two numeric columns. The first column is the independent variable (referred to as `x`). The second column is the dependent variable (referred to as `y`). The model finds the least-squares best fit for `y = ax + b`.

The simple linear regression model type also supports the `yintercept` option, which is the y-intercept of the resulting best-fit line. If you do not specify this option, the model does not force the y-intercept to be any particular value, and the model uses least squares to find the best value instead. If you force the y-intercept to be a particular value, the model uses least squares to find the best fit with that constraint.
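As an illustration of the fit just described, the closed-form least-squares solution for `y = ax + b` can be written in a few lines of Python. This is a hedged sketch, not Ocient's implementation; the optional fixed intercept mirrors the `yintercept` option.

```python
# Sketch of ordinary least squares for y = a*x + b.
# Illustrative only -- this is the textbook closed form that the
# simple linear regression model type is described as solving,
# not Ocient's internal implementation.

def fit_simple_linear(xs, ys, yintercept=None):
    """Return (a, b) minimizing sum((a*x + b - y)^2).

    If yintercept is given, b is forced to that value (mirroring
    the 'yintercept' option) and only the slope a is fit.
    """
    n = len(xs)
    if yintercept is not None:
        b = yintercept
        # Minimizing over a alone gives a = sum(x*(y - b)) / sum(x^2).
        a = sum(x * (y - b) for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return a, b
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Data on the exact line y = 2x + 10 is recovered exactly.
a, b = fit_simple_linear([1.0, 2.0, 3.0, 4.0], [12.0, 14.0, 16.0, 18.0])
```

With the constraint `yintercept=10.0` on the same data, the slope is still 2, because the unconstrained best fit already passes through that intercept.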
### Model Options

**Optional**

- `metrics` — If you set this option to `true`, the model collects quality metrics such as the coefficient of determination (R^2) and the root mean squared error (RMSE). This option defaults to `false`.
- `yintercept` — If you set this option, the option must be a numeric value. The system forces the specified y-intercept (i.e., the model value when `x` is zero).
- `threshold` — This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any are greater than the threshold value, the model subtracts the threshold value from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
- `normalize` — If you set this option to `true`, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales the parameters so the persisted model operates in the original units. This option defaults to `true`.
- `loadbalance` — If you set this option, the database appends the `USING LOAD BALANCE SHUFFLE = <value>` clause to all intermediate SQL queries the model executes during training, where `value` is the specified option value (`true` or `false`). The default value is unspecified; in this case, the database does not add this clause.
- `queryinternalparallelism` — If you set this option, the database appends the `USING PARALLELISM = <value>` clause to all intermediate SQL queries the model executes during training, where `value` is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.
- `skipdroptable` — If `false`, the database deletes any intermediate tables created during model training. If `true`, the database prevents the deletion of any intermediate tables created during model training. This option defaults to `false`.

### Execute the Model

Create a simple linear regression model with y-intercept 10. Collect metrics for the model execution by setting the `metrics` option to `true`.

```sql
CREATE MLMODEL my_model TYPE simple_linear_regression
ON ( SELECT x1, y FROM public.my_table )
OPTIONS( 'yintercept' -> '10', 'metrics' -> 'true' );
```

After you create the model, you can see its details by querying the `sys.machine_learning_models` and `sys.machine_learning_model_options` system catalog tables.

When you execute the model after training, the model takes a single numeric argument that represents `x` and returns `ax + b`.

```sql
SELECT my_model(col1) FROM my_table;
```

After you execute a model, you can find the details about the execution results in the `sys.simple_linear_regression_models` system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog.

## Multiple Linear Regression Model Type

Multiple linear regression means that there is a vector of independent variables, and the dependent variable is a scalar-valued function of the vector input that is linear in all vector components.

The model uses the result set as an input. The result set has `n` columns that must be numeric. The first `n-1` columns are the independent variables (together, they can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least-squares best fit for `y = a1*x1 + a2*x2 + ... + b`, or in vector notation, `y = a·x + b`, where `a` and `x` are vectors and the multiplication is a dot product.

### Model Options

**Optional**

- `metrics` — If you set this option to `true`, the model collects quality metrics such as the coefficient of determination (R^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to `false`.
- `threshold` — This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
- `weighted` — If you set this option to `true`, the model performs weighted least-squares regression, where each sample has a weight or importance associated with it. In this case, there is an extra numeric column after the dependent variable that has the weight for the sample. This option defaults to `false`.
- `gamma` — If you set this option, the option must be a matrix. The value represents a Tikhonov gamma matrix used for regularization. For details, see https://en.wikipedia.org/wiki/Tikhonov_regularization. The model uses this option for ridge regression.
- `yintercept` — If you set this option, the option must be a numeric value. The system forces the specified y-intercept (i.e., the model value when `x` is zero).
- `normalize` — If you set this option to `true`, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales the parameters so the persisted model operates in the original units. This option defaults to `true`.
- `featurearray` — If you set this option to `true`, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.
- `loadbalance` — If you set this option, the database appends the `USING LOAD BALANCE SHUFFLE = <value>` clause to all intermediate SQL queries the model executes during training, where `value` is the specified option value (`true` or `false`). The default value is unspecified; in this case, the database does not add this clause.
- `queryinternalparallelism` — If you set this option, the database appends the `USING PARALLELISM = <value>` clause to all intermediate SQL queries the model executes during training, where `value` is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.
- `skipdroptable` — If `false`, the database deletes any intermediate tables created during model training. If `true`, the database prevents the deletion of any intermediate tables created during model training. This option defaults to `false`.

### Execute the Model

Create a multiple linear regression model. Collect metrics for the model execution by setting the `metrics` option to `true`.

```sql
CREATE MLMODEL my_model TYPE multiple_linear_regression
ON ( SELECT x1, x2, x3, y FROM public.my_table )
OPTIONS( 'metrics' -> 'true' );
```

After you create the model, you can see its details by querying the `sys.machine_learning_models` and `sys.machine_learning_model_options` system catalog tables.

When you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable.

```sql
SELECT my_model(col1, col2, col3) FROM my_table;
```

After you execute a model, you can find the details about the execution results in the `sys.multiple_linear_regression_models` system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog.
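The `weighted` option described above can be made concrete with a small sketch. This is illustrative Python for the single-predictor case, not Ocient's multivariate solver; the weight list plays the role of the extra final column in the training result set.

```python
# Sketch of weighted least squares for one predictor: minimize
# sum(w * (a*x + b - y)^2). The weights mirror the 'weighted' option,
# where an extra final column supplies each sample's weight.
# Illustrative only; the database solves the general multivariate case.

def fit_weighted(xs, ys, ws):
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw   # weighted mean of x
    my = sum(w * y for w, y in zip(ws, ys)) / sw   # weighted mean of y
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b

# Data on the exact line y = 3x + 1: any positive weights recover (3, 1),
# because the weighted residuals are all zero at that fit.
a, b = fit_weighted([0.0, 1.0, 2.0, 3.0], [1.0, 4.0, 7.0, 10.0], [1.0, 2.0, 1.0, 5.0])
```

On noisy data, changing the weights shifts the fit toward the heavily weighted samples, which is the practical effect of the extra weight column.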
## Vector Autoregression Model Type

Vector autoregression is a model that estimates the next value of multiple variables based on some number of lags of all variables, as a group. For example, the model tries to build the estimate `<x1(t), x2(t)>` based on `x1(t-1)`, `x2(t-1)`, `x1(t-2)`, and `x2(t-2)`. In this example, `x1(t)` means the value of `x1` at time `t`, and `x1(t-1)` means the value of `x1` at time `t-1` (typically the previous sample time). The syntax `<x1(t), x2(t)>` demonstrates that the result of the model is a row vector that contains all the predictions of the model, and that all predictions rely on all lags of all variables.

When you create a model, the input result set must have one more column than the number of lags. Each column must be a row vector of a size equal to the number of variables. The first column is the un-lagged values, e.g., `{{x1, x2, x3}}`. The second column is the first lag for all variables, e.g., `{{x1_lag1, x2_lag1, x3_lag1}}`. For functions to help organize data before training a model, see docid\ vrwcjklivsomyas4ude8b.

### Model Options

**Required**

- `numvariables` — Specify this option as a positive integer for the number of variables in the model.
- `numlags` — Specify this option as a positive integer for the number of lags in the model.

**Optional**

- `metrics` — If you set this option to `true`, the function collects the metric for the coefficient of determination (R^2). This option defaults to `false`.
- `threshold` — This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any of them are greater than the threshold value, the model subtracts the threshold value from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
- `normalize` — If you set this option to `true`, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales the parameters so the persisted model operates in the original units. This option defaults to `true`.
- `loadbalance` — If you set this option, the database appends the `USING LOAD BALANCE SHUFFLE = <value>` clause to all intermediate SQL queries the model executes during training, where `value` is the specified option value (`true` or `false`). The default value is unspecified; in this case, the database does not add this clause.
- `queryinternalparallelism` — If you set this option, the database appends the `USING PARALLELISM = <value>` clause to all intermediate SQL queries the model executes during training, where `value` is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.
- `skipdroptable` — If `false`, the database deletes any intermediate tables created during model training. If `true`, the database prevents the deletion of any intermediate tables created during model training. This option defaults to `false`.

### Execute the Model

Create a vector autoregression model with three variables and four lags on each one. Collect metrics for the model execution by setting the `metrics` option to `true`. The example uses the docid\ vrwcjklivsomyas4ude8b function to generate lagged columns of matrixes for the vector autoregression model.

```sql
CREATE MLMODEL my_model TYPE vector_autoregression
ON ( SELECT {{x1, x2, x3}}, LAG_VECTORS(x1, x2, x3, 1, 4) OVER(ORDER BY t)
     FROM public.my_table )
OPTIONS( 'metrics' -> 'true', 'numvariables' -> '3', 'numlags' -> '4' );
```

After you create the model, you can see its details by querying the `sys.machine_learning_models` and `sys.machine_learning_model_options` system catalog tables.

When you execute the model after training, the number of arguments you specify must be equal to the number of lags. Each of those arguments must be a lag vector that contains lagged values for all model variables at a specific lag period. The first argument to the model function (`lag_vector_1`) represents lag 1, the second argument (`lag_vector_2`) represents lag 2, and so on. The `LAG_VECTORS` function automatically generates these lag vectors with the appropriate structure.
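To make the lag-vector mechanics concrete, here is a sketch of a one-step prediction for a two-variable, two-lag model of the form `x(t) = A1·x(t-1) + A2·x(t-2) + c`. The coefficient matrices and intercept are invented for illustration; they do not come from any trained model, and Ocient's internal representation may differ.

```python
# Sketch of a one-step vector autoregression prediction with
# 2 variables and 2 lags: x(t) = A1 @ x(t-1) + A2 @ x(t-2) + c.
# A1, A2, and c below are made-up illustration values.

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def var_predict(coef_matrices, lag_vectors, intercept):
    """coef_matrices[k] multiplies lag_vectors[k] (lag k+1)."""
    out = list(intercept)
    for a, lag in zip(coef_matrices, lag_vectors):
        prod = matvec(a, lag)
        out = [o + p for o, p in zip(out, prod)]
    return out

A1 = [[0.5, 0.1], [0.0, 0.4]]   # coefficients on lag 1 (illustrative)
A2 = [[0.2, 0.0], [0.1, 0.3]]   # coefficients on lag 2 (illustrative)
c = [1.0, -1.0]                 # intercept vector (illustrative)

# lag_vector_1 = values one row back, lag_vector_2 = two rows back.
prediction = var_predict([A1, A2], [[2.0, 1.0], [1.0, 2.0]], c)
```

The returned row vector plays the role of `<x1(t), x2(t)>`: every predicted component depends on every lag of every variable.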
In this example, lag 1 represents values from the previous row when ordered by time `t`, which means the model predicts the current values based on historical patterns from 1, 2, 3, and 4 rows back in the time-ordered sequence.

```sql
SELECT my_model(lag_vector_1, lag_vector_2, lag_vector_3, lag_vector_4) AS predicted,
       {{x1, x2, x3}} AS actual
FROM ( SELECT x1, x2, x3, LAG_VECTORS(x1, x2, x3, 1, 4) OVER(ORDER BY t)
       FROM my_table );
```

After you execute a model, you can find the details about the execution results in the `sys.vector_autoregression_models` system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog.

## Polynomial Regression Model Type

Polynomial regression means that there are one to many independent variables and one dependent variable, and that the dependent variable should be modeled in terms of an nth-degree polynomial of the independent variables. The `order` option must be set to a positive integer value that specifies the degree of the polynomial to use.

If you specify the option `negativepowers` and set it to `true`, the model also includes terms with negative exponents, but still with the restriction that the sum of the absolute values of the exponents in a term must be less than or equal to the value specified in the `order` option. Regardless of whether you use the `negativepowers` option, the model computes a coefficient for every possible term that meets this restriction. When you use the `negativepowers` option, the model contains many more terms. For example, a quadratic model over two independent variables has six terms, but when you use the `negativepowers` option, the model has 13 terms.

The result set you specify as input to the model has `n` columns, which must all be numeric. The first `n-1` columns are the independent variables (together, they can be considered a single independent variable that is a vector). The last column is the dependent variable.
The model finds the least-squares best fit of a sum of all possible combinations of terms where the degree is less than or equal to the value of the `order` option. For example, with two independent variables (`x1` and `x2`) and `order` set to 2, the model is `y = a1*x1^2 + a2*x2^2 + a3*x1*x2 + a4*x1 + a5*x2 + b`.

### Model Options

**Required**

- `order` — This option is the degree of the polynomial and must be set to a positive integer.

**Optional**

- `metrics` — If you set this option to `true`, the model collects quality metrics such as the coefficient of determination (R^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to `false`.
- `threshold` — This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any of them are greater than the threshold value, the model subtracts the threshold value from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
- `weighted` — If you set this option to `true`, the model performs weighted least-squares regression, where each sample has an associated weight. When weighted, there is an extra numeric column after the dependent variable that has the weight for the sample. This option defaults to `false`.
- `negativepowers` — If you set this option to `true`, the model includes independent variables raised to negative powers. Such polynomials are called Laurent polynomials. The model generates all possible terms such that the sum of the absolute values of the powers in each product is less than or equal to the order. For example, with two independent variables and the order set to 2, the model is `y = a1*x1^2 + a2*x1^-2 + a3*x2^2 + a4*x2^-2 + a5*x1*x2 + a6*x1^-1*x2 + a7*x1*x2^-1 + a8*x1^-1*x2^-1 + a9*x1 + a10*x1^-1 + a11*x2 + a12*x2^-1 + b`.
- `gamma` — If you specify this option, the value must be a matrix. The value represents a Tikhonov gamma matrix that is used for regularization. For details, see https://en.wikipedia.org/wiki/Tikhonov_regularization.
- `yintercept` — If you set this option, the option must be a numeric value. The system forces the specified y-intercept (i.e., the model value when `x` is zero).
- `normalize` — If you set this option to `true`, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales the parameters so the persisted model operates in the original units. This option defaults to `true`.
- `featurearray` — If you set this option to `true`, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.
- `loadbalance` — If you set this option, the database appends the `USING LOAD BALANCE SHUFFLE = <value>` clause to all intermediate SQL queries the model executes during training, where `value` is the specified option value (`true` or `false`). The default value is unspecified; in this case, the database does not add this clause.
- `queryinternalparallelism` — If you set this option, the database appends the `USING PARALLELISM = <value>` clause to all intermediate SQL queries the model executes during training, where `value` is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.
- `skipdroptable` — If `false`, the database deletes any intermediate tables created during model training. If `true`, the database prevents the deletion of any intermediate tables created during model training. This option defaults to `false`.
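The soft-thresholding adjustment that the `threshold` option applies (shift each coefficient toward zero by the threshold; coefficients that would cross zero become zero) can be sketched directly. This is illustrative Python only; the database applies the equivalent adjustment as a final step before saving model coefficients.

```python
# Sketch of the soft-thresholding rule behind the 'threshold' option.
# Coefficients with |c| <= threshold would change sign when shifted,
# so they are set to zero; all others move toward zero by the threshold.

def soft_threshold(coefficients, threshold):
    adjusted = []
    for c in coefficients:
        if c > threshold:
            adjusted.append(c - threshold)
        elif c < -threshold:
            adjusted.append(c + threshold)
        else:
            adjusted.append(0.0)
    return adjusted

# With threshold 0.5: 2.5 -> 2.0, -1.2 -> -0.7, and 0.3 -> 0.0.
result = soft_threshold([2.5, -1.2, 0.3], 0.5)
```

Zeroing small coefficients is what makes the option useful against overfitting: terms with little support drop out of the saved model entirely.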
### Execute the Model

Create a polynomial regression model of degree 3. Collect metrics for the model execution by setting the `metrics` option to `true`.

```sql
CREATE MLMODEL my_model TYPE polynomial_regression
ON ( SELECT x1, x2, x3, y FROM public.my_table )
OPTIONS( 'order' -> '3', 'metrics' -> 'true' );
```

After you create the model, you can see its details by querying the `sys.machine_learning_models` and `sys.machine_learning_model_options` system catalog tables.

When you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable.

```sql
SELECT my_model(col1, col2, col3) FROM my_table;
```

After you execute a model, you can find the details about the execution results in the `sys.polynomial_regression_models` system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog.

## Linear Combination Regression Model Type

Linear combination regression is a model built on top of `m` independent variables and a single dependent variable, but in this case, the function used to perform least-squares regression is a linear combination of functions that you specify. The general form is `y = c0 + c1*f1(x1, x2, ...) + c2*f2(x1, x2, ...) + ...`.

The model determines the number of independent variables based on the number of columns in the SQL statement on which the model is based. There is always a column for the dependent variable, and there can be a weight column (see the `weighted` option). So, the number of independent variables is either one or two less than the number of columns in the result of the input SQL statement.

The number of user-specified functions for the model must be specified by defining `function1`, `function2`, and so on (keys in the options dictionary). As long as consecutive function key names exist, the model includes them. The model always includes a constant term.
The value strings for the `functionN` keys must be specified in SQL syntax and should refer to `x1, x2, ...` for the model input independent variables.

The result set you specify as input to the model has `n` columns, which must all be numeric. The first `n-1` columns are independent variables (together, they can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least-squares best fit for a model of the form `y = a1*f1(x1, x2, ..., xn) + a2*f2(x1, x2, ..., xn) + ... + an*fn(x1, x2, ..., xn)`, where `f1, f2, ..., fn` are functions that are provided in a required option.

The system validates the user-specified functions at prediction time. If a function definition is invalid, the model throws an error at that time.

### Model Options

**Required**

- `functionN` — You must specify the first function (`f1`) using a key named `'function1'`. Subsequent functions must use keys with names that use subsequent values of `N`. You must specify functions in SQL syntax and should use the variables `x1, x2, ..., xn` to refer to the 1st, 2nd, and nth independent variables, respectively. For example, `'function1' -> 'sin(x1 * x2 + x3)', 'function2' -> 'cos(x1 * x3)'`.

**Optional**

- `metrics` — If you set this option to `true`, the model collects quality metrics such as the coefficient of determination (R^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to `false`.
- `threshold` — This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from them. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients between the negative and positive threshold values, the model sets those coefficients to zero.
- `weighted` — If you set this option to `true`, the model performs weighted least-squares regression, where each sample has an associated weight or importance. When weighted, there is an extra numeric column after the dependent variable that represents the weight of the sample. This option defaults to `false`.
- `gamma` — If you specify this option, it must be a matrix. The value represents a Tikhonov gamma matrix used for regularization. For details, see https://en.wikipedia.org/wiki/Tikhonov_regularization.
- `yintercept` — If you set this option, the option must be a numeric value. The system forces the specified y-intercept (i.e., the model value when `x` is zero).
- `normalize` — If you set this option to `true`, the model uses auto-scaling to compute the mean and standard deviation of each input feature to normalize data during training, making training more numerically stable. The model then unscales the parameters so the persisted model operates in the original units. This option defaults to `true`.
- `featurearray` — If you set this option to `true`, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.
- `loadbalance` — If you set this option, the database appends the `USING LOAD BALANCE SHUFFLE = <value>` clause to all intermediate SQL queries the model executes during training, where `value` is the specified option value (`true` or `false`). The default value is unspecified; in this case, the database does not add this clause.
- `queryinternalparallelism` — If you set this option, the database appends the `USING PARALLELISM = <value>` clause to all intermediate SQL queries the model executes during training, where `value` is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.
- `skipdroptable` — If `false`, the database deletes any intermediate tables created during model training. If `true`, the database prevents the deletion of any intermediate tables created during model training. This option defaults to `false`.
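Conceptually, linear combination regression applies the user-specified functions to each input row and then solves an ordinary linear least-squares problem over the transformed features. The sketch below mirrors that reduction in Python, using functions shaped like the `'function1'`/`'function2'` example; the tiny normal-equations solver is illustrative, not Ocient's implementation.

```python
# Sketch of how linear combination regression reduces to multiple
# linear regression: transform each row with the user functions,
# append a constant term, then solve ordinary least squares.
import math

functions = [
    lambda x1, x2, x3: math.sin(x1 * x2 + x3),   # plays the role of 'function1'
    lambda x1, x2, x3: math.cos(x1 * x3),        # plays the role of 'function2'
]

def transform(rows):
    """Map each (x1, x2, x3) row to (f1(row), f2(row), 1.0)."""
    return [[f(*r) for f in functions] + [1.0] for r in rows]

def solve(a, b):
    """Gauss-Jordan elimination for a small square system a @ x = b."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, ys):
    """Least squares via normal equations on the transformed features."""
    x = transform(rows)
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, ys)) for i in range(k)]
    return solve(xtx, xty)

rows = [(0.1, 0.2, 0.3), (1.0, 0.5, 0.2), (0.4, 0.9, 1.1),
        (1.3, 0.7, 0.6), (0.8, 1.2, 0.4), (0.2, 1.5, 0.9)]
# Data generated from y = 2*f1 + 3*f2 - 1; the fit recovers [2, 3, -1].
ys = [2.0 * functions[0](*r) + 3.0 * functions[1](*r) - 1.0 for r in rows]
coefficients = fit(rows, ys)
```

Because the regression itself stays linear in the coefficients, options such as `gamma` (ridge) and `threshold` apply after this transformation, exactly as described for the common options.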
### Execute the Model

Create a linear combination regression model with two functions that exhibit a wave-like pattern.

```sql
CREATE MLMODEL my_model TYPE linear_combination_regression
ON ( SELECT x1, x2, x3, y1 FROM public.my_table )
OPTIONS( 'function1' -> 'sin(x1 * x2 + x3)', 'function2' -> 'cos(x1 * x3)' );
```

After you create the model, you can see its details by querying the `sys.machine_learning_models` and `sys.machine_learning_model_options` system catalog tables.

When you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable.

```sql
SELECT my_model(col1, col2, col3) FROM my_table;
```

After you execute a model, you can find the details about the execution results in the `sys.linear_combination_regression_models` system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog.

## Nonlinear Regression Model Type

Nonlinear regression finds the best-fit parameters of an arbitrary function using an arbitrary loss function. This model type essentially provides direct access to capabilities that both logistic regression and support vector machines rely on. The first `n-1` columns are numeric independent variables, and the last column is the numeric dependent variable. If you specify a function or loss function with an invalid value, the model throws an error from the SELECT statements at training time.

### Model Options

**Required**

- `numparameters` — Specify this option as a positive integer. This value specifies the number of different parameters to optimize, i.e., how many different `aN` variables there are in the user-specified function.
- `function` — Specify the function to fit to the data in SQL syntax. Use `a1, a2, ...` to refer to the parameters for optimization. Use `x1, x2, ...` to refer to the input features. The model does not allow some SQL functions: it allows only scalar expressions that can be represented internally as postfix expressions. Most notably, the model does not allow some functions that are rewritten as CASE statements (like `LEAST()` and `GREATEST()`). If your function is not allowed, the model displays an error message. This model does not perform auto-scaling.

**Optional**

- `metrics` — If you set this option to `true`, the model calculates the coefficient of determination (R^2), the adjusted R^2, and the root mean squared error (RMSE). However, the model calculates these quality metrics using the least-squares loss function, not the user-specified loss function, because these metrics only really make sense for least squares.
- `lossfunction` — If you specify this option, this parameter dictates the loss function the nonlinear optimizer uses on a per-sample basis. The actual loss function is then the sum of this function applied to all samples. Use the variable `y` to refer to the dependent variable in the training data and the variable `f` to refer to the computed estimate for a specified sample. The default is least squares, which could be specified as `(f - y) * (f - y)`.
- `mininitparamvalue` — If you specify this option, the value must be a floating-point number. This option sets the minimum for initial parameter values in the optimization algorithm. It defaults to -10.
- `maxinitparamvalue` — If you specify this option, the value must be a floating-point number. This option sets the maximum for initial parameter values in the optimization algorithm. It defaults to 10.
- `ridgecoefficient` — If you specify this option, the value must be a DOUBLE data type. This option is the ridge coefficient for the loss function. The default behavior is that the function ignores this option, effectively setting it to 0.0.
- `lassocoefficient` — If you specify this option, the value must be a DOUBLE data type. This option is the lasso coefficient for the loss function. The default behavior is that the function ignores this option, effectively setting it to 0.0.
the default behavior is that the function ignores this option, effectively setting this option to 0 0 samplesperthread — if you specify this option, the value must be a positive integer this parameter controls the target number of samples sent to each thread each thread independently computes a logistic regression model, and the models are all combined at the end the option defaults to 1 million featurearray — if you set this option to true , the model expects only one array type column as input instead of multiple columns of training data each array row in the input column must be of the same size regardless of whether you use this option, the model treats training data the same with labeling and weight scoring the default value is false finitedifferenceh — a double value representing the step size (h) for approximating gradients using the finite difference method the model uses this value only if analytical gradients are not active this value should generally be a small positive value, typically from 0 0001 (1e 4) to 0 0000001 (1e 7) if you do not specify this value, the default is 0 00001 (1e 5) gradientclipthreshold — a double value representing the absolute value threshold for any calculated gradient component to be clipped (i e , capped at the threshold value with its original sign) this threshold helps prevent issues with exploding gradients in unstable loss landscapes set this value to 0 or a negative value to disable gradient clipping if you do not specify this value, the default is 50 learningrate — a double value representing the base learning rate for the adam (adaptive moment estimation) machine learning optimizer adam adapts this rate individually for each parameter during training a common starting point for adam is 0 001 (1e 3) valid values must be positive and are generally in the range of 0 00001 (1e 5) to 0 01 (1e 2) a higher learning rate (e g , 0 01) can speed up training but can cause the optimizer to overshoot and miss optimal solutions, 
while a lower learning rate (e g , 0 00001) ensures more stable and precise convergence but can make training much slower if you do not specify this option, the system automatically selects a learning rate and adjust it over the course of training using the 1cycle learning schedule specifying a learning rate disables automatic adjustment and instead uses a fixed learning rate value minibatchsize — an integer value representing the number of data samples the model processes together before updating any model parameters this value must be positive a higher value can speed up computation and efficiently use hardware, but it can require more memory, slow down learning, or reduce the ability of the model to generalize a lower value allows for more frequent updates and can help the model generalize better, but it can make training noisier and less stable if you do not specify this value, the default is 16 numepochs — an integer value representing the maximum number of epochs, or full passes during training through the entire data set this value must be positive if you do not specify this value, the default is 128 loadbalance — if you set this option , the database appends the using load balance shuffle = \<value> clause to all intermediate sql queries the model executes during training where value is the specified option value ( true or false ) the default value is unspecified in this case, the database does not add this clause queryinternalparallelism — if you set this option , the database appends the using parallelism = \<value> clause to all intermediate sql queries the model executes during training where value is the specified positive integer value the default value is unspecified in this case, the database does not add this clause skipdroptable — if false , the database deletes any intermediate tables created during model training if true , the database prevents the deletion of any intermediate tables created during model training this option defaults to false 
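The finitedifferenceh and gradientclipthreshold options correspond to two standard optimizer techniques: central finite differences for gradient approximation and sign-preserving gradient clipping. The following sketch illustrates those general techniques in plain Python; it is a hypothetical illustration, not {{ocientml}}'s implementation, and all function names are invented for the example.

```python
def finite_difference_gradient(loss, params, h=1e-5):
    """Central-difference gradient approximation with step size h,
    the role played by the finitedifferenceh option (default 1e-5)."""
    grad = []
    for i in range(len(params)):
        up, down = params.copy(), params.copy()
        up[i] += h
        down[i] -= h
        grad.append((loss(up) - loss(down)) / (2 * h))
    return grad

def clip_gradient(grad, threshold=50.0):
    """Cap each gradient component at the threshold while keeping its sign,
    the role of gradientclipthreshold (default 50); <= 0 disables clipping."""
    if threshold <= 0:
        return grad
    return [max(-threshold, min(threshold, g)) for g in grad]

# Per-sample least-squares loss (f - y) * (f - y) for a toy model
# f = a1 * x + a2, evaluated at the single sample x = 2, y = 5.
def loss(params):
    f = params[0] * 2.0 + params[1]
    return (f - 5.0) * (f - 5.0)

grad = finite_difference_gradient(loss, [0.0, 0.0])   # ≈ [-20.0, -10.0]
clipped = clip_gradient(grad, threshold=8.0)          # → [-8.0, -8.0]
```

With a stricter threshold (8.0 here, versus the default of 50), both components are capped while keeping their negative sign, which is the exploding-gradient safeguard the option provides.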
Execute the Model

Create a nonlinear regression that fits five parameters a1, a2, a3, a4, a5 with two independent variables x1, x2.

CREATE MLMODEL my_model TYPE nonlinear_regression ON (SELECT x1, x2, y1 FROM public.my_table) OPTIONS('numparameters' -> '5', 'function' -> 'a1 * sin(a2 * x1 + a3) + a4 + a5 * x2');

After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.

When the model executes, pass the n - 1 independent variables, and the model returns the estimate of the dependent variable.

SELECT my_model(x1, x2) FROM my_table;

After you execute a model, you can find the details about the execution results in the sys.nonlinear_regression_models system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section at https://docs.ocient.com/system-catalog.

Regression Tree Model

Type: regression_tree

Regression tree is a supervised machine learning algorithm that predicts continuous numerical values by constructing a tree-like model, similar to how decision tree models work for classification. For details, see docid\ oz7w9viouqc4ledje98yp.

A regression tree outputs continuous target values by learning simple decision rules inferred from the input features. The algorithm recursively splits the data set based on feature values that minimize prediction error (typically measured by mean squared or absolute error). Each internal node represents a decision based on a feature value, each branch represents the outcome of that decision, and each leaf node contains a predicted numerical value, typically the mean of the target values in that region. The regression tree model is particularly effective for problems involving numerical prediction, such as price forecasting, demand estimation, or any scenario where the target variable is continuous rather than categorical.

Model Options

Optional

continuousfeatures — A set of integer values specifying which input feature columns should be treated as continuous numeric variables for optimal tree splitting. The value must be a comma-separated list of feature column indexes (starting with 1) that correspond to numeric columns in your SELECT statement, excluding the target column. For example, 'continuousfeatures' -> '1,2' would treat the first and second columns of your SELECT statement as continuous. If not specified, all features are treated as categorical, which can impair results for numeric columns like prices, measurements, or scores.

distinctcountlimit — If you set this option, the value must be a positive integer. This value sets the limit for the number of distinct values a non-continuous feature and the label can contain. This option defaults to 256.

doprune — If you set this option to true, the model uses pessimistic error pruning (PEP) to prune the tree after training. This option defaults to false.

maxcellstofetch — If you set this option, the value must be a positive integer. This option controls the chunking behavior when fetching feature values during model training. The limit represents the maximum number of data cells (calculated as number of columns × number of rows) that can be fetched in a single operation, not a byte limit. When the expected data size exceeds this threshold, the algorithm switches to database-based processing using SQL queries instead of in-memory processing. This value defaults to 33,554,432 cells (calculated as 32 × 1024 × 1024).

maxdepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of the decision tree (the maximum number of features to split on). The default is unspecified, which means there is no maximum depth.

maxrows — If you set this option, the value must be a positive integer. This option limits the number of rows used for model training by creating a snapshot table with only the specified number of rows from the input query. This option cannot be used with nosnapshot -> true (attempting to set both results in an invalid-argument error during model creation). When this option is unspecified, the model trains using all rows from the input query.

maxthreads — If you set this option, the value must be a positive integer. This value indicates the maximum number of parallel threads to use while the model trains. The default value is 1.

metrics — If you set this option to true, the model also calculates the percentage of samples correctly classified by the model and saves this information in a catalog table. This option defaults to false.

nosnapshot — If you set this option to true, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training. This option defaults to false; in this case, the database creates and uses the intermediate table. Setting this option to true is useful when the training set is fixed. If the training set is a table with modifications, set this option to false, as the decision tree trainer uses different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, then set this option to false, because there is no guarantee that executing that query twice generates the same 100 rows each time.

numsplits — If you set this option, the value must be an integer greater than 1. This value sets the maximum number of binary branches a continuous feature can consider. The default value is 32.

skiplimitcheck — If you set this option to true, the model skips cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one specified by the distinctcountlimit option. This option defaults to false.

weighted — If you set this option, the model considers weights for labels. If you set this option value to true, you must specify an additional column as a DOUBLE in the training data for label weights. Rows with the same labels must have the same weights.
If you set this value to auto, the model calculates weights automatically by weighting each label according to the ratio of the count of the most frequent label to the count of the specified label. As a result, the most frequent label has a weight of 1.0, and the other label weights are higher. This option defaults to false, which means all labels have equal weight.

splitmetric — Controls which function is used to evaluate the quality of a split during tree construction. Supported options are:

gini impurity (default) — Measures impurity based on class distributions.
macro f1 — Uses the macro-averaged F1 score to guide splits.
micro f1 — Uses the micro-averaged F1 score to guide splits.
weighted f1 — Uses the class frequency–weighted F1 score to guide splits.

enableresplits — A boolean value that determines if the tree can reuse the same continuous feature multiple times along a single branch (e.g., split on x1 < 7 and later x1 < 3). This can capture more complex, range-specific relationships. If unspecified, this value defaults to true, meaning continuous features remain available for additional splits after use, allowing the tree to create more complex decision boundaries. When set to false, the model marks continuous features as exhausted after their first use, and they cannot be used again in subsequent splits in the same tree.

resplitdepth — An integer value that sets the maximum depth at which tree nodes can be re-split during optimization. This option controls how deep the algorithm searches for better split points. If unspecified, this value defaults to 6.

resplitthreshold — A decimal value that sets the minimum improvement threshold required to trigger a re-split operation. Lower values (e.g., 0.01) allow more aggressive re-splitting but can increase training time. Higher values (e.g., 1.0) require larger improvements to trigger re-splits. If unspecified, this value defaults to 0.1.

featurearray — If you set this option to true, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be the same size. Regardless of whether you use this option, the model treats training data the same with respect to labeling and weight scoring. The default value is false.

queryinternalparallelism — If you set this option, the database appends the USING PARALLELISM = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.

skipdroptable — If you set this option to true, the database prevents the deletion of any intermediate tables that it creates during model training. This option defaults to false.

Execute the Model

This example creates a regression tree model using three input features (feature1, feature2, feature3) to predict a continuous target variable. The model builds a single tree structure that performs predictions by following a series of binary decisions based on feature values. The model uses these options:

'maxdepth' -> '5' — Limits the tree to a maximum depth of 5 levels, controlling model complexity and preventing overfitting. A depth of 5 provides a good balance between model expressiveness and generalization capability.

'enableresplits' -> 'true' — Allows the algorithm to reuse the same continuous feature multiple times along a single branch path with different threshold values.

CREATE MLMODEL my_reg_tree TYPE regression_tree ON (SELECT feature1, feature2, feature3, target FROM training_data) OPTIONS ('maxdepth' -> '5', 'enableresplits' -> 'true');

Call the trained model in a SELECT query, passing in the feature columns as arguments.

SELECT my_reg_tree(feature1, feature2, feature3) FROM scoring_data;

For the output of the model, see the sys.regression_tree_models system catalog table.

Gradient Boosted Trees Model

Type: gradient_boosted_trees

Gradient boosted trees (GBT) is an ensemble machine learning algorithm that builds a
sequence of decision trees, where each new tree is trained to correct the prediction errors of the previous trees. Unlike the random forest model, which creates trees independently in parallel, GBT uses a sequential approach based on gradient descent optimization, where each tree learns to predict the residual errors (gradients) of the current ensemble. This iterative error-correction process allows the model to capture complex, nonlinear patterns by progressively refining predictions through multiple weak learners. The algorithm supports both regression and classification tasks through different loss functions (use the lossfunction option to toggle whether the model performs regression or classification tasks). For regression with squared-error loss, trees directly predict residual errors, while classification with logistic loss maintains raw scores transformed through sigmoid or softmax functions.

Model Options

Required

numchildren — An integer value representing the total number of trees to build sequentially. Each tree learns to correct the errors of the previous trees.

learningrate — A decimal value between 0.0 and 1.0 that tunes how much the model learns from each successive child.

Optional

lossfunction — A string value that determines how the model calculates prediction errors and what type of problem it solves. Accepted values include:

'squared_error' — Configures the model for regression tasks. Calculates errors as the squared difference between predicted and actual values. Use this for predicting continuous numeric values (e.g., prices, temperatures, quantities). The target column must contain numeric values. This is the default value.

'log_loss' — Configures the model for classification tasks. This model uses logistic loss to calculate prediction errors for probability-based predictions.

fractionselected — A decimal value greater than 0.0 and less than or equal to 1.0 that specifies what fraction of the training rows to randomly select for each boosting iteration. When you specify a value, the algorithm uses ceil(fractionselected × totalrows) rows per iteration. If you do not specify this value, the model uses all rows in their original order.

inputsperchild — An integer greater than or equal to 1 that specifies the number of input features each boosting tree should use. This value cannot exceed the number of input features available in the data set. When you specify this value, the algorithm deterministically cycles through pre-enumerated feature subsets to ensure each tree uses exactly this many features. When you do not specify this value, the model uses all available features for each tree.

maxdepth — A positive integer value that represents the maximum allowable depth of the decision trees. The default value is 3.

maxthreads — A positive integer value that sets the maximum number of parallel threads to use for training each child decision tree. Parallel threads do not affect the sequential method of training each tree. The default value is 16.

enableresplits — A boolean value that determines if the tree can reuse the same continuous feature multiple times along a single branch (e.g., split on x1 < 7 and later x1 < 3). This action can capture more complex, range-specific relationships. If you do not specify this value, the default value is true, meaning continuous features remain available for additional splits after use, which allows the tree to create more complex decision boundaries. When you set this value to false, the model marks continuous features as exhausted after their first use, and the model cannot use these features again in subsequent splits in the same tree.

resplitdepth — An integer value that sets the maximum depth at which tree nodes can be re-split during optimization. This option controls how deep the algorithm searches for better split points. If you do not specify this value, the default value is 6.

resplitthreshold — A decimal value that sets the minimum improvement threshold required to trigger a re-split operation. Lower values (e.g., 0.01) allow more aggressive re-splitting but can increase training time. Higher values (e.g., 1.0) require larger improvements to trigger re-splits. If you do not specify this value, the default value is 0.1.

maxcellstofetch — An integer value that determines the memory threshold to switch from training with system memory to training with SQL queries in the database. In-memory training is generally faster, but is limited by the available SQL node memory. If the size of a training data subset exceeds this value, then the system performs training operations using SQL queries. The default value is 33,554,432 (calculated as 32 × 1024 × 1024).

metrics — A boolean value. If you set this value to true, the system calculates and stores final model metrics (R²/RMSE for regression or accuracy/log loss for classification) on the training data.

featurearray — A boolean value. If you set this value to true, the model expects all input features to be in a single array column instead of separate columns.

continuousfeatures — A set of integer values that specify the input feature columns that the model should treat as continuous numeric variables for optimal tree splitting. The value must be a comma-separated list of feature column indexes (starting with 1) that correspond to numeric columns in your SELECT SQL statement, excluding the target column. For example, 'continuousfeatures' -> '1,2' would treat the first and second columns of your SELECT statement as continuous. If you do not specify this value, the model treats all features as categorical, which can impair results for numeric columns like prices, measurements, or scores.

Execute the Model

This example creates a GBT model to predict house prices using a sequential ensemble approach. The model trains on four input features (square_feet, bedrooms, bathrooms, age_years) to predict the target variable (price), filtering out any records with missing price data to ensure clean training. The model uses these options:

'numchildren' ->
'100' — Builds a sequence of 100 regression trees. Each tree learns to correct the prediction errors made by all previous trees in the sequence. This relatively high number allows the model to capture complex patterns but requires careful tuning of the learning rate to prevent overfitting.

'learningrate' -> '0.1' — Sets a conservative learning rate where each new tree's contribution is scaled by 10 percent before being added to the ensemble. This moderate rate balances training speed with model stability.

'lossfunction' -> 'squared_error' — Configures the model for regression by using squared-error loss, which measures prediction accuracy as the squared difference between predicted and actual prices. This specification is the standard choice for continuous numeric predictions, ensuring the model optimizes for minimizing price prediction errors.

CREATE MLMODEL house_price_gbt TYPE gradient_boosted_trees ON (SELECT square_feet, bedrooms, bathrooms, age_years, price FROM housing_data WHERE price IS NOT NULL) OPTIONS ('numchildren' -> '100', 'learningrate' -> '0.1', 'lossfunction' -> 'squared_error');

Call the trained model in a SELECT query, passing in the feature columns as arguments.

SELECT house_id, house_price_gbt(square_feet, bedrooms, bathrooms, age_years) AS predicted_price FROM new_listings WHERE square_feet IS NOT NULL AND bedrooms IS NOT NULL AND bathrooms IS NOT NULL AND age_years IS NOT NULL;

For the output of the model, see the sys.gradient_boosted_tree_models system catalog table.

Related Links

docid\ pstuxvbgmg7vzwjjvyxpd
docid\ oz7w9viouqc4ledje98yp
docid\ jyvggzczktiksnbvb4msz
docid\ tbjvq0xtd tcxq17hm nc
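As a closing illustration of the numchildren and learningrate mechanics described for gradient boosted trees, here is a minimal sketch of sequential residual fitting under squared-error loss. This is not {{ocientml}}'s implementation: a real GBT fits a depth-limited regression tree in each round, while this hypothetical fit_gbt_constant_stumps reduces each "tree" to a single constant stump so the boosting arithmetic stays visible.

```python
def fit_gbt_constant_stumps(y, num_children=100, learning_rate=0.1):
    """Sketch of gradient boosting with squared-error loss: each round fits a
    weak learner to the residuals of the current ensemble, then adds its
    prediction scaled by the learning rate (the role of the learningrate
    option). Here the weak learner is a constant stump, not a real tree."""
    predictions = [0.0] * len(y)
    for _ in range(num_children):
        # Residuals are the negative gradient of squared-error loss.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = sum(residuals) / len(residuals)  # weak learner: mean residual
        predictions = [p + learning_rate * stump for p in predictions]
    return predictions

# Toy targets standing in for house prices; with many boosting rounds the
# constant-stump ensemble converges toward the target mean (250.0).
preds = fit_gbt_constant_stumps([200.0, 250.0, 300.0])
```

Because each round only recovers learning_rate times the remaining error, a lower learningrate needs a larger numchildren to reach the same fit, which is the tuning trade-off the example options above ('numchildren' -> '100', 'learningrate' -> '0.1') balance.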