SQL Reference
Machine Learning Model Functio...
Regression Models
{{ocientml}} supports regression models that investigate the relationship between independent variables and at least one dependent variable or outcome when trained, regression models can help establish the relationship between variables by estimating how one affects the other to create the model, use the create mlmodel syntax for details, see machine learning model functions docid\ jsgwuw5og56fzrve5h10g regression model common options the regression model types have many common options that provide similar functionality this table describes the purpose of the more common options option notes metrics the model supports the metrics option for all regression types if this option is set to true , then the model also calculates the coefficient of determination (also known as the r^2 value), the root mean squared error (rmse), and the adjusted r^2 value if the option is not specified, then the model defaults this option to false threshold most regression model types support the threshold option, which enables soft thresholding this model type is the same as lasso regression for simple linear regression however, this model type is not quite the same as lasso regression for other model types if you specify a non zero threshold, the model shifts coefficients towards zero by the amount of the threshold if this shift causes the coefficient to switch from positive to negative or vice versa, the model sets the coefficient to zero the system performs this coefficient adjustment as a final step before the system saves the model coefficients any reported coefficient of determination is based on the model prior to adjustment the adjustment always causes the coefficient of determination to be smaller however, this approach is useful as it prevents overfitting gamma some regression model types support the gamma option the value of this option is the tikhonov square matrix in the form {{a, b, c, }, {d, e, f, }, {g, h, i, }} if you do not specify this option, the model defaults the option to the 0 matrix, which equates to multiple linear regression without regularization you can perform ridge regression by using a gamma value this is a bit complicated for the polynomial and linear combination regression model types the system computes these model types by using multiple linear regression after applying the functions to be linearly combined to the input arguments in these cases, the system applies ridge regression after applying the model functions to the model input values weighted some regression model types also support weighted least squares the model types do not support generalized least squares where the weight matrix can be non diagonal to use weighted least squares, add the option weighted and set it to true then, add one more column to the result set used to create the model this new column must appear last, after the dependent variable, and it specifies the weight for that sample model option names are case sensitive simple linear regression model type simple linear regression trains a new machine learning model of type \<model type> on the result set returned by the sql select statement after the database creates the model, \<model name> becomes an executable function in sql select statements create mlmodel \<model name> type \<model type> on( \<sql select statement> ) \[options(\<option list>)] the result set used as input to the model must have two numeric columns the first column is the independent variable (referred to as x ) the second column is the dependent variable (referred to as y ) the model finds the least squares best fit for y = ax + b the simple linear regression model type also supports the yintercept option, which is the y intercept of the resulting best fit line if you do not specify this option, the model does not force the y intercept to be any particular value, and the model uses least squares to find the best value instead if you force the y intercept to be a particular value, the uses least squares to find the best fit with that constraint model options optional metrics — if you set this option to true , the model collects quality metrics such as the coefficient of determination (r^2) and the root mean squared error (rmse) this option defaults to false yintercept — if you set this option, then the option must be a numeric value the system forces the specific y intercept (i e , the model value when x is zero) threshold — this option enables soft thresholding if you specify this option, then the option must be a positive numeric value after the model calculates the coefficients, if any are greater than the threshold value, the threshold value is subtracted from them if any coefficients are less than the negation of the threshold value, the model adds the threshold value to them for any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero loadbalance — if you set this option , the database appends the using load balance shuffle = \<value> clause to all intermediate sql queries the model executes during training where value is the specified option value ( true or false ) the default value is unspecified in this case, the database does not add this clause queryinternalparallelism — if you set this option , the database appends the using parallelism = \<value> clause to all intermediate sql queries the model executes during training where value is the specified positive integer value the default value is unspecified in this case, the database does not add this clause skipdroptable — if you set this option to true , the database prevents the deletion of any intermediate tables that it creates during model training this option defaults to false execute the model create a simple linear regression model with y intercept 10 collect metrics for the model execution by setting the metrics option to true create mlmodel my model type simple linear regression on ( select x1, y from public my table ) options( 'yintercept' > '10', 'metrics' > 'true' ); after you create the model, you can see its details by querying the sys machine learning models and sys machine learning model options system catalog tables when you execute the model after training, the model takes a single numeric argument that represents x and returns ax + b select my model(col1) from my table; after you execute a model, you can find the details about the execution results in the sys simple linear regression models system catalog table for details, see the description of the associated system catalog tables in the machine learning section in system catalog multiple linear regression model type multiple linear regression multiple linear regression means that there is a vector of independent variables and the dependent variable is a scalar valued function of the vector input, and that it is linear in all vector components the model uses the result set as an input the result set has n columns that must be numeric the first n 1 columns are the independent variables (it can be considered a single independent variable that is a vector) the last column is the dependent variable the model finds the least squares best fit for y = a1 x1 + a2 x2 + + b , or in vector notation, y = ax + b , where a and x are vectors and the multiplication is a dot product model options optional metrics — if you set this option to true , the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (rmse) this option defaults to false threshold — this option enables soft thresholding if you specify this option, the option must be a positive numeric value after the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from the coefficients if any coefficients are less than the negation of the threshold value, the model adds the threshold value to the coefficients for any coefficients that are between the negative and positive threshold values, the model sets the coefficients to zero weighted — if you set this option to true , the model performs weighted least squares regression, where each sample has a weight or importance associated with it in this case, there is an extra numeric column after the dependent variable that has the weight for the sample this option defaults to false gamma — if you set this option, the option must be a matrix the value represents a tikhonov gamma matrix used for regularization for details, see tikhonov regularization the model uses this option for ridge regression yintercept — if you set this option, then the option must be a numeric value the system forces the specific y intercept (i e , the model value when x is zero) loadbalance — if you set this option , the database appends the using load balance shuffle = \<value> clause to all intermediate sql queries the model executes during training where value is the specified option value ( true or false ) the default value is unspecified in this case, the database does not add this clause queryinternalparallelism — if you set this option , the database appends the using parallelism = \<value> clause to all intermediate sql queries the model executes during training where value is the specified positive integer value the default value is unspecified in this case, the database does not add this clause skipdroptable — if you set this option to true , the database prevents the deletion of any intermediate tables that it creates during model training this option defaults to false execute the model create a multiple linear regression model collect metrics for the model execution by setting the metrics option to true create mlmodel my model type multiple linear regression on ( select x1, x2, x3, y from public my table ) options( 'metrics' > 'true' ); after you create the model, you can see its details by querying the sys machine learning models and sys machine learning model options system catalog tables when you execute the model after training, the independent variables are provided to the model function execution and the function returns the estimate of the dependent variable select my model(col1, col2, col3) from my table; after you execute a model, you can find the details about the execution results in the sys multiple linear regression models system catalog table for details, see the description of the associated system catalog tables in the machine learning section in system catalog vector autoregression model type vector autoregression vector auto regression is a model that estimates the next value of multiple variables based on some number of lags of all variables, as a group the model tries to build the following estimate \<x1(t), x2(t)> based on x1(t 1), x2(t 1), x1(t 2), and x2(t 2) in this example, x1(t) means the value of x1 at time t , and x1(t 1) means the value of x1 at time t 1 (typically the previous sample time) the syntax \<x1(t), x2(t)> demonstrates that the result of the model is a row vector that contains all the predictions of the model and that all predictions rely on all lags of all variables when you create a model, the input result set must have one more column than the number of lags each column must be a row vector of a size equal to the number of variables the first column is the un lagged values, e g , {{x1, x2, x3}} the second column is the first lag for all variables, e g , {{x1 lag1, x2 lag2, x3 lag3}} model options required numvariables — specify this option as a positive integer for the number of variables in the model numlags — specify this option as a positive integer for the number of lags in the model optional metrics — if you set this option to true, the function collects the metric for the coefficient of determination (r^2) this option defaults to false threshold — this option enables soft thresholding if you specify this option, then the option must be a positive numeric value after the model calculates the coefficients, if any of them are greater than the threshold value, the threshold value is subtracted from them if any coefficients are less than the negation of the threshold value, the model adds the threshold value to them for any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero loadbalance — if you set this option , the database appends the using load balance shuffle = \<value> clause to all intermediate sql queries the model executes during training where value is the specified option value ( true or false ) the default value is unspecified in this case, the database does not add this clause queryinternalparallelism — if you set this option , the database appends the using parallelism = \<value> clause to all intermediate sql queries the model executes during training where value is the specified positive integer value the default value is unspecified in this case, the database does not add this clause skipdroptable — if you set this option to true , the database prevents the deletion of any intermediate tables that it creates during model training this option defaults to false execute the model create a vector autoregression model with three variables and four lags on each one collect metrics for the model execution by setting the metrics option to true create mlmodel my model type vector autoregression on ( select {{x1, x2, x3}}, {{x1 lag1, x2 lag1, x3 lag1}}, {{x1 lag2, x2 lag2, x3 lag2}}, {{x1 lag3, x2 lag3, x3 lag3}}, {{x1 lag4, x2 lag4, x3 lag4}} from ( select x1, x2, x3, lag(x1, 1) over(order by t) as x1 lag1, lag(x1, 2) over(order by t) as x1 lag2, lag(x1, 3) over(order by t) as x1 lag3, lag(x1, 4) over(order by t) as x1 lag4, lag(x2, 1) over(order by t) as x2 lag1, lag(x2, 2) over(order by t) as x2 lag2, lag(x2, 3) over(order by t) as x2 lag3, lag(x2, 4) over(order by t) as x2 lag4, lag(x3, 1) over(order by t) as x3 lag1, lag(x3, 2) over(order by t) as x3 lag2, lag(x3, 3) over(order by t) as x3 lag3, lag(x3, 4) over(order by t) as x3 lag4 from public my table ) ) options( 'metrics' > 'true', 'numvariables' > '3', 'numlags' > '4' ); after you create the model, you can see its details by querying the sys machine learning models and sys machine learning model options system catalog tables when you execute the model after training, the number of arguments you specify must be equal to the number of lags each of those arguments must be a row vector that contains lags for all model variables the first argument is the first lag the second argument is the second lag, and so on in this example, the unlagged value is the first lag, which means that the model must predict the next value select my model( {{x1 lag1, x2 lag1, x3 lag1}}, {{x1 lag2, x2 lag2, x3 lag2}}, {{x1 lag3, x2 lag3, x3 lag3}}, {{x1 lag4, x2 lag4, x3 lag4}}) as predicted, {{x1, x2, x3}} as actual from ( select x1, x2, x3, lag(x1, 1) over(order by t) as x1 lag1, lag(x1, 2) over(order by t) as x1 lag2, lag(x1, 3) over(order by t) as x1 lag3, lag(x1, 4) over(order by t) as x1 lag4, lag(x2, 1) over(order by t) as x2 lag1, lag(x2, 2) over(order by t) as x2 lag2, lag(x2, 3) over(order by t) as x2 lag3, lag(x2, 4) over(order by t) as x2 lag4, lag(x3, 1) over(order by t) as x3 lag1, lag(x3, 2) over(order by t) as x3 lag2, lag(x3, 3) over(order by t) as x3 lag3, lag(x3, 4) over(order by t) as x3 lag4 from my table ); after you execute a model, you can find the details about the execution results in the sys vector autoregression models system catalog table for details, see the description of the associated system catalog tables in the machine learning section in system catalog polynomial regression model type polynomial regression polynomial regression means that there are one to many independent variables and one dependent variable and that should be modeled in terms of an nth degree polynomial of the independent variables the order option must be set to a positive integer value that specifies the degree of the polynomial to use if you specify the option negativepowers and set it to true , then the model also includes terms with negative exponents, but still with the restriction that the sum of the absolute value of the exponents in a term will be less than or equal to the value specified on the order option regardless of whether you use the negativepowers option, the model computes a coefficient for every possible term that meets this restriction when you use the negativepowers option, the model contains many more terms for example, a quadratic model over two independent variables has six terms, but when you use the negativepowers option, the model has 13 terms the result set you specify as input to the model has n columns, which must all be numeric the first n 1 columns are the independent variables (it can be considered a single independent variable that is a vector) the last column is the dependent variable the model finds the least squares best fit of a sum of all possible combinations of terms where the degree is less than or equal to the value of the order option for example, with two independent variables ( x1 and x2 ) and order set to 2, the model is y = a1 x1^2 + a2 x2^2 + a3 x1 x2 + a4 x1 + a5 x2 + b model options required order — this option is the degree of the polynomial and must be set to a positive integer optional metrics — if you set this option to true , the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (rmse) this option defaults to false threshold — this option enables soft thresholding if you specify this option, then the option must be a positive numeric value after the model calculates the coefficients, if any of them are greater than the threshold value, the threshold value is subtracted from them if any coefficients are less than the negation of the threshold value, the model adds the threshold value to them for any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero weighted — if you set this option to true , the model performs weighted least squares regression, where each sample has an associated weight when weighted, there is an extra numeric column after the dependent variable that has the weight for the sample this option defaults to false negativepowers — if you set this option to true , the model includes independent variables raised to negative powers these variables are called laurent polynomials the model generates all possible terms such that the sum of the absolute value of the power of each term in each product is less than or equal to the order for example, with two independent variables and the order set to 2, the model is y = a1 x1^2 + a2 x1^ 2 + a3 x2^2 + a4 x2^ 2 + a5 x1 x2 + a6 x1^ 1 x2 + a7 x1 x2^ 1 + a8 x1^ 1 x2^ 1 + a9 x1 + a10 x1^ 1 + a11 x2 + a12 x2^ 1 + b gamma — if you specify this option, the value must be a matrix the value represents a tikhonov gamma matrix that is used for regularization for details, see tikhonov regularization yintercept — if you set this option, then the option must be a numeric value the system forces the specific y intercept (i e , the model value when x is zero) loadbalance — if you set this option , the database appends the using load balance shuffle = \<value> clause to all intermediate sql queries the model executes during training where value is the specified option value ( true or false ) the default value is unspecified in this case, the database does not add this clause queryinternalparallelism — if you set this option , the database appends the using parallelism = \<value> clause to all intermediate sql queries the model executes during training where value is the specified positive integer value the default value is unspecified in this case, the database does not add this clause skipdroptable — if you set this option to true , the database prevents the deletion of any intermediate tables that it creates during model training this option defaults to false execute the model create a polynomial regression model of degree 3 collect metrics for the model execution by setting the metrics option to true create mlmodel my model type polynomial regression on ( select x1, x2, x3, y from public my table ) options( 'order' > '3', 'metrics' > 'true' ); after you create the model, you can see its details by querying the sys machine learning models and sys machine learning model options system catalog tables when you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable select my model(col1, col2, col3) from my table; after you execute a model, you can find the details about the execution results in the sys polynomial regression models system catalog table for details, see the description of the associated system catalog tables in the machine learning section in system catalog linear combination regression model type linear combination regression linear combination regression is a model built on top of m independent variables and a single dependent variable but in this case, the function used to perform least squares regression is a linear combination of functions that you specify the general form is y = c0 + c1 f1(x1, x2, ) + f2(x1, x2, ) + the model determines the number of independent variables based on the number of columns in the sql statement that the model is based on there is always a column for the dependent variable, and there can be a weight column (see the weighted option) so, the number of independent variables is either one or two less than the number of columns in the result of the input sql statement the number of user specified functions for the model must be specified by defining function1 , function2 , and so on (keys in the options dictionary) as long as consecutive function key names exist, the model includes these names the model always includes a constant term the value strings for the functionn keys must be specified in sql syntax and should refer to x1, x2, for the model input independent variables the result set you specify as input to the model has n columns, which must all be numeric the first n 1 columns are independent variables (it can be considered a single independent variable that is a vector) the last column is the dependent variable the model finds the least squares best fit for a model of the form y = a1 f1(x1, x2, xn) + a2 f2(x1, x2, xn) + + an fn(x1, x2, nx) , where f1, f2, , fn are functions that are provided in a required option model options required functionn — you must specify the first function ( f1 ) using a key named 'function1' subsequent functions must use keys with names that use subsequent values of n you must specify functions in sql syntax and should use the variables x1, x2, , xn to refer to the 1st, 2nd, and nth independent variables, respectively for example, 'function1' > 'sin(x1 x2 + x3)', 'function2' > 'cos(x1 x3)' optional metrics — if you set this option to true , the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (rmse) this option defaults to false threshold — this option enables soft thresholding if you specify this option, the option must be a positive numeric value after the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from the coefficients if any coefficients are less than the negation of the threshold value, the model adds the threshold value to the coefficients for any coefficients between the negative and positive threshold values, the model sets the coefficients to zero weighted — if you set this option to true , the model performs weighted least squares regression, where each sample has an associated weight or importance when weighted, there is an extra numeric column after the dependent variable that represents the weight of the sample this option defaults to false gamma — if you specify this option, it must be a matrix the value represents a tikhonov gamma matrix used for regularization for details, see tikhonov regularization yintercept — if you set this option, then the option must be a numeric value the system forces the specific y intercept (i e , the model value when x is zero) loadbalance — if you set this option , the database appends the using load balance shuffle = \<value> clause to all intermediate sql queries the model executes during training where value is the specified option value ( true or false ) the default value is unspecified in this case, the database does not add this clause queryinternalparallelism — if you set this option , the database appends the using parallelism = \<value> clause to all intermediate sql queries the model executes during training where value is the specified positive integer value the default value is unspecified in this case, the database does not add this clause skipdroptable — if you set this option to true , the database prevents the deletion of any intermediate tables that it creates during model training this option defaults to false execute the model create a linear combination regression with two functions that exhibit a wave like pattern create mlmodel my model type linear combination regression on ( select x1, x2, x3, y1 from public my table ) options( 'function1' > 'sin(x1 x2 + x3)', 'function2' > 'cos(x1 x3)' ); after you create the model, you can see its details by querying the sys machine learning models and sys machine learning model options system catalog tables when you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable select my model(col1, col2, col3) from my table; after you execute a model, you can find the details about the execution results in the sys linear combination regression models system catalog table for details, see the description of the associated system catalog tables in the machine learning section in system catalog nonlinear regression model type nonlinear regression nonlinear regression finds best fit parameters of an arbitrary function using an arbitrary loss function this model type essentially provides direct access to capabilities that both logistic regression and support vector machines rely on the first n 1 columns are numeric independent variables, and the last column is the numeric dependent variable for faster, lower quality models, reduce the popsize , initialiterations , and subsequentiterations options conversely, for slower, higher quality models, increase the values for these same options model options required numparameters — specify this option as a positive integer this value specifies the number of different parameters to optimize, i e , how many different an variables there are in the user specified function function — specify the name of the function to fit the data in sql syntax use a1, a2, … to refer to the parameters for optimization use x1, x2, … to refer to the input features the model does not allow some sql functions the model allows only scalar expressions that can be represented internally as postfix expressions most notably, the model does not allow some functions that are rewritten as case statements (like l east() and greatest() ) if your function is not allowed, the model displays an error message optional metrics — if you set this option to true, the model calculates the coefficient of determination (r^2), the adjusted r^2, and the root mean squared error (rmse) however, the model calculates these quality metrics using the least squares loss function, and not the user specified loss function because these metrics really only make sense for least squares lossfunction — if you specify this option, this parameter dictates the nonlinear optimizer the loss function uses on a per sample basis then, the actual loss function is the sum of this function applied to all samples the model should use the variable y to refer to the dependent variable in the training data and the variable f to refer to the computed estimate for a specified sample the default is least squares, which could be specified as (f y) (f y) popsize — if you specify this option, the value must be a positive integer this option sets the population size for the particle swarm optimization (pso) part of the algorithm it defaults to 100 mininitparamvalue — if you specify this option, the value must be a floating point number this option sets the minimum for initial parameter values in the optimization algorithm it defaults to 10 maxinitparamvalue — if you specify this option, the value must be a floating point number this option sets the maximum for initial parameter values in the optimization algorithm it defaults to 10 initialiterations — if you specify this option, the value must be a positive integer this option sets the number of pso iterations for the first pso pass this option defaults to 500 subsequentiterations — if you specify this option, the value must be a positive integer this value sets the number of subsequent pso iterations after the initial pass this option defaults to 100 momentum — if you specify this option, the value must be a positive floating point number this parameter controls how much pso iterations move away from the local best value to explore new territory this option defaults to 0 1 gravity — if you specify this option, the value must be a positive floating point number this parameter controls how much pso iterations are drawn back towards the local best value this option defaults to 0 01 lossfuncnumsamples — if you specify this option, the value must be a positive integer this parameter controls how many points the model samples when estimating the loss function this option defaults to 1000 numgaattempts — if you specify this option, the value must be a positive integer this parameter controls how many ga crossover possibilities the model tries this option defaults to 10 million maxlinesearchiterations — if you specify this option, the value must be a positive integer this parameter controls the maximum allowed number of iterations when the model runs the line search part of the algorithm this option defaults to 200 minlinesearchstepsize — if you specify this option, the value must be a positive floating point number this parameter controls the minimum step size that the line search algorithm takes this option defaults to 1e 5 ridgecoefficient — if you specify this option, the value must be a double data type this option is the ridge coefficient for the loss function the default behavior is the function ignores this option, effectively setting this option to 0 0 lassocoefficient — if you specify this option, the value must be a double data type this option is the lasso coefficient for the loss function the default behavior is the function ignores this option, effectively setting this option to 0 0 samplesperthread — if you specify this option, the value must be a positive integer this parameter controls the target number of samples sent to each thread each thread independently computes a logistic regression model, and the models are all combined at the end the option defaults to 1 million loadbalance — if you set this option , the database appends the using load balance shuffle = \<value> clause to all intermediate sql queries the model executes during training where value is the specified option value ( true or false ) the default value is unspecified in this case, the database does not add this clause queryinternalparallelism — if you set this option , the database appends the using parallelism = \<value> clause to all intermediate sql queries the model executes during training where value is the specified positive integer value the default value is unspecified in this case, the database does not add this clause skipdroptable — if you set this option to true , the database prevents the deletion of any intermediate tables that it creates during model training this option defaults to false execute the model create a nonlinear regression that fits five parameters a1, a2, a3, a4, a5 with two independent variables x1, x2 create mlmodel my model type nonlinear regression on ( select x1, x2, y1 from public my table ) options( 'numparameters' > '5', 'function' > 'a1 sin(a2 x1 + a3) + a4 + a5 x2' ); after you create the model, you can see its details by querying the sys machine learning models and sys machine learning model options system catalog tables when the model executes, pass n 1 independent variables, and the model returns the estimate of the dependent variable select my model(x1, x2) from my table; after you execute a model, you can find the details about the execution results in the sys nonlinear regression models system catalog table for details, see the description of the associated system catalog tables in the machine learning section in the system catalog related links machine learning in ocient docid\ arh0cufpzx bd gw0248n classification models docid\ i jn69mztmmumjrembhsn machine learning models docid\ cq7pmtrfsxiaz5dp2d9bs system catalog docid\ v571glhmrgmaqsa2dzm r