Regression Models
Regression models investigate the relationship between independent variables and at least one dependent variable or outcome. When trained, regression models can help establish the relationship between variables by estimating how one affects the other.
The regression model types have many common options that provide similar functionality. The following table describes the purpose of the more common options.
Option | Notes |
metrics | The model supports the metrics option for all regression types. If this option is set to true, then the model also calculates the coefficient of determination (also known as the R^2 value), the root mean squared error (RMSE), and the adjusted R^2 value. If the option is not specified, then the model defaults this option to false. |
threshold | Most regression model types support the threshold option, which enables soft-thresholding. For simple linear regression, soft-thresholding is the same as lasso regression. For other model types, soft-thresholding is not quite the same as lasso regression. If you specify a non-zero threshold, the model shifts coefficients towards zero by the amount of the threshold. If this shift causes the coefficient to switch from positive to negative or vice versa, the model sets the coefficient to zero. The system performs this coefficient adjustment as a final step before the system saves the model coefficients. Any reported coefficient of determination is based on the model prior to adjustment. The adjustment causes the coefficient of determination of the saved model on the training data to be smaller than the reported value. However, this approach is useful because it helps prevent overfitting. |
gamma | Some regression model types support the gamma option. The value of this option is the Tikhonov square matrix in the form: {{a, b, c, ...}, {d, e, f, ...}, {g, h, i, ...}}. If you do not specify this option, the model defaults the option to the 0 matrix, which equates to multiple linear regression without regularization. You can perform ridge regression by using a gamma value. This is a bit complicated for the polynomial and linear combination regression model types. The system computes these model types by using multiple linear regression after applying the functions to be linearly combined to the input arguments. In these cases, the system applies ridge regression after applying the model functions to the model input values. |
weighted | Some regression model types also support weighted least squares. The model types do not support generalized least squares where the weight matrix can be non-diagonal. To use weighted least squares, add the option weighted and set it to true. Then, add one more column to the result set used to create the model. This new column must appear last, after the dependent variable, and it specifies the weight for that sample. |
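The quality metrics named in the metrics row can be illustrated with a short Python sketch. This is not the database's implementation. It uses the common textbook definitions, and two details are assumptions: RMSE divides by the number of samples, and the adjusted R^2 uses the number of independent variables as the predictor count. The function name regression_metrics is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred, num_predictors):
    """Return (r2, adjusted_r2, rmse) for observed values and model estimates."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalizes additional predictors; needs n > num_predictors + 1.
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - num_predictors - 1)
    # Assumption: RMSE averages over all n samples.
    rmse = math.sqrt(ss_res / n)
    return r2, adjusted_r2, rmse

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2, adjusted_r2, rmse = regression_metrics(y_true, y_pred, num_predictors=1)
print(r2, adjusted_r2, rmse)
```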
Model Type: SIMPLE LINEAR REGRESSION
Trains a new machine learning model of type <model_type> on the result set returned by the SQL SELECT statement. After the database creates the model, <model_name> becomes an executable function in SQL SELECT statements.
The result set used as input to the model must have two numeric columns. The first column is the independent variable (referred to as x). The second column is the dependent variable (referred to as y). The model finds the least squares best fit for y = ax + b.
The simple linear regression model type also supports the yIntercept option, which is the desired y-intercept of the resulting best fit line. If you do not specify this option, the model does not force the y-intercept to be any particular value and uses least squares to find the best value instead. If you force the y-intercept to be a particular value, the model uses least squares to find the best fit with that constraint.
Example:
When you execute the model after training, the model takes a single numeric argument that represents x and returns ax + b.
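The difference between a free and a forced y-intercept can be sketched in Python. This is a minimal illustration of the least squares math for y = ax + b, not the database's code. The function name fit_line and its y_intercept parameter are hypothetical stand-ins for the model and its yIntercept option.

```python
def fit_line(xs, ys, y_intercept=None):
    """Least squares fit of y = a*x + b; optionally force b to a given value."""
    n = len(xs)
    if y_intercept is None:
        # Unconstrained fit: slope from centered sums, intercept from the means.
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        a = sxy / sxx
        b = mean_y - a * mean_x
    else:
        # Constrained fit: b is fixed, so only the slope is optimized.
        b = float(y_intercept)
        a = sum(x * (y - b) for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
free_fit = fit_line(xs, ys)                     # data lies exactly on y = 2x + 1
forced_fit = fit_line(xs, ys, y_intercept=0.0)  # line must pass through the origin
print(free_fit, forced_fit)
```

With the intercept forced to zero, the slope becomes steeper than 2 to compensate for the missing offset.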
metrics - If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2) and the root mean squared error (RMSE). This option defaults to false.
yIntercept - If you set this option, then the option must be a numeric value. The system forces the specific y-intercept (i.e. the model value when x is zero).
threshold - This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any of them are greater than the threshold value, the threshold value is subtracted from the coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
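The soft-thresholding rule described for the threshold option can be written as a short Python sketch. The function name soft_threshold is hypothetical. Whether the system also adjusts the constant term is not stated here, so the sketch simply applies the rule to every value it is given.

```python
def soft_threshold(coefficients, threshold):
    """Shift each coefficient towards zero by threshold; zero out small ones."""
    adjusted = []
    for c in coefficients:
        if c > threshold:
            adjusted.append(c - threshold)
        elif c < -threshold:
            adjusted.append(c + threshold)
        else:
            # Coefficient lies between -threshold and +threshold.
            adjusted.append(0.0)
    return adjusted

adjusted = soft_threshold([2.5, -0.3, 0.1, -1.2], 0.5)
print(adjusted)
```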
Model Type: MULTIPLE LINEAR REGRESSION
Multiple linear regression means that there is a vector of independent variables and the dependent variable is a scalar-valued function of the vector input, and that it is linear in all vector components.
The model uses the result set as an input. The result set has N columns that must be numeric. The first N - 1 columns are the independent variables (it can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least squares best fit for y = a1 * x1 + a2 * x2 + ... + b, or in vector notation, y = ax + b, where a and x are vectors and the multiplication is a dot product.
Example:
When you execute the model after training, the independent variables are provided to the model function execution and the function returns the estimate of the dependent variable.
metrics - If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to false.
threshold - This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from the coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to the coefficients. For any coefficients that are between the negative and positive threshold values, the model sets the coefficients to zero.
weighted - If you set this option to true, the model performs weighted least squares regression, where each sample has a weight or importance associated with it. In this case, there is an extra numeric column after the dependent variable that has the weight for the sample. This option defaults to false.
gamma - If you set this option, the option must be a matrix. The value represents a Tikhonov gamma matrix used for regularization. For details, see Tikhonov regularization. The model uses this option for ridge regression.
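The weighted and gamma options can be illustrated together with the standard closed form for Tikhonov-regularized weighted least squares, (X^T W X + G^T G) a = X^T W y. The Python sketch below is an illustration under stated assumptions, not the database's implementation: the gamma matrix is assumed to be sized to the independent variables only, and the constant term b is assumed to be left unregularized. The names solve and fit_linear are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_linear(rows, targets, weights=None, gamma=None):
    """Return [a1, ..., ak, b] minimizing sum_i w_i*(y_i - a.x_i - b)^2 + |gamma a|^2."""
    k = len(rows[0])
    if weights is None:
        weights = [1.0] * len(rows)
    # Append a constant column so the last coefficient is the intercept b.
    design = [list(r) + [1.0] for r in rows]
    p = k + 1
    lhs = [[sum(w * d[i] * d[j] for d, w in zip(design, weights))
            for j in range(p)] for i in range(p)]
    rhs = [sum(w * d[i] * y for d, w, y in zip(design, weights, targets))
           for i in range(p)]
    if gamma is not None:
        # Add gamma^T gamma to the block for a1..ak (assumption: b is not penalized).
        for i in range(k):
            for j in range(k):
                lhs[i][j] += sum(gamma[m][i] * gamma[m][j] for m in range(k))
    return solve(lhs, rhs)

rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
targets = [2.0 * x1 + 3.0 * x2 + 1.0 for x1, x2 in rows]
plain = fit_linear(rows, targets)
weighted = fit_linear(rows, targets, weights=[1.0, 2.0, 1.0, 2.0, 1.0])
ridge = fit_linear(rows, targets, gamma=[[1.0, 0.0], [0.0, 1.0]])
print(plain, weighted, ridge)
```

Passing no gamma corresponds to the zero matrix, which reduces to ordinary multiple linear regression. A diagonal gamma gives the usual ridge penalty.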
Model Type: VECTOR AUTOREGRESSION
Vector auto-regression is a model that estimates the next value of multiple variables based on some number of lags of all variables, as a group. The model tries to build the following:
In this example, x1(t) means the value of x1 at time t, and x1(t-1) means the value of x1 at time t-1 (typically the previous sample time). The syntax <x1(t), x2(t)> demonstrates that the result of the model is a row vector that contains all the predictions of the model, and that all predictions rely on all lags of all variables.
When you create a model, the input result set must have one more column than the number of lags. Each column must be a row vector of a size equal to the number of variables. The first column is the unlagged values, e.g., {{x1, x2, x3}}. The second column is the first lag for all variables, e.g., {{x1_lag1, x2_lag1, x3_lag1}}.
Example:
When you execute the model after training, the number of arguments you specify must be equal to the number of lags. Each of those arguments must be a row vector that contains lags for all model variables. The first argument is the first lag. The second argument is the second lag, and so on. Because the model predicts the next value, the most recent observed (unlagged) values serve as the first lag at execution time.
numVariables - Specify this option as a positive integer for the number of variables in the model.
numLags - Specify this option as a positive integer for the number of lags in the model.
metrics - If you set this option to true, the function collects the metric for the coefficient of determination (r^2). This option defaults to false.
threshold - This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any of them are greater than the threshold value, the threshold value is subtracted from the coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
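The column layout for training and the argument layout for execution can be sketched in Python. This only shows how lagged rows line up. It does not fit a model, and the helper names build_var_rows and prediction_arguments are hypothetical. The sketch assumes that the series is ordered from oldest to newest observation.

```python
def build_var_rows(series, num_lags):
    """Build training rows [x(t), x(t-1), ..., x(t-num_lags)] from a time series.

    series is a list of observations, oldest first; each observation is a
    list with one value per variable.
    """
    rows = []
    for t in range(num_lags, len(series)):
        # Column 0 is the unlagged value; column k is the k-th lag.
        rows.append([series[t - lag] for lag in range(num_lags + 1)])
    return rows

def prediction_arguments(series, num_lags):
    """Arguments for predicting the next value: the newest observation is lag 1."""
    return [series[-1 - lag] for lag in range(num_lags)]

series = [[1, 10], [2, 20], [3, 30], [4, 40]]  # two variables, four time steps
training_rows = build_var_rows(series, num_lags=2)
next_step_args = prediction_arguments(series, num_lags=2)
print(training_rows)
print(next_step_args)
```

Each training row has numLags + 1 columns, while a prediction call takes numLags arguments, because the value being predicted is not yet known.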
Model Type: POLYNOMIAL REGRESSION
Polynomial regression means that there are one or more independent variables and one dependent variable, and that the dependent variable is modeled as an nth degree polynomial of the independent variables.
The order option must be set to a positive integer value that specifies the degree of the polynomial to use. If you specify the option negativePowers and set it to true, then the model also includes terms with negative exponents, but still with the restriction that the sum of the absolute value of the exponents in a term will be less than or equal to the value specified on the order option. Regardless of whether you use the negativePowers option, the model computes a coefficient for every possible term that meets this restriction. When you use the negativePowers option, the model contains many more terms. For example, a quadratic model over two independent variables has six terms, but when you use the negativePowers option, the model has 13 terms.
The result set you specify as input to the model has N columns, which must all be numeric. The first N-1 columns are the independent variables (it can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least squares best fit of a sum of all possible combinations of terms where the degree is less than or equal to the value of the order option. For example, with two independent variables (x1 and x2) and order set to 2, the model is y = a1*x1^2 + a2*x2^2 + a3*x1*x2 + a4*x1 + a5*x2 + b.
Example:
When you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable.
order - This option is the degree of the polynomial and must be set to a positive integer.
metrics - If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to false.
threshold - This option enables soft thresholding. If you specify this option, then the option must be a positive numeric value. After the model calculates the coefficients, if any of them are greater than the threshold value, the threshold value is subtracted from the coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to them. For any coefficients that are between the negative and positive threshold values, the model sets those coefficients to zero.
weighted - If you set this option to true, the model performs weighted least squares regression, where each sample has an associated weight. When weighted, there is an extra numeric column after the dependent variable that has the weight for the sample. This option defaults to false.
negativePowers - If you set this option to true, the model includes independent variables raised to negative powers. Polynomials that include such terms are called Laurent polynomials. The model generates all possible terms such that the sum of the absolute values of the exponents in each term is less than or equal to the order. For example, with two independent variables and the order set to 2, the model is: y = a1*x1^2 + a2*x1^-2 + a3*x2^2 + a4*x2^-2 + a5*x1*x2 + a6*x1^-1*x2 + a7*x1*x2^-1 + a8*x1^-1*x2^-1 + a9*x1 + a10*x1^-1 + a11*x2 + a12*x2^-1 + b.
gamma - If you specify this option, the value must be a matrix. The value represents a Tikhonov gamma matrix that is used for regularization. For details, see Tikhonov regularization.
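The term counts for the order and negativePowers options can be checked with a short Python sketch that enumerates exponent tuples. The function name polynomial_terms is hypothetical, and the ordering of terms is arbitrary and not necessarily the order the system uses.

```python
from itertools import product

def polynomial_terms(num_variables, order, negative_powers=False):
    """List exponent tuples whose absolute values sum to at most order."""
    low = -order if negative_powers else 0
    terms = []
    for exponents in product(range(low, order + 1), repeat=num_variables):
        if sum(abs(e) for e in exponents) <= order:
            terms.append(exponents)
    return terms

# The all-zero tuple corresponds to the constant term b.
quadratic = polynomial_terms(2, 2)
laurent = polynomial_terms(2, 2, negative_powers=True)
print(len(quadratic), len(laurent))
```

For two independent variables and order 2, the enumeration yields 6 exponent tuples without negative powers and 13 with them, which matches the counts stated for the quadratic example.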
Model Type: LINEAR COMBINATION REGRESSION
Linear combination regression is a model built on top of m independent variables and a single dependent variable. But in this case, the function used to perform least-squares regression is a linear combination of functions that you specify. The general form is y = c0 + c1 * f1(x1, x2, ...) + c2 * f2(x1, x2, ...) + ....
The model determines the number of independent variables based on the number of columns in the SQL statement that the model is based on. There is always a column for the dependent variable and there can be a weight column (see weighted option below). So, the number of independent variables is either one or two less than the number of columns in the result of the input SQL statement. The number of user-specified functions for the model must be given by defining function1, function2, and so on (keys in the options dictionary). As long as consecutive function key names exist, the model includes these names. The model always includes a constant term. The value strings for the functionN keys must be specified in SQL syntax and should refer to x1, x2, ... for the model input independent variables.
The result set you specify as input to the model has N columns, which must all be numeric. The first N-1 columns are independent variables (it can be considered a single independent variable that is a vector). The last column is the dependent variable. The model finds the least squares best fit for a model of the form y = a1 * f1(x1, x2, ... xn) + a2 * f2(x1, x2, ... xn) + ... + an * fn(x1, x2, ... xn), where f1, f2, ..., fn are functions that are provided in a required option.
Example:
To create a linear combination regression with two functions that exhibit a wave-like pattern, run this command:
When you execute the model after training, the independent variables are provided to the model function execution, and the function returns the estimate of the dependent variable.
functionN - You must specify the first function (f1) using a key named 'function1'. Subsequent functions must use keys with names that use subsequent values of N. You must specify functions in SQL syntax and should use the variables x1, x2, ..., xn to refer to the 1st, 2nd, and nth independent variables respectively. For example, 'function1' -> 'sin(x1 * x2 + x3)', 'function2' -> 'cos(x1 * x3)'.
metrics - If you set this option to true, the model collects quality metrics such as the coefficient of determination (r^2), the adjusted coefficient of determination, and the root mean squared error (RMSE). This option defaults to false.
threshold - This option enables soft thresholding. If you specify this option, the option must be a positive numeric value. After the model calculates the coefficients, if any coefficients are greater than the threshold value, the model subtracts the threshold value from the coefficients. If any coefficients are less than the negation of the threshold value, the model adds the threshold value to the coefficients. For any coefficients between the negative and positive threshold values, the model sets the coefficients to zero.
weighted - If you set this option to true, the model performs weighted least squares regression, where each sample has an associated weight or importance. When weighted, there is an extra numeric column after the dependent variable that represents the weight for the sample. This option defaults to false.
gamma - If you specify this option, the option must be a matrix. The value represents a Tikhonov gamma matrix used for regularization. For details, see Tikhonov regularization.
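The way linear combination regression reduces to multiple linear regression, by applying the functions to the inputs first, can be sketched in Python. This is an illustration only: Python callables stand in for the SQL function strings, the helper names solve and fit_linear_combination are hypothetical, and the fit uses unweighted, unregularized normal equations.

```python
import math

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_linear_combination(functions, rows, targets):
    """Return [c0, c1, ..., cm] for y = c0 + c1*f1(x) + ... + cm*fm(x)."""
    # Step 1: apply each function to the inputs; the leading 1.0 is the constant term.
    design = [[1.0] + [f(*row) for f in functions] for row in rows]
    # Step 2: ordinary least squares on the transformed columns.
    p = len(design[0])
    lhs = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    rhs = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(p)]
    return solve(lhs, rhs)

# Stand-ins for 'function1' -> 'sin(x1)' and 'function2' -> 'cos(x1)'.
functions = [lambda x1: math.sin(x1), lambda x1: math.cos(x1)]
rows = [[0.5 * i] for i in range(10)]
targets = [1.0 + 2.0 * math.sin(x) + 0.5 * math.cos(x) for (x,) in rows]
coefficients = fit_linear_combination(functions, rows, targets)
print(coefficients)
```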
Model Type: LOGISTIC REGRESSION
This model fits a logistic curve to the data such that when the value is greater than 0.5, the result is one class, and when the value is less than 0.5, the result is the other class. This model is a binary classification algorithm.
The first N - 1 inputs are features and must be numeric. Features can be one-hot encoded. The last input column is the class or label. There must be exactly two non-NULL labels in the result set used to create the model. The model best fits the logistic curve using a negative log likelihood loss function. The model uses an algorithm that is a combination of particle swarm optimization, line search, and genetic algorithms to find the best fit parameters.
For faster, lower quality models, try reducing the popSize, initialIterations, and subsequentIterations options. Conversely, for slower, higher quality models, try increasing the values for these same options.
When you execute this model after training, you supply the features as input, and the model returns the label as the output. The label can be any data type.
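The logistic curve, the 0.5 decision rule, and the negative log likelihood loss can be sketched in Python. This does not reproduce the optimizer (particle swarm optimization, line search, and genetic algorithms). It only evaluates the quantities that such an optimizer would minimize. The function names, the parameter layout with the bias last, and the example labels are all hypothetical.

```python
import math

def predict_probability(params, features):
    """Logistic curve value; params holds one weight per feature, then a bias."""
    z = params[-1] + sum(w * x for w, x in zip(params, features))
    return 1.0 / (1.0 + math.exp(-z))

def negative_log_likelihood(params, rows, targets):
    """Sum of per-sample losses; targets are 0 or 1 for the two classes."""
    total = 0.0
    for features, target in zip(rows, targets):
        p = predict_probability(params, features)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return total

def classify(params, features, labels=("no", "yes")):
    """Map the curve value to one of two labels using the 0.5 cutoff."""
    return labels[1] if predict_probability(params, features) > 0.5 else labels[0]

rows = [[0.0], [1.0], [2.0], [3.0]]
targets = [0, 0, 1, 1]
good_params = [4.0, -6.0]   # decision boundary at x = 1.5
flat_params = [0.0, 0.0]    # always predicts probability 0.5
print(negative_log_likelihood(good_params, rows, targets))
print(negative_log_likelihood(flat_params, rows, targets))
print(classify(good_params, [3.0]), classify(good_params, [0.0]))
```

Parameters that separate the two classes well produce a lower loss than parameters that predict 0.5 for every sample.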
metrics - If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.
popSize - If you specify this option, the value must be a positive integer. This value sets the population size for the particle swarm optimization (PSO) part of the algorithm. This option defaults to 100.
minInitParamValue - If you specify this option, the value must be a floating point number. This value sets the minimum for initial parameter values in the optimization algorithm. This option defaults to -10.
maxInitParamValue - If you specify this option, the value must be a floating point number. This value sets the maximum for initial parameter values in the optimization algorithm. This option defaults to 10.
initialIterations - If you specify this option, the value must be a positive integer. This value sets the number of PSO iterations for the first PSO pass. This option defaults to 500.
subsequentIterations - If you specify this option, the value must be a positive integer. This value sets the number of PSO iterations for subsequent iterations of the PSO algorithm. This option defaults to 100.
momentum - If you specify this option, the value must be a positive floating point number. This parameter controls how much PSO iterations move away from the local best value to explore new territory. This option defaults to 0.1.
gravity - If you specify this option, the value must be a positive floating point number. This parameter controls how much PSO iterations are drawn back towards the local best value. This option defaults to 0.01.
lossFuncNumSamples - If you specify this option, the value must be a positive integer. This parameter controls how many points are sampled when estimating the loss function. This option defaults to 1000.
numGAAttempts - If you specify this option, the value must be a positive integer. This parameter controls how many GA crossover possibilities the model tries. This option defaults to 10 million.
maxLineSearchIterations - If you specify this option, the value must be a positive integer. This parameter controls the maximum allowed number of iterations when the model runs the line search part of the algorithm. This option defaults to 200.
minLineSearchStepSize - If you specify this option, the value must be a positive floating point number. This parameter controls the minimum step size that the line search algorithm ever takes. This option defaults to 1e-5.
samplesPerThread - If you specify this option, the value must be a positive integer. This parameter controls the target number of samples that are sent to each thread. Each thread independently computes a logistic regression model, and the models are all combined at the end. This option defaults to 1 million.
Model Type: NONLINEAR REGRESSION
Nonlinear regression finds best-fit parameters of an arbitrary function using an arbitrary loss function. This model type essentially provides direct access to capabilities that both logistic regression and support vector machines rely on. The first N - 1 columns are numeric independent variables, and the last column is the numeric dependent variable. For faster, lower quality models, reduce the popSize, initialIterations, and subsequentIterations options. Conversely, for slower, higher quality models, increase the values for these same options.
To create a nonlinear regression that fits five parameters a1, a2, a3, a4, a5 with two independent variables x1, x2, use the following statement.
When the model executes, pass N - 1 independent variables and the model returns the estimate of the dependent variable.
numParameters - Specify this option as a positive integer. This value specifies the number of different parameters to optimize, i.e. how many different aN variables there are in the user-specified function.
function - Specify the function to fit to the data in SQL syntax. Use a1, a2, … to refer to the parameters for optimization. Use x1, x2, … to refer to the input features. The model does not allow some SQL functions. The model allows only scalar expressions that can be represented internally as postfix expressions. Most notably, the model does not allow some functions that are rewritten as CASE statements (like least() and greatest()). If your function is not allowed, the model displays an error message.
metrics - If you set this option to true, the model calculates the coefficient of determination (r^2), the adjusted r^2, and the root mean squared error (RMSE). However, the model calculates these quality metrics using the least squares loss function, and not the user-specified loss function because these metrics really only make sense for least squares.
lossFunction - If you specify this option, this parameter dictates the loss function that the nonlinear optimizer uses on a per-sample basis. Then, the actual loss function is the sum of this function applied to all samples. The expression should use the variable y to refer to the dependent variable in the training data and should use the variable f to refer to the computed estimate for a given sample. The default is least squares, which could be specified as (f-y)*(f-y).
popSize - If you specify this option, the value must be a positive integer. This option sets the population size for the particle swarm optimization (PSO) part of the algorithm. This option defaults to 100.
minInitParamValue - If you specify this option, the value must be a floating point number. This option sets the minimum for initial parameter values in the optimization algorithm. This option defaults to -10.
maxInitParamValue - If you specify this option, the value must be a floating point number. This option sets the maximum for initial parameter values in the optimization algorithm. This option defaults to 10.
initialIterations - If you specify this option, the value must be a positive integer. This option sets the number of PSO iterations for the first PSO pass. This option defaults to 500.
subsequentIterations - If you specify this option, the value must be a positive integer. This value sets the number of subsequent PSO iterations after the initial pass. This option defaults to 100.
momentum - If you specify this option, the value must be a positive floating point number. This parameter controls how much PSO iterations move away from the local best value to explore new territory. This option defaults to 0.1.
gravity - If you specify this option, the value must be a positive floating point number. This parameter controls how much PSO iterations are drawn back towards the local best value. This option defaults to 0.01.
lossFuncNumSamples - If you specify this option, the value must be a positive integer. This parameter controls how many points the model samples when estimating the loss function. This option defaults to 1000.
numGAAttempts - If you specify this option, the value must be a positive integer. This parameter controls how many GA crossover possibilities the model tries. This option defaults to 10 million.
maxLineSearchIterations - If you specify this option, the value must be a positive integer. This parameter controls the maximum allowed number of iterations when the model runs the line search part of the algorithm. This option defaults to 200.
minLineSearchStepSize - If you specify this option, the value must be a positive floating point number. This parameter controls the minimum step size that the line search algorithm ever takes. This option defaults to 1e-5.
samplesPerThread - If you specify this option, the value must be a positive integer. This parameter controls the target number of samples that are sent to each thread. Each thread independently computes a nonlinear regression model, and the models are all combined at the end. This option defaults to 1 million.
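The relationship between the function and lossFunction options can be sketched in Python: the per-sample loss is applied to the estimate f and the observed value y, and the results are summed over all samples. This sketch evaluates the loss only and does not implement the optimizer. Python callables stand in for the SQL expressions, and the fitted function a1 * exp(a2 * x1) is a hypothetical example, not one taken from this reference.

```python
import math

def total_loss(model, per_sample_loss, params, rows, targets):
    """Sum the per-sample loss of the estimate f against the observed y."""
    return sum(per_sample_loss(model(params, x), y) for x, y in zip(rows, targets))

def model(a, x):
    # Hypothetical function with numParameters = 2: a1 * exp(a2 * x1).
    return a[0] * math.exp(a[1] * x[0])

def squared_error(f, y):
    # The default loss, equivalent to (f-y)*(f-y).
    return (f - y) * (f - y)

def absolute_error(f, y):
    # An alternative user-specified loss.
    return abs(f - y)

rows = [[0.0], [1.0], [2.0]]
true_params = [2.0, 0.5]
targets = [model(true_params, x) for x in rows]

loss_at_truth = total_loss(model, squared_error, true_params, rows, targets)
loss_elsewhere = total_loss(model, squared_error, [1.0, 1.0], rows, targets)
abs_loss_elsewhere = total_loss(model, absolute_error, [1.0, 1.0], rows, targets)
print(loss_at_truth, loss_elsewhere, abs_loss_elsewhere)
```

An optimizer searches for the parameter values that make this total as small as possible for the chosen per-sample loss.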