Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.ocient.com/llms.txt

Use this file to discover all available pages before exploring further.

enables regression analysis, which is the process of finding a best-fit function for a data set, in the System.

Simple Linear Regression

The simplest case is simple linear regression. Simple linear regression finds a best-fit linear relationship between one variable and another. Suppose you have this data with a bit of noise added. Use the CREATE TABLE AS SELECT SQL statement to create the mldemo.slr table. Generate 100 rows of random data.
SQL
CREATE TABLE mldemo.slr AS
(SELECT c1 AS x, c1 * 3.14 - 1.41 + (RAND() - 0.5) * 0.01 AS y
FROM sys.dummy100);
Modified 100 rows
Create a simple linear regression model using this data.
SQL
CREATE MLMODEL my_first_model TYPE SIMPLE LINEAR REGRESSION
ON (SELECT x, y FROM mldemo.slr);
Explore the created model by viewing the data in the machine_learning_models and simple_linear_regression_models system catalog tables.
SQL
SELECT name, on_select, slope, y_intercept
FROM sys.machine_learning_models a, sys.simple_linear_regression_models b
WHERE a.id = b.machine_learning_model_id AND name = 'my_first_model';
name                                         on_select                                    slope                 y_intercept
--------------------------------------------------------------------------------------------------------------------------------------
my_first_model                               select x, y from mldemo.slr                  3.139995450114637     -1.4102410466743238

Fetched 1 row
You can see the calculated slope and the y-intercept from the model, and see that these values are very close to the values used to generate the data. While the input to the model creation, in this case, is very simple, the SQL statement in the ON clause of the model creation can be arbitrarily complex as long as it meets the requirements of the input for the specific model type being created. Execute the created model by executing the SQL scalar function with the same name as the model you created. The input arguments of the function depend on the specific model type. For regression, the values are the independent variables, and then the function returns the dependent variables.
SQL
SELECT x, y AS actual, my_first_model(x) AS predicted FROM mldemo.slr LIMIT 10;
x          actual                predicted
-------------------------------------------------------
51         158.730137908257      158.72952690917217
52         161.87052128169267    161.8695223592868
53         165.0119331158625     165.00951780940144
54         168.1478263920115     168.14951325951608
55         171.28758882396247    171.28950870963072
56         174.43476312079238    174.42950415974536
57         177.57385971080322    177.56949960986
58         180.70933136639127    180.70949505997464
59         183.84856961481378    183.84949051008925
60         186.99044288540566    186.9894859602039

Fetched 10 rows
The model makes predictions that are very close to the actual training data. Then, you can use this data to generate predictions for data that is unknown.

Machine Learning Model Options

You can provide options for most model types. Provide options by pairing the string for an option to its corresponding value. For example, simple linear regression models have a metrics option that you can use to ask the database to calculate some quality of fit metrics. Use the same model and enable metrics collection, which is off by default.
SQL
DROP MLMODEL my_first_model;
Modified 0 rows

CREATE MLMODEL my_first_model TYPE SIMPLE LINEAR REGRESSION
ON (SELECT x, y FROM mldemo.slr) options('metrics' -> 'true');
Modified 0 rows
Examine the machine_learning_models and simple_linear_regression_models system catalog tables for more fields.
SQL
SELECT name, on_select, slope, y_intercept, coefficient_of_determination,
adjusted_r2, rmse FROM sys.machine_learning_models a, sys.simple_linear_regression_models b
WHERE a.id = b.machine_learning_model_id AND name = 'my_first_model';
name                                         on_select                                    slope                 y_intercept           coefficient_of_determination adjusted_r2           rmse
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
my_first_model                               select x, y from mldemo.slr                  3.139995450114637     -1.4102410466743238   0.9999999990764926           0.999999999067069     2.7544625001417E-4

Fetched 1 row
The database also calculates the coefficient of determination (frequently named r2), the adjusted r2 value, and the RMSE (root mean squared error). All of these values indicate that this model has an extremely good fit. In addition, you can force a specific y-intercept such that the value of f(0) is fixed. Use the yIntercept option. Create a model on the same data, but force the y-intercept to be zero. The metrics show that the model is not as good a fit as when the intercept is free to vary.
SQL
CREATE MLMODEL force_zero TYPE SIMPLE LINEAR REGRESSION
ON (SELECT x, y FROM mldemo.slr) options('metrics' -> 'true', 'yIntercept' -> '0');
Modified 0 rows

SELECT name, on_select, slope, y_intercept, coefficient_of_determination,
adjusted_r2, rmse FROM sys.machine_learning_models a, sys.simple_linear_regression_models b
WHERE a.id = b.machine_learning_model_id AND name = 'force_zero';
name                                         on_select                                    slope                 y_intercept           coefficient_of_determination adjusted_r2           rmse
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
force_zero                                   select x, y from mldemo.slr                  3.1189470762836766    0.0                   0.9999403831244921           0.9999397747890277    0.06998440678511778

Fetched 1 row

Multiple Linear Regression

In multiple linear regression, the model is still predicting an output value. But instead of handling one input variable, the model can handle an arbitrary number of input variables. Multiple linear regression finds the best-fit model of the form:
Text
y = a1x1 + a2x2 + a3x3 + … + a0
Simple linear regression and multiple linear regression (as well as polynomial regression and linear combination regression) all use a least-squares loss function. Some of the more advanced models allow you to define best fit yourself if the default definition is not satisfactory. Least-squares best fit means that the model minimizes the sum (over all training data rows) of the square of the model error. Create a training data set with some noise and show that the model can reconstruct the data. Multiple linear regression models also support a metrics option (the metrics are the same as simple linear regression).
SQL
CREATE TABLE mldemo.mlr AS (SELECT a.c1 AS x1, b.c1 AS x2, 1 + 2*a.c1 + 3*b.c1 AS y
FROM sys.dummy10 a, sys.dummy10 b);
Modified 100 rows

CREATE MLMODEL mlr_model TYPE MULTIPLE LINEAR REGRESSION ON (SELECT * FROM mldemo.mlr)
options('metrics' -> 'true');
Modified 0 rows

SELECT x1, x2, y AS actual, mlr_model(x1, x2) AS predicted FROM mldemo.mlr LIMIT 10;
x1         x2         actual              predicted
----------------------------------------------------------------
6          1          16                  15.999999999999975
6          2          19                  18.999999999999975
6          3          22                  21.999999999999975
6          4          25                  24.999999999999975
6          5          28                  27.999999999999975
6          6          31                  30.999999999999975
6          7          34                  33.99999999999997
6          8          37                  36.99999999999997
6          9          40                  39.99999999999997
6          10         43                  42.99999999999997

Fetched 10 rows

Polynomial Regression

With polynomial regression, the database finds a least-squares best-fit polynomial of whatever degree you specify. Same as multiple linear regression, you determine the number of independent variables. The training data has a 1/(x1 * x2) term that the model does not match, but that term quickly tends towards zero, so the model should provide a good fit with a quadratic polynomial of x1 and x2. Some of the coefficients match the data, but some do not because the model is compensating for not being able to fit the form of the data exactly. However, the metrics indicate that this model is still a great fit.
SQL
CREATE TABLE mldemo.pr AS
(SELECT x1, x2, 1.0 / (x1 * x2) - 2*x1 - 0.5*x1*x2 + 2.3*x1^2 + 1.1*x2^2 AS y
FROM (SELECT a.c1 AS x1, b.c1 AS x2 FROM sys.dummy10 a, sys.dummy10 b));
Modified 100 rows

CREATE MLMODEL pr1 TYPE POLYNOMIAL REGRESSION
ON (SELECT * FROM mldemo.pr) options('order' -> '2', 'metrics' -> 'true');
Modified 0 rows

SELECT name, on_select, num_arguments, coefficient_of_determination,
adjusted_r2, rmse, y_intercept, coefficients
FROM sys.machine_learning_models a, sys.polynomial_regression_models b
WHERE a.id = b.machine_learning_model_id AND name = 'pr1';
name                                         on_select                                    num_arguments coefficient_of_determination adjusted_r2           rmse                  y_intercept           coefficients
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pr1                                          select * from mldemo.pr                      2             0.9999991319879137           0.9999991140907573    0.06288797037552916   0.720605599825376     [-0.10943699031755831, 1.105235143324365, -2.1094369903178176, -0.4945162377513963, 2.3052351433243388]

Fetched 1 row
Polynomial regression models have many available options, including everything available for multiple linear regression. The negativePowers option enables you to fit Laurent polynomials. If you set this option to true, the model includes independent variables raised to negative powers. Sums of such terms are named Laurent polynomials. The model generates all possible terms such that the sum of the absolute value of the power of each term in each product is less than or equal to the order. For example, with two independent variables and the order set to 2, the model is: y = a1*x1^2 + a2*x1^-2 + a3*x2^2 + a4*x2^-2 + a5*x1*x2 + a6*x1^-1*x2 + a7*x1*x2^-1 + a8*x1^-1*x2^-1 a9*x1 + a10*x1^-1 + a11*x2 + a12*x2^-1 + b
SQL
CREATE TABLE mldemo.pr_negativepowers AS
(SELECT x1, x2, 1.0 + 12*x2^-1 - 11*x2 + 10*x1^-1 + 9*x1 + 8*(x1 - 1)*(x2 - 1) +
7*x1*x2^-1 + 6*x1^-1*x2 + 5*x1*x2 + 4*x2^-2 + 3*x2^2 + 2*x1^-2 + x1^2 AS y
FROM (SELECT a.c1 AS x1, b.c1 AS x2 FROM sys.dummy10 a, sys.dummy10 b));
Modified 100 rows

CREATE MLMODEL pr2 TYPE POLYNOMIAL REGRESSION
ON (SELECT * FROM mldemo.pr_negativepowers)
options('order' -> '2', 'metrics' -> 'true', 'negativePowers' -> 'true');
Modified 0 rows

SELECT name, on_select, num_arguments, coefficient_of_determination,
adjusted_r2, rmse, y_intercept, coefficients
FROM sys.machine_learning_models a, sys.polynomial_regression_models b
WHERE a.id = b.machine_learning_model_id AND name = 'pr2';
name                                         on_select                                    num_arguments coefficient_of_determination adjusted_r2           rmse                  y_intercept           coefficients
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pr2                                          select * from mldemo.pr                      2             1.0                          1.0                   1.4203541595255201E-10 -3.2741809263825417E-10 [1.378090447893218E-11, 7.594146281547169E-11, 1.0999999999996046, -5.800996726877696E-11, -1.9999999999236024, -0.5000000000004785, -5.027050118995952E-12, 6.257323548197746E-10, -3.953341155821705E-12, 0.9999999999510328, 2.2999999999966576, -3.637978807091713E-10]
You can see that now the model almost perfectly recovers the coefficients, with the rest being nearly zero. Execute the pr2 model.
SQL
SELECT x1, x2, y AS actual, pr2(x1, x2) AS predicted FROM mldemo.pr LIMIT 10;
x1                    x2                    actual                predicted
----------------------------------------------------------------------------------------
6.0                   1.0                   69.06666666666666     69.06666666676095
6.0                   2.0                   69.28333333333333     69.28333333346139
6.0                   3.0                   71.75555555555556     71.75555555569368
6.0                   4.0                   76.44166666666666     76.44166666681195
6.0                   5.0                   83.33333333333333     83.33333333348473
6.0                   6.0                   92.42777777777778     92.42777777793454
6.0                   7.0                   103.72380952380954    103.72380952397093
6.0                   8.0                   117.22083333333333    117.22083333349865
6.0                   9.0                   132.9185185185185     132.91851851868702
6.0                   10.0                  150.81666666666666    150.81666666683765

Fetched 10 rows

Linear Combination Regression

The most complex type of linear regression that Ocient supports is linear combination regression. This regression is a generalization of polynomial regression. Polynomial regression tries to find the best linear combination of polynomial terms (x1, x2,x1^2,x2^2,x1*x2, and so on). Linear combination regression finds the best fit linear combination, where you can specify all the terms. Assume this training data. Use the CREATE TABLE AS SELECT SQL statement to create the mldemo.lcr table. Generate 1000 rows of data.
SQL
CREATE TABLE mldemo.lcr AS
(SELECT x1, x2, x3, 10.0 * sin(x1) + 2.0 * cos(x2) - 3.0 * MOD(INT(x1)*INT(x2)*INT(x3), 4)
AS y FROM (SELECT a.c1 AS x1, b.c1 AS x2, c.c1 AS x3
FROM sys.dummy10 a, sys.dummy10 b, sys.dummy10 c));
Modified 1000 rows
Fit this data using a fifth-degree polynomial model.
SQL
CREATE MLMODEL pr_attempting_lcr TYPE POLYNOMIAL REGRESSION
ON (SELECT * FROM mldemo.lcr) options('order' -> '5', 'metrics' -> 'true');
Modified 0 rows

SELECT name, on_select, num_arguments, coefficient_of_determination,
adjusted_r2, rmse, y_intercept, coefficients
FROM sys.machine_learning_models a, sys.polynomial_regression_models b
WHERE a.id = b.machine_learning_model_id AND name = 'pr_attempting_lcr';
name                                         on_select                                    num_arguments coefficient_of_determination adjusted_r2           rmse                  y_intercept           coefficients
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pr_attempting_lcr                            select * from mldemo.lcr                     3             0.8319917524743956           0.8314857035360654    3.096446300059235     -55.14103736166726    [8.606630821262804, -3.232482517584192, 0.5805449365661026, -0.049741258743219, 0.0016307692308343673, 8.399340340347107, -0.5287664524114927, 0.08856961220855254, -0.0064693555605100815, 1.7800381437636605E-4, -5.4034476625452825, 0.08856961220520951, -0.008490499399775638, 2.684184502433214E-4, 1.4579535930358838, -0.006469355560037162, 2.68418450241128E-4, -0.16482119055274744, 1.780038143557162E-4, 0.006512838626194554, 84.78854869028874, -0.5287664524188649, 0.08856961220814323, -0.006469355560487793, 1.780038143772283E-4, -0.528766452427062, -0.01649034528388861, 0.0012223956025726305, -5.6680387259322435E-5, 0.08856961220929233, 0.0012223956025593338, -1.6695884464630135E-5, -0.006469355560584848, -5.6680387258156464E-5, 1.78003814376761E-4, -44.36009386495971, 0.08856961220998283, -0.008490499399632612, 2.6841845023805184E-4, 0.08856961221025506, 0.0012223956025763133, -1.669588446505692E-5, -0.008490499399797105, -1.6695884465316113E-5, 2.6841845024373886E-4, 9.206023264675196, -0.006469355560753055, 2.6841845023840556E-4, -0.006469355560705054, -5.6680387258737325E-5, 2.6841845024255106E-4, -0.8194647667934305, 1.7800381438661845E-4, 1.7800381438500682E-4, 0.026149389153476754]

Fetched 1 row
The coefficient of determination shows that the model is not a bad fit, but the fit is not great due to the structure of the data. However, given that the model is still a linear combination of functions, linear combination regression can recover the fit if you specify the functions to use.
SQL
CREATE MLMODEL lcr1 TYPE LINEAR COMBINATION REGRESSION
ON (SELECT x1, x2, x3, y FROM mldemo.lcr) options('function1' -> 'sin(x1)',
'function2' -> 'cos(x2)', 'function3' -> 'mod(int(x1)*int(x2)*int(x3),4)',
'metrics' -> 'true');
Modified 0 rows

SELECT name, on_select, num_arguments, coefficient_of_determination,
adjusted_r2, rmse, y_intercept, coefficients
FROM sys.machine_learning_models a, sys.linear_combination_regression_models b
WHERE a.id = b.machine_learning_model_id AND name = 'lcr1';
name                                         on_select                                    num_arguments coefficient_of_determination adjusted_r2           rmse                  y_intercept           coefficients
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
lcr1                                         select x1, x2, x3, y from mldemo.lcr         3             1.0                          1.0                   3.597211844847171E-14 3.9968028886505635E-15 [9.99999999999995, 2.000000000000012, -2.999999999999997]

Fetched 1 row
You can see that the model finds the coefficients perfectly and has a coefficient of determination of 1.0. The coefficient of determination represents the amount of variance in the data explained by the model. A coefficient of determination of 1.0 indicates that all the variance is explained and that the model is a perfect fit. A coefficient of determination of 0.0 indicates that none of the variance is explained. A negative coefficient of determination is possible, but you do not find it in linear regression models because these models always find a fit that is at least as good as the constant function (the average of all the training data). That model has a coefficient of determination of 0.0.

Non-Linear Regression

The non-linear regression model is similar to linear combination regression in that you define the form of the model. However, non-linear regression can be more than a linear combination of terms. Assume this training data. Use the CREATE TABLE AS SELECT SQL statement to create the mldemo.nlo_input1 table. Generate 100 rows of data. Then, create a non-linear regression model using five parameters.
SQL
CREATE TABLE mldemo.nlo_input1 AS (
    SELECT x1,
        x2,
        2.0 * sin(3.0 * x1 - 1.0) - 4.0 + 5.0 * x2 as y
    FROM (
            SELECT a.c1 / 4.0 as x1,
                b.c1 / 4.0 as x2
            FROM sys.dummy10 a,
                sys.dummy10 b
        )
);
Modified 100 rows

CREATE mlmodel nlo1 TYPE NONLINEAR REGRESSION ON (
    SELECT x1,
        x2,
        y
    FROM mldemo.nlo_input1
) options(
    'numParameters'->'5',
    'function'->'a1 * SIN(a2 * x1 + a3) + a4 + a5 * x2',
    'metrics'->'true'
);
Modified 0 rows
In this case, specify only one function that covers the whole model. Use the machine_learning_models and nonlinear_regression_models system catalog tables to see the metrics and the values that the model chose.
SQL
SELECT name,
    on_select
FROM sys.machine_learning_models a,
    sys.nonlinear_regression_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'nlo1';

name                                         on_select                                    coefficients                                                                                       coefficient_of_determination num_arguments rmse                  adjusted_r2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
nlo1                                         select x1, x2, y from mldemo.nlo_input1      [-2.085086077121484, 2.999196008218594, 2.14015015138094, -3.9822034572268925, 4.9800866766427205] 0.9985484616734742           2             0.06623956527655686   0.9985483156287689

Fetched 1 row

If you look at the coefficients, you can see the amplitude (2.0) is negative in the model, but that is offset by subtracting pi from the phase. The model still almost exactly matches how the data is generated, but it just found a slightly different, but equivalent, way of doing it. Non-linear regression models (and all of the other models that use the same infrastructure) have lots of options that control how the search works. These options affect the length of time it takes for training the model, but they also affect the likelihood of getting stuck in a global minimum. The model chooses default values such that they should be sufficient for a wide range of use cases. With non-linear regression models, you can specify a loss function other than least-squares. In this example, specify the abs(f-y) loss function.
SQL
CREATE MLMODEL nlo2 type NONLINEAR REGRESSION ON (
    SELECT x1,
        x2,
        y
    FROM mldemo.nlo_input1
) options(
    'numParameters'->'5',
    'function'->'a1 * sin(a2 * x1 + a3) + a4 + a5 * x2',
    'lossFunction'->'abs(f-y)'
);
Modified 0 rows
In this case, find the best fit that minimizes the sum of the absolute value of the model errors instead of the sum of the square of the model errors. The variable f represents the value of the model, and the variable y represents the actual value in the training data. The catalog indicates the model is nearly the same as the previous model. In other words, changing the loss function here did not substantially change anything. But there are cases where other loss functions are appropriate and they could substantially change the model.

Non-Linear Regression with Neural Networks

In all of the non-linear cases, you have to make a good guess for the form of the target function. However, with neural networks, you can leave it up to the network to try to figure out that form. The downside is that aspects of the model are a black box. This example uses the same example as non-linear regression, but instead of specifying the form of the function, it relies on what the neural net can do with little information. The size of the network layers and the number of layers greatly influence how good of a fit a neural net model has. This example uses a simple network.
SQL
CREATE MLMODEL nn1 TYPE FEEDFORWARD NETWORK ON (
    SELECT x1,
        x2,
        y
    FROM mldemo.nlo_input1
) options(
    'metrics'->'true',
    'hiddenLayers'->'2',
    'hiddenLayerSize'->'2',
    'outputs'->'1',
    'lossFunction'->'squared_error'
);
Modified 0 rows
In the case of neural network models, the metric used is an average loss because RMSE and the coefficient of determination are generally only appropriate for least squares models, and neural networks might be using other loss functions. You can see the average loss by querying the machine_learning_models and feedforward_network_models system catalog tables.
SQL
SELECT name,
    on_select,
    num_arguments,
    average_loss
FROM sys.machine_learning_models a,
    sys.feedforward_network_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'nn1';

name                                         on_select                                    num_arguments average_loss
------------------------------------------------------------------------------------------------------------------------------
nn1                                          select x1, x2, y from mldemo.nlo_input1      2             0.026744549902417786

Fetched 1 row
Compute the coefficient of determination manually.
SQL
SELECT AVG(y)
FROM mldemo.nlo_input1;

(sum(y))/(null_if(double(count(y)), (0)))
------------------------------------------
2.8698217335180987

Fetched 1 row

SELECT 1.0 - SUM((nn1(x1, x2) - y) ^ 2) / SUM((y - 2.8698217335180987) ^ 2) AS r2
FROM mldemo.nlo_input1;

r2
----------------------
0.9981701280250409

Fetched 1 row
This model fits the data very well.

Vector-Valued Regression

Vector-valued regression is the case where the dependent variable has multiple components. You can use any type of model to learn each component separately, or you can use a feedforward network to learn all components in one model. The example creates 100 rows of sample data.
SQL
CREATE TABLE mldemo.vector_valued AS (
    SELECT x1,
        { { 2 * x1 + 3,
        3 * x1 -2 } } as y
    FROM (
            SELECT c1 AS x1
            FROM sys.dummy100
        )
);
Modified 100 rows
There is a dependent variable that depends on x1, which is vector-valued. Each component can fit a linear model, which can then predict vector-valued results.
SQL
CREATE MLMODEL vv1 TYPE SIMPLE LINEAR REGRESSION ON (
    SELECT x1,
        y [1,1]
    FROM mldemo.vector_valued
);
Modified 0 rows

CREATE MLMODEL vv2 TYPE SIMPLE LINEAR REGRESSION ON (
    SELECT x1,
        y [1,2]
    FROM mldemo.vector_valued
);
Modified 0 rows
The query verifies that it produces reasonable results.
SQL
SELECT x1,
    y AS actual,
    { { vv1(x1),
    vv2(x1) } } AS predicted
FROM mldemo.vector_valued
LIMIT 10;

x1         actual                                                                          predicted
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
51         [[105.0, 151.0]]                                                                [[105.0, 151.0]]
52         [[107.0, 154.0]]                                                                [[107.0, 154.0]]
53         [[109.0, 157.0]]                                                                [[109.0, 157.0]]
54         [[111.0, 160.0]]                                                                [[111.0, 160.0]]
55         [[113.0, 163.0]]                                                                [[113.0, 163.0]]
56         [[115.0, 166.0]]                                                                [[115.0, 166.0]]
57         [[117.0, 169.0]]                                                                [[117.0, 169.0]]
58         [[119.0, 172.0]]                                                                [[119.0, 172.0]]
59         [[121.0, 175.0]]                                                                [[121.0, 175.0]]
60         [[123.0, 178.0]]                                                                [[123.0, 178.0]]

Fetched 10 rows
This example shows that a single model can perform this with the feedforward network model type.
SQL
CREATE MLMODEL vv3 TYPE FEEDFORWARD NETWORK ON (
    SELECT x1,
        y
    FROM mldemo.vector_valued
) options(
    'hiddenLayers'->'2',
    'hiddenLayerSize'->'2',
    'outputs'->'2',
    'lossFunction'->'vector_squared_error'
);
Modified 0 rows
outputs, which is set to the size of the vector, and the lossFunction option is set to use vector_squared_error. This example shows how well the model can predict the output.
SQL
SELECT x1,
    y AS actual,
    vv3(x1) AS predicted
FROM mldemo.vector_valued
LIMIT 10;

x1         actual                                                                          predicted
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
51         [[105.0, 151.0]]                                                                [[104.99939226614688, 150.99924962001512]]
52         [[107.0, 154.0]]                                                                [[106.99940272332195, 153.99914250568781]]
53         [[109.0, 157.0]]                                                                [[108.999413180497, 156.99903539136054]]
54         [[111.0, 160.0]]                                                                [[110.99942363767205, 159.99892827703323]]
55         [[113.0, 163.0]]                                                                [[112.9994340948471, 162.99882116270598]]
56         [[115.0, 166.0]]                                                                [[114.99944455202213, 165.99871404837862]]
57         [[117.0, 169.0]]                                                                [[116.99945500919716, 168.99860693405134]]
58         [[119.0, 172.0]]                                                                [[118.99946546637223, 171.99849981972403]]
59         [[121.0, 175.0]]                                                                [[120.99947592354728, 174.99839270539675]]
60         [[123.0, 178.0]]                                                                [[122.99948638072233, 177.99828559106948]]

Fetched 10 rows

Autoregression

Autoregression is linear regression where the independent variables are lags of the dependent variable. In other words, it is a series where the next value depends linearly on the previous N values. These examples do not use a new model type. Instead, they use multiple linear regression models. One of the important things about autoregression is that some variables indicate the order of the data. This variable can be a time variable, but it does not need to be. This variable is necessary to correctly set up the model, but it is not directly used by the model. Instead, the variable helps build the correct lags. This example builds a model to predict the Fibonacci sequence. This statement creates a table with the first several Fibonacci terms.
SQL
CREATE TABLE mldemo.fib AS (
    SELECT 0 AS t,
        0 AS f
    UNION ALL
    SELECT 1,
        1
    UNION ALL
    SELECT 2,
        1
    UNION ALL
    SELECT 3,
        2
    UNION ALL
    SELECT 4,
        3
    UNION ALL
    SELECT 5,
        5
    UNION ALL
    SELECT 6,
        8
    UNION ALL
    SELECT 7,
        13
    UNION ALL
    SELECT 8,
        21
    UNION ALL
    SELECT 9,
        34
);
Modified 10 rows
This example builds a model where the dependent variable depends on two lags.
SQL
CREATE MLMODEL fib TYPE MULTIPLE LINEAR REGRESSION ON (
    SELECT LAG(f, 1) OVER(
            ORDER BY t
        ),
        LAG(f, 2) OVER(
            ORDER BY t
        ),
        f
    FROM mldemo.fib
);
Modified 0 rows
When you execute the model, which provides the model with two lags, the model can predict the next value of the sequence that is not in the data, 55.
SQL
SELECT fib(21, 34);

((((1.0000000000001137))*((21)))+((1.4210854715202004e-14)))+(((1))*((34)))
----------------------------------------------------------------------------
55.0000000000024

Fetched 1 row

Vector Autoregression

Vector autoregression is just a more complex version of autoregression. This model uses multiple time series, where the next value for each time series depends linearly on the lags of all of the time series. This statement sets up some data where the next value for two time series depends on the previous two values of both time series.
SQL
CREATE TABLE mldemo.var_input AS (
    SELECT 3.1415 AS s1,
        1.4142 AS s2,
        0 AS t
    UNION ALL
    SELECT 2.7182 AS s1,
        1.618 AS s2,
        1 AS t
);
INSERT INTO mldemo.var_input
SELECT 0.9 * s1l1 - 0.8 * s2l1 + 0.7 * s1l2 - 0.6 * s2l2 AS s1,
    0.5 * s1l1 - 0.4 * s2l1 + 0.3 * s1l2 - 0.2 * s2l2 AS s2,
    2 AS t
FROM (
        SELECT s1 AS s1l1,
            s2 AS s2l1
        FROM mldemo.var_input
        WHERE t = 1
    ),
    (
        SELECT s1 AS s1l2,
            s2 AS s2l2
        FROM mldemo.var_input
        WHERE t = 0
    );
INSERT INTO mldemo.var_input
SELECT 0.9 * s1l1 - 0.8 * s2l1 + 0.7 * s1l2 - 0.6 * s2l2 AS s1,
    0.5 * s1l1 - 0.4 * s2l1 + 0.3 * s1l2 - 0.2 * s2l2 AS s2,
    3 AS t
FROM (
        SELECT s1 AS s1l1,
            s2 AS s2l1
        FROM mldemo.var_input
        WHERE t = 2
    ),
    (
        SELECT s1 AS s1l2,
            s2 AS s2l2
        FROM mldemo.var_input
        WHERE t = 1
    );
INSERT INTO mldemo.var_input
SELECT 0.9 * s1l1 - 0.8 * s2l1 + 0.7 * s1l2 - 0.6 * s2l2 AS s1,
    0.5 * s1l1 - 0.4 * s2l1 + 0.3 * s1l2 - 0.2 * s2l2 AS s2,
    4 AS t
FROM (
        SELECT s1 AS s1l1,
            s2 AS s2l1
        FROM mldemo.var_input
        WHERE t = 3
    ),
    (
        SELECT s1 AS s1l2,
            s2 AS s2l2
        FROM mldemo.var_input
        WHERE t = 2
    );
INSERT INTO mldemo.var_input
SELECT 0.9 * s1l1 - 0.8 * s2l1 + 0.7 * s1l2 - 0.6 * s2l2 AS s1,
    0.5 * s1l1 - 0.4 * s2l1 + 0.3 * s1l2 - 0.2 * s2l2 AS s2,
    5 AS t
FROM (
        SELECT s1 AS s1l1,
            s2 AS s2l1
        FROM mldemo.var_input
        WHERE t = 4
    ),
    (
        SELECT s1 AS s1l2,
            s2 AS s2l2
        FROM mldemo.var_input
        WHERE t = 3
    );
INSERT INTO mldemo.var_input
SELECT 0.9 * s1l1 - 0.8 * s2l1 + 0.7 * s1l2 - 0.6 * s2l2 AS s1,
    0.5 * s1l1 - 0.4 * s2l1 + 0.3 * s1l2 - 0.2 * s2l2 AS s2,
    6 AS t
FROM (
        SELECT s1 AS s1l1,
            s2 AS s2l1
        FROM mldemo.var_input
        WHERE t = 5
    ),
    (
        SELECT s1 AS s1l2,
            s2 AS s2l2
        FROM mldemo.var_input
        WHERE t = 4
    );
INSERT INTO mldemo.var_input
SELECT 0.9 * s1l1 - 0.8 * s2l1 + 0.7 * s1l2 - 0.6 * s2l2 AS s1,
    0.5 * s1l1 - 0.4 * s2l1 + 0.3 * s1l2 - 0.2 * s2l2 AS s2,
    7 AS t
FROM (
        SELECT s1 AS s1l1,
            s2 AS s2l1
        FROM mldemo.var_input
        WHERE t = 6
    ),
    (
        SELECT s1 AS s1l2,
            s2 AS s2l2
        FROM mldemo.var_input
        WHERE t = 5
    );
INSERT INTO mldemo.var_input
SELECT 0.9 * s1l1 - 0.8 * s2l1 + 0.7 * s1l2 - 0.6 * s2l2 AS s1,
    0.5 * s1l1 - 0.4 * s2l1 + 0.3 * s1l2 - 0.2 * s2l2 AS s2,
    8 AS t
FROM (
        SELECT s1 AS s1l1,
            s2 AS s2l1
        FROM mldemo.var_input
        WHERE t = 7
    ),
    (
        SELECT s1 AS s1l2,
            s2 AS s2l2
        FROM mldemo.var_input
        WHERE t = 6
    );
The query inspects the data.
SQL
SELECT *
FROM mldemo.var_input
ORDER BY t;

s1                    s2                    t
----------------------------------------------------------------
3.1415                1.4142                0
2.7182                1.618                 1
2.50251               1.37151               2
2.086991              1.1945109999999999    3
1.8515340999999994    1.0421421             4
1.5768541099999998    0.8961053099999997    5
1.3730730609999995    0.7770167409999997    6
1.1802870530999996    0.6695650050999997    7
1.0215474418099997    0.5788360946099997    8

Fetched 9 rows
This example fits the vector autoregression model to the data.
SQL
CREATE MLMODEL var1 type VECTOR AUTOREGRESSION ON (
    SELECT { { s1,
        s2 } },
        { { s1_lag1,
        s2_lag1 } },
        { { s1_lag2,
        s2_lag2 } }
    FROM (
            SELECT s1,
                s2,
                LAG(s1, 1) OVER(
                    ORDER BY t
                ) as s1_lag1,
                LAG(s1, 2) OVER(
                    ORDER BY t
                ) as s1_lag2,
                LAG(s2, 1) OVER(
                    ORDER BY t
                ) as s2_lag1,
                LAG(s2, 2) OVER(
                    ORDER BY t
                ) AS s2_lag2
            FROM mldemo.var_input
        )
) options('numVariables'->'2', 'numLags'->'2');
Modified 0 rows
See the details of the model in the machine_learning_models and vector_autoregression_models system catalog tables.
SQL
SELECT NAME,
    coefficients,
    coefficient_of_determination
FROM sys.machine_learning_models a,
    sys.vector_autoregression_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'var1';

name                                         coefficients                                                                    coefficient_of_determination
----------------------------------------------------------------------------------------------------------------------------------------------------------
var1                                         [-2.2737367544323206E-12, 0.8999999999432475, -0.799999999901047, 0.6999999999643478, -0.599999999973079, -5.684341886080801E-13, 0.4999999999781721, -0.39999999997235136, 0.29999999999017746, -0.1999999999825377] NULL

Fetched 1 row
The model recovered the expected coefficients. The coefficient of determination is NULL here because the model did not enable metrics collection in the options. This example checks that the model matches the data reasonably well.
SQL
SELECT var1({ { s1_lag1, s2_lag1 } }, { { s1_lag2, s2_lag2 } }) AS predicted,
    { { s1,
    s2 } } AS actual
FROM (
        SELECT s1,
            s2,
            s1_lag1,
            s2_lag1,
            s1_lag2,
            s2_lag2
        FROM (
                SELECT s1,
                    s2,
                    LAG(s1, 1) OVER(
                        ORDER BY t
                    ) AS s1_lag1,
                    LAG(s1, 2) OVER(
                        ORDER BY t
                    ) AS s1_lag2,
                    LAG(s2, 1) OVER(
                        ORDER BY t
                    ) AS s2_lag1,
                    LAG(s2, 2) OVER(
                        ORDER BY t
                    ) AS s2_lag2
                FROM mldemo.var_input
            )
    );

predicted                                                                       actual
----------------------------------------------------------------------------------------------------------------------------------------------------------------
NULL                                                                            [[3.1415, 1.4142]]
NULL                                                                            [[2.7182, 1.618]]
[[2.5025099999296385, 1.3715099999786722]]                                      [[2.50251, 1.37151]]
[[2.086990999938066, 1.1945109999842818]]                                       [[2.086991, 1.1945109999999999]]
[[1.8515340999451868, 1.0421420999862723]]                                      [[1.8515340999999994, 1.0421421]]
[[1.5768541099535214, 0.8961053099881894]]                                      [[1.5768541099999998, 0.8961053099999997]]
[[1.3730730609589523, 0.7770167409897997]]                                      [[1.3730730609999995, 0.7770167409999997]]
[[1.1802870530645944, 0.6695650050911028]]                                      [[1.1802870530999996, 0.6695650050999997]]
[[1.0215474417789623, 0.5788360946022622]]                                      [[1.0215474418099997, 0.5788360946099997]]

Fetched 9 rows
The first two values have no predictions because the lags necessary to drive the predictions do not exist. For details, see Vector Autoregression. Machine Learning Model Functions Regression Models
Last modified on May 27, 2026