> ## Documentation Index
> Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Data Preparation

export const Ocient = "Ocient®";

Lag generator and vector assembler SQL functions simplify generating and organizing lagged features for time series data. Each function expands into standard SQL during query processing, so developers can reduce boilerplate and create feature sets with minimal code.

These functions fall into two main types:

* [Lag Generator Functions](#lag-generator-functions) — Create individual lag feature columns.
* [Vector Assembler Functions](#vector-assembler-function) — Group generated lag columns into vectors.

Developers can apply lag features across many {Ocient}-supported models, including:

* Simple Linear Regression (univariate forecasting with lag inputs)
* Multiple Linear Regression (multivariate regression with lagged predictors)
* Polynomial Regression (regression with higher-order lagged terms)
* Linear Combination Regression (custom functions of lagged inputs)
* Nonlinear Regression (arbitrary functions that include lag variables)
* Feedforward Neural Networks (lagged features as input vectors)
* Vector Autoregression (VAR) (multivariate time series with structured lag vectors)
* Autoregression (univariate time series modeled through lagged values)
* Vector Valued Regression (multivariate dependent variables with lagged inputs)

The functions provide general-purpose utilities for time series feature engineering. In particular, VAR models benefit the most from lag functions because the models require large numbers of structured lag vectors.

## Lag Generator Functions

Lag generator functions provide options for creating lagged columns for one or more variables in a single statement.

These functions follow the syntax rules of window aggregate functions as they define the window for the lag computation with an `OVER` and `ORDER BY` clause. For details, see [Window Aggregate Functions](/window-aggregate-functions).

The window parameters of lag generator functions are represented in these examples by `<window_specification>`.

### LAGS

Generates a series of lagged columns for a single variable in one statement. This function simplifies univariate time-series feature creation.

**Syntax**

```sql SQL theme={null}
LAGS(expression, start_lag, end_lag [, step])
    OVER (<window_specification>)
```

| **Argument** | **Type** | **Description**                                                                                                                         |
| ------------ | -------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `expression` | ANY      | The column or expression to lag.                                                                                                        |
| `start_lag`  | INTEGER  | First lag to generate. This value must be less than the `end_lag` value.                                                                |
| `end_lag`    | INTEGER  | Last lag to generate. This value must be greater than the `start_lag` value.                                                            |
| `step`       | INTEGER  | Optional. <br />Defines the gap between successive lags.  Must be a positive integer. <br />If unspecified, this value defaults to `1`. |

**Example**

This query produces three lagged columns (`sales_lag1`, `sales_lag2`, `sales_lag3`) from the sales column.

```sql SQL theme={null}
SELECT
    LAGS(sales, 1, 3) OVER (ORDER BY day)
FROM store_sales;
```

*Output*

```sql SQL theme={null}
sales_lag1 | sales_lag2 | sales_lag3
-----------+------------+------------
NULL       | NULL       | NULL
100        | NULL       | NULL
120        | 100        | NULL
130        | 120        | 100
```

### LAGS\_ZEROFILL

Generates lagged columns for a single variable and replaces NULL values with `0`.

**Syntax**

```sql SQL theme={null}
LAGS_ZEROFILL(expression, start_lag, end_lag [, step])
    OVER (<window_specification>)
```

| **Argument** | **Type** | **Description**                                                                                                                        |
| ------------ | -------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `expression` | ANY      | The column or expression to lag.                                                                                                       |
| `start_lag`  | INTEGER  | First lag to generate. This value must be less than the `end_lag` value.                                                               |
| `end_lag`    | INTEGER  | Last lag to generate. This value must be greater than the `start_lag` value.                                                           |
| `step`       | INTEGER  | Optional. <br />Defines the gap between successive lags. Must be a positive integer. <br />If unspecified, this value defaults to `1`. |

**Example**

This query produces three lagged sales columns and replaces NULL values with 0.

```sql SQL theme={null}
SELECT
    LAGS_ZEROFILL(sales, 1, 3) OVER (ORDER BY day)
FROM store_sales;
```

*Output*

```sql SQL theme={null}
sales_lag1 | sales_lag2 | sales_lag3
-----------+------------+------------
0          | 0          | 0
100        | 0          | 0
120        | 100        | 0
130        | 120        | 100
```

### MULTI\_LAGS

Generates lagged columns for multiple variables at once. This function is intended primarily for creating features for multivariate time-series models.

**Syntax**

```sql SQL theme={null}
MULTI_LAGS(expr1, expr2, [ ,... ] start_lag, end_lag [, step])
    OVER (<window_specification>)
```

| **Argument**              | **Type** | **Description**                                                                                                                       |
| ------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `expr1, expr2, [ ,... ] ` | ANY      | One or more columns or expressions to lag. Applies the same lag range and step to each input.                                         |
| `start_lag`               | INTEGER  | First lag to generate. This value must be less than the `end_lag` value.                                                              |
| `end_lag`                 | INTEGER  | Last lag to generate. This value must be greater than the `start_lag` value.                                                          |
| `step`                    | INTEGER  | Optional. <br />Defines the gap between successive lags. Must be a positive integer.<br />If unspecified, this value defaults to `1`. |

**Example**
This query creates four lagged columns for both `interest_rate` and `gdp_growth`.

```sql SQL theme={null}
SELECT
    MULTI_LAGS(interest_rate, gdp_growth, 1, 4) OVER (ORDER BY quarter)
FROM economic_data;
```

*Output*

```sql SQL theme={null}
interest_rate_lag1 | interest_rate_lag2 | interest_rate_lag3 | interest_rate_lag4 | gdp_growth_lag1 | gdp_growth_lag2 | gdp_growth_lag3 | gdp_growth_lag4
-------------------+--------------------+--------------------+--------------------+-----------------+-----------------+-----------------+-----------------
NULL               | NULL               | NULL               | NULL               | NULL            | NULL            | NULL            | NULL
2.5                | NULL               | NULL               | NULL               | 1.1             | NULL            | NULL            | NULL
2.7                | 2.5                | NULL               | NULL               | 1.2             | 1.1             | NULL            | NULL
2.8                | 2.7                | 2.5                | NULL               | 1.3             | 1.2             | 1.1             | NULL
```

### MULTI\_LAGS\_ZEROFILL

Generates lagged columns for multiple variables and replaces NULL values with `0`.

**Syntax**

```sql SQL theme={null}
MULTI_LAGS_ZEROFILL(expr1, expr2, [ ,... ] start_lag, end_lag [, step])
    OVER (<window_specification>)
```

| **Argument**              | **Type** | **Description**                                                                                                                       |
| ------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `expr1, expr2, [ ,... ] ` | ANY      | One or more columns or expressions to lag. Applies the same lag range and step to each input.                                         |
| `start_lag`               | INTEGER  | First lag to generate. This value must be less than the `end_lag` value.                                                              |
| `end_lag`                 | INTEGER  | Last lag to generate. This value must be greater than the `start_lag` value.                                                          |
| `step`                    | INTEGER  | Optional. <br />Defines the gap between successive lags. Must be a positive integer.<br />If unspecified, this value defaults to `1`. |

**Example**
This query generates four lagged columns each for `x1`, `x2`, and `x3`, replacing missing values with 0.

```sql SQL theme={null}
SELECT
    MULTI_LAGS_ZEROFILL(x1, x2, x3, 1, 4) OVER (ORDER BY time_col)
FROM public.my_time_series_table;
```

*Output*

```sql SQL theme={null}
x1_lag1 | x1_lag2 | x1_lag3 | x1_lag4 | x2_lag1 | x2_lag2 | x2_lag3 | x2_lag4 | x3_lag1 | x3_lag2 | x3_lag3 | x3_lag4
--------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------
0       | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0
10      | 0       | 0       | 0       | 20      | 0       | 0       | 0       | 30      | 0       | 0       | 0
11      | 10      | 0       | 0       | 21      | 20      | 0       | 0       | 31      | 30      | 0       | 0
12      | 11      | 10      | 0       | 22      | 21      | 20      | 0       | 32      | 31      | 30      | 0

```

## Vector Assembler Function

A vector assembler function provides options for grouping multiple lagged columns into structured vectors in a single statement. This function makes preparing input data for time series models easier, especially those that require organized lag structures, such as VAR.

### LAG\_VECTORS

Groups lagged columns generated by the `MULTI_LAGS` or `MULTI_LAGS_ZEROFILL` functions into vector columns. Models such as VAR require these structured lag vectors.

The resulting vectors are named `lag_vector_`.

**Syntax**

```sql SQL theme={null}
LAG_VECTORS(expr1, expr2, [ ..., ] start_lag, end_lag [, step])
```

| **Argument**              | **Type** | **Description**                                                                                                                       |
| ------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `expr1, expr2, [ ,... ] ` | NUMERIC  | Two or more lagged columns to group. Applies the same lag range and step to each input.                                               |
| `start_lag`               | INTEGER  | First lag to assemble. This value must be less than the `end_lag` value.                                                              |
| `end_lag`                 | INTEGER  | Last lag to assemble. This value must be greater than the `start_lag` value.                                                          |
| `step`                    | INTEGER  | Optional. <br />Defines the gap between successive lags. Must be a positive integer.<br />If unspecified, this value defaults to `1`. |

**Examples**

**Assemble Lagged Columns into Vectors**

This query uses `MULTI_LAGS_ZEROFILL` to generate lagged columns for `x1`, `x2`, and `x3`, then uses `LAG_VECTORS` to group them into `lag_vector1` through `lag_vector4`.

```sql SQL theme={null}
SELECT
    {x1, x2, x3} AS current_vector,
    LAG_VECTORS(x1, x2, x3, 1, 4)
FROM (
    SELECT
        x1, x2, x3,
        MULTI_LAGS_ZEROFILL(x1, x2, x3, 1, 4) OVER (ORDER BY time_col)
    FROM public.my_time_series_table
) s;
e;
```

*Output*

```sql SQL theme={null}
current_vector | lag_vector1 | lag_vector2 | lag_vector3 | lag_vector4
---------------+-------------+-------------+-------------+-------------
{10, 20, 30}  | {0, 0, 0}   | {0, 0, 0}   | {0, 0, 0}   | {0, 0, 0}
{11, 21, 31}  | {10, 20, 30}| {0, 0, 0}   | {0, 0, 0}   | {0, 0, 0}
{12, 22, 32}  | {11, 21, 31}| {10, 20, 30}| {0, 0, 0}   | {0, 0, 0}
{13, 23, 33}  | {12, 22, 32}| {11, 21, 31}| {10, 20, 30}| {0, 0, 0}

```

**Prepare Data for a VAR Model**

This example shows how to prepare training data for a VAR model with three variables (`x1`, `x2`, `x3`) and four lags. The example demonstrates how the lag generator functions and vector assembler can simplify feature creation compared to writing out every LAG expression manually.

* `MULTI_LAGS_ZEROFILL` generates all individual lagged columns (`x1_lag1`…`x3_lag4`) in a single call, replacing NULL values with zero.
* `LAG_VECTORS` automatically groups these lagged columns into vector inputs (`lag_vector1` ... `lag_vector4`), which the model requires.

```sql SQL theme={null}
CREATE MLMODEL my_var_model
TYPE VECTOR_AUTOREGRESSION ON (
    SELECT
        {x1, x2, x3},
        LAG_VECTORS(x1, x2, x3, 1, 4)
    FROM (
        SELECT
            x1, x2, x3,
            MULTI_LAGS_ZEROFILL(x1, x2, x3, 1, 4) OVER (ORDER BY t)
        FROM public.my_time_series_table
    )
)
OPTIONS (
    'numVariables' -> '3',
    'numLags'      -> '4'
);
```

**Prepare Data for a VAR Model with Manual Vector Assembly**

This version of the univariate regression example uses repeated `LAG` invocations instead of the `LAGS` function. Although it produces the exact same set of lagged features (`sales_lag1`, `sales_lag2`, `sales_lag3`), the query requires writing each `LAG` expression manually. This makes the SQL longer, more repetitive, and harder to maintain compared to the concise single-line `LAGS` version.

```sql SQL theme={null}
CREATE MLMODEL my_var_model_manual
TYPE VECTOR_AUTOREGRESSION ON (
    SELECT
        {x1, x2, x3},
        {x1_lag1, x2_lag1, x3_lag1},
        {x1_lag2, x2_lag2, x3_lag2},
        {x1_lag3, x2_lag3, x3_lag3},
        {x1_lag4, x2_lag4, x3_lag4}
    FROM (
        SELECT
            x1, x2, x3,
            COALESCE(LAG(x1, 1) OVER (ORDER BY t), 0) AS x1_lag1,
            COALESCE(LAG(x1, 2) OVER (ORDER BY t), 0) AS x1_lag2,
            COALESCE(LAG(x1, 3) OVER (ORDER BY t), 0) AS x1_lag3,
            COALESCE(LAG(x1, 4) OVER (ORDER BY t), 0) AS x1_lag4,
            COALESCE(LAG(x2, 1) OVER (ORDER BY t), 0) AS x2_lag1,
            COALESCE(LAG(x2, 2) OVER (ORDER BY t), 0) AS x2_lag2,
            COALESCE(LAG(x2, 3) OVER (ORDER BY t), 0) AS x2_lag3,
            COALESCE(LAG(x2, 4) OVER (ORDER BY t), 0) AS x2_lag4,
            COALESCE(LAG(x3, 1) OVER (ORDER BY t), 0) AS x3_lag1,
            COALESCE(LAG(x3, 2) OVER (ORDER BY t), 0) AS x3_lag2,
            COALESCE(LAG(x3, 3) OVER (ORDER BY t), 0) AS x3_lag3,
            COALESCE(LAG(x3, 4) OVER (ORDER BY t), 0) AS x3_lag4
        FROM public.my_time_series_table
    )
)
OPTIONS (
    'numVariables' -> '3',
    'numLags'      -> '4'
);
```

## Related Links

[Window Aggregate Functions](/window-aggregate-functions)

[Machine Learning in Ocient](/machine-learning-in-ocient)

[Machine Learning Models](/machine-learning-models)
