> ## Documentation Index
> Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Clustering and Dimension Reduction Models

export const OcientML = "OcientML™";

{OcientML} supports these machine learning models for clustering and dimension reduction.

To create the model, use the `CREATE MLMODEL` syntax. For details, see [CREATE MLMODEL](/machine-learning-model-functions).

<Info>
  Model option names are case-sensitive.
</Info>

## Principal Component Analysis

Model Type: `PRINCIPAL COMPONENT ANALYSIS`

Principal Component Analysis (PCA) is typically not used as a model on its own. PCA is most commonly used on the inputs to other models. PCA serves two purposes.

1. PCA normalizes all numeric feature data. Some types of models are sensitive to the scale of numeric features, and when different features have different scales, the results end up skewed. PCA normalizes all features to the same scale.
2. PCA is used for dimensionality reduction. PCA computes linear combinations of the original features to put the most signal into a smaller number of new features.

The input result set when creating a PCA model is N numeric columns, which are all features. There is no label or dependent variable. After you create a PCA model, the `sys.principal_component_analysis_models` system catalog table contains information on the percentage of the signal in each PCA feature. You can use this information to figure out how many of the output features to keep.

Also, after you create the model, you can see its details by querying the `sys.machine_learning_models` and `sys.machine_learning_model_options` system catalog tables.

### Principal Component Analysis and Logistic Regression

You can also use PCA models as inputs to other models. For example, if you have three features and you want to use PCA to reduce the number to two features, you can execute the following SQL statements.

```sql SQL theme={null}
CREATE MLMODEL reduceTo2
TYPE PRINCIPAL COMPONENT ANALYSIS ON (
  SELECT
    c1,
    c2,
    c3
  FROM public.my_table
);
```

You can use this model as input for another model, for example, logistic regression.

```sql SQL theme={null}
CREATE MLMODEL binaryClass
TYPE LOGISTIC REGRESSION ON (
  SELECT
    reduceTo2(c1, c2, c3, 1),
    reduceTo2(c1, c2, c3, 2),
    label
  FROM …
);
```

To correctly use this model later, you must pass the original features through the PCA model when you execute the logistic regression model.

```sql SQL theme={null}
SELECT
  binaryClass(
    reduceTo2(x1, x2, x3, 1),
    reduceTo2(x1, x2, x3, 2)
  )
FROM … ;
```

Similarly, to create a PCA analysis over four variables, execute this SQL statement:

```sql SQL theme={null}
CREATE MLMODEL my_model
TYPE PRINCIPAL COMPONENT ANALYSIS ON (
  SELECT
    c1,
    c2,
    c3,
    c4
  FROM public.my_table
);
```

### **Model Options**

`featureArray` — If you set this option to `true`, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.

`loadBalance` — If you set this option, the database appends the `USING load_balance_shuffle = <value>` clause to all intermediate SQL queries the model executes during training where `value` is the specified option value (`true` or `false`). The default value is unspecified. In this case, the database does not add this clause.

`queryInternalParallelism` — If you set this option, the database appends the `USING parallelism = <value>` clause to all intermediate SQL queries the model executes during training where `value` is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.

`skipDropTable` — If you set this option to `false`, the database deletes any intermediate tables created during model training. If you set this option to `true`, the database prevents the deletion of any intermediate tables created during model training. The default value is `false`.

## K-Means Clustering

Model Type: `KMEANS`

K-means is an unsupervised clustering algorithm. All of the columns in the input result set are features, and there is no label. All the input columns must be numeric. The algorithm finds `k` points such that all points are classified by the closest `k` points. Distance calculations are Euclidean by default.

### **Model Options**

#### Required

`k` — This option must be a positive integer. The option specifies the algorithm for how many clusters to make.

#### Optional

`epsilon` — If you specify this option, the value must be a valid positive floating point value. When the maximum distance that a centroid moves from one iteration of the algorithm to the next is less than this value, the algorithm terminates. This parameter defaults to `1e-4`.

`normalize` — If you set this option to `true`, the model normalizes the data before the start of training. The default value is `true`.

`featureArray` — If you set this option to `true`, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.

`loadBalance` — If you set this option, the database appends the `USING load_balance_shuffle = <value>` clause to all intermediate SQL queries the model executes during training where `value` is the specified option value (`true` or `false`). The default value is unspecified. In this case, the database does not add this clause.

`queryInternalParallelism` — If you set this option, the database appends the `USING parallelism = <value>` clause to all intermediate SQL queries the model executes during training where `value` is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.

`skipDropTable` — If you set this option to `false`, the database deletes any intermediate tables created during model training. If you set this option to `true`, the database prevents the deletion of any intermediate tables created during model training. The default value is `false`.

### **Execute the Model**

Create a K-means model with `8` clusters.

```sql SQL theme={null}
CREATE MLMODEL my_model
TYPE KMEANS
ON (
  SELECT
    x1,
    x2,
    x3,
    x4
  FROM public.my_table
)
options(
  'k' -> '8'
);
```

Also, after you create the model, you can see its details by querying the `sys.machine_learning_models` and `sys.machine_learning_model_options` system catalog tables.

Because there are no labels for clusters, when you execute this function after training with the same number (and same order) of features as input, the result is an integer that specifies the cluster to which the point belongs.

```sql SQL theme={null}
SELECT my_model(x1, x2, x3, x4) FROM my_table;
```

After you execute a model, you can find the details about the execution results in the `sys.k_means_models` system catalog table.

For details, see the description of the associated system catalog tables in the Machine Learning section in the [System Catalog](/system-catalog#page-title).

## Gaussian Mixture

Model Type: `GAUSSIAN MIXTURE MODEL`

Gaussian Mixture Models (GMMs) find the mixture of `N` Gaussian distributions that best fit the data. The most common use is unsupervised clustering, but there are some key differences compared to K-means clustering:

* GMMs can handle clusters that are not circular, i.e., clusters in the shape of ovals or higher-dimensional analogs with variances in different directions.
* Clusters can have an arbitrary rotation, i.e., they can have covariance.
* GMMs can handle instances when clusters are not as equally likely. If a point is right between two clusters, it is more likely to be more common in the training data.
* GMMs can show the probability of a new point belonging to each cluster instead of providing only a single cluster value.

GMMs generally handle more complex data than K-means models, but they require more training and processing time during execution.

### Model Options

#### Required

`numDistributions` — Must be a positive integer. This value specifies the number of clusters of Gaussian distributions for the model to make.

#### Optional

`epsilon` — If you specify this option, the value must be a valid positive floating point value. When the maximum distance that the entire best model moves in its n-dimensional space is less than this value, the algorithm terminates. This parameter defaults to 1e-8 (0.00000001 as a decimal number).

`normalize` — If you set this option to `true`, the model automatically computes the mean and standard deviation of each feature and uses them to normalize the data during training. Defaults to `true`.

`featureArray` — If you set this option to `true`, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.

`loadBalance` — If you set this option, the database appends the `USING load_balance_shuffle = <value>` clause to all intermediate SQL queries the model executes during training where `value` is the specified option value (`true` or `false`). The default value is unspecified. In this case, the database does not add this clause.

`queryInternalParallelism` — If you set this option, the database appends the `USING parallelism = <value>` clause to all intermediate SQL queries the model executes during training where `value` is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.

`skipDropTable` — If you set this option to `false`, the database deletes any intermediate tables created during model training. If you set this option to `true`, the database prevents the deletion of any intermediate tables created during model training. The default value is `false`.

### Execute the Model

Create a GMM with three clusters of Gaussian distributions.

```sql SQL theme={null}
CREATE MLMODEL gmm
TYPE GAUSSIAN MIXTURE MODEL
ON (
  SELECT
    x,
    y,
    z
  FROM my_table
)
options(
  'numDistributions' -> '3'
);
```

Also, after you create the model, you can see its details by querying the `sys.machine_learning_models` and `sys.machine_learning_model_options` system catalog tables.

When you execute this function after training with the same number (and same order) of features as input, the result is an integer that specifies the probable cluster to which the point belongs.

```sql SQL theme={null}
SELECT gmm(x,y,z) FROM my_table;
```

After you execute a model, you can find the details about the execution results in the `sys.gaussian_mixture_models` system catalog table.

For details, see the description of the associated system catalog tables in the Machine Learning section in the [System Catalog](/system-catalog).

## Linear Discriminant Analysis

Model Type: `LINEAR DISCRIMINANT ANALYSIS`

Linear Discriminant Analysis (LDA) is a dimension-reduction technique similar to PCA. While PCA is unsupervised, LDA can classify and use class labels to find linear combinations of features that best separate the classes.

The input result set when you create an LDA model is `N` numeric columns that are all features. The last column is a class label and can be any data type. After you create an LDA model, the `sys.linear_discriminant_analysis_models` system catalog table contains information on the percentage of the signal that is in each LDA feature. You can use this information to figure out how many of the output features to keep.

Also, after you create the model, you can see its details by querying the `sys.machine_learning_models` and `sys.machine_learning_model_options` system catalog tables.

### Linear Discriminant Analysis and Logistic Regression

You can use LDA models as inputs to other models. For example, to reduce the number of features from three to two using LDA, you can execute these SQL statements.

```sql SQL theme={null}
CREATE MLMODEL reduceTo2
TYPE LINEAR DISCRIMINANT ANALYSIS ON (
  SELECT
    c1,
    c2,
    c3,
    label
  FROM public.my_table
);
```

You can use this model as input for another model, for example, logistic regression.

```sql SQL theme={null}
CREATE MLMODEL binaryClass
TYPE LOGISTIC REGRESSION ON (
  SELECT
    reduceTo2(c1, c2, c3, 1),
    reduceTo2(c1, c2, c3, 2),
    label
  FROM …
  );
```

To correctly use this model later, you must pass the original features through the LDA model when you execute the logistic regression model.

```sql SQL theme={null}
SELECT
  binaryClass(
    reduceTo2(x1, x2, x3, 1),
    reduceTo2(x1, x2, x3, 2)
  )
FROM … ;
```

Similarly, to create an LDA model over four variables, execute this SQL statement.

```sql SQL theme={null}
CREATE MLMODEL my_model
TYPE LINEAR DISCRIMINANT ANALYSIS ON (
  SELECT
    c1,
    c2,
    c3,
    c4,
    label
  FROM public.my_table
);
```

### **Model Options**

`normalize` — If you set this option to `true`, the model automatically computes the mean and standard deviation of each feature and uses them to normalize the data during training. Defaults to `true`.

`featureArray` — If you set this option to `true`, the model expects only one array-type column as input instead of multiple columns of training data. Each array row in the input column must be of the same size. Regardless of whether you use this option, the model treats training data the same with labeling and weight scoring. The default value is `false`.

`loadBalance` — If you set this option, the database appends the `USING load_balance_shuffle = <value>` clause to all intermediate SQL queries the model executes during training where `value` is the specified option value (`true` or `false`). The default value is unspecified. In this case, the database does not add this clause.

`queryInternalParallelism` — If you set this option, the database appends the `USING parallelism = <value>` clause to all intermediate SQL queries the model executes during training where `value` is the specified positive integer value. The default value is unspecified. In this case, the database does not add this clause.

`skipDropTable` — If you set this option to `false`, the database deletes any intermediate tables created during model training. If you set this option to `true`, the database prevents the deletion of any intermediate tables created during model training. The default value is `false`.

### Execute the Model

When you execute the trained LDA model, you must provide the same original input features in the same order, followed by a positive integer argument that specifies the LDA component to return. The LDA component index starts at 1.

```sql SQL theme={null}
SELECT
  my_model(col1, col2, col3, col4, 2) as component2,
  my_model(col1, col2, col3, col4, 3) as component3,
FROM public.my_table;
```

## Bibliography

Bahmani, Bahman, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. “Scalable K-Means++.” Proceedings of the VLDB Endowment 5, no. 7 (2012): 622–33. [https://doi.org/10.14778/2180912.2180915](https://doi.org/10.14778/2180912.2180915).

## Related Links

[Classification Models](/classification-models)

[Other Models](/other-models)

[Machine Learning Models](/machine-learning-models)

#
