# Clustering Analysis and Dimensionality Reduction

This tutorial uses examples to explain the clustering and dimensionality reduction capabilities in Ocient.

Clustering models bear similarities to classification, but they use unsupervised learning, which means they do not use any class labels. Instead, the algorithm tries to identify groupings on its own by finding clusters of data that seem closer to each other and farther away from other clusters.

These models assign each point an integer cluster label, but the numbering of the clusters is arbitrary. The Ocient clustering models require you to specify the number of clusters up front.

K-means is by far the best-known clustering algorithm because it is simple and fast. K-means performs particularly well if you can scale your features so that the clusters are roughly circular and equal in size.
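To make the algorithm concrete, here is a minimal Python sketch of the standard k-means procedure (Lloyd's algorithm). It is not the Ocient implementation. The function name `kmeans`, its parameters, and the random choice of starting centroids are assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Illustrative Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen input points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels
```

Because the starting centroids are chosen at random, which group receives label 0 and which receives label 1 can change from run to run. Only the grouping itself is meaningful.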

These examples demonstrate the k-means model. The examples deliberately use a data set that is not ideal for k-means, so they show both the shortcomings of the model and how it can still perform relatively well on sub-optimal data.

The examples use a data set of 3-dimensional points.

This example shows a k-means model created over this data. You can specify multiple options, but the only one that is required is the k value, which represents the number of clusters.

This data set has labels, but they are hidden from the model to allow for unsupervised learning.

These example queries compare the output of the unsupervised k-means clustering against the actual class labels.

Before you can do that, you must figure out which cluster numbers correspond with which labels.

The results indicate that cluster 0 is the CENTER class, cluster 1 is LEFT, and cluster 2 is RIGHT.
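One common way to derive such a mapping is a majority vote: for each cluster number, take the label that appears most often among its points. The Python sketch below illustrates the idea outside the database. The function name `map_clusters_to_labels` and the sample data in the usage note are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, true_labels):
    """Map each cluster number to its most frequent true label."""
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        counts[cluster][label] += 1
    # most_common(1) returns a list with one (label, count) pair.
    return {
        cluster: label_counts.most_common(1)[0][0]
        for cluster, label_counts in counts.items()
    }
```

For example, calling the function with cluster numbers `[0, 0, 1, 1, 2, 2]` and labels `["CENTER", "CENTER", "LEFT", "LEFT", "RIGHT", "RIGHT"]` maps cluster 0 to CENTER, cluster 1 to LEFT, and cluster 2 to RIGHT.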

With that information, you can compute the overall accuracy.

The accuracy is very good. This calculation indicates that the model never miscategorized CENTER points and only rarely misclassified LEFT and RIGHT points.
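The same overall and per-class accuracy figures can be sketched in Python once a mapping from cluster numbers to labels is known. The function name `accuracy_report` and its arguments are hypothetical, and the sketch assumes a non-empty input.

```python
from collections import Counter


def accuracy_report(cluster_ids, true_labels, mapping):
    """Return (overall accuracy, per-class accuracy) for a clustering."""
    correct = Counter()
    total = Counter()
    for cluster, label in zip(cluster_ids, true_labels):
        total[label] += 1
        # A point counts as correct when its cluster maps to its true label.
        if mapping.get(cluster) == label:
            correct[label] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_class = {label: correct[label] / total[label] for label in total}
    return overall, per_class
```

A per-class accuracy of 1.0 for one label and slightly lower values for the others corresponds to the pattern described above, where one class is never miscategorized and the others rarely are.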

For more information, see K-Means Clustering.

Gaussian mixture models (GMMs) also perform clustering, but they use a significantly more complex algorithm. GMMs can handle several things that k-means cannot:

- GMMs can handle clusters that are not circular, that is, clusters that have different variances in different directions.
- GMMs can handle clusters that have an arbitrary rotation, that is, clusters whose features have covariances.
- GMMs can handle clusters that are not equally likely. For example, a point located midway between two clusters is more likely to belong to the cluster that is more common in the training data.
- GMMs can show the probability of a new point belonging to each of the clusters, rather than outputting a single cluster value.

GMMs operate by finding a weighted mixture of k multivariate Gaussians that is most likely to represent the population from which the data was sampled. Each Gaussian in the mixture has a mean vector, which represents the center of its cluster.
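To illustrate how the mixture weights affect the result, the following Python sketch computes cluster membership probabilities for a simplified one-dimensional mixture. A full GMM uses multivariate Gaussians with covariance matrices and learns its parameters from data. Here the weights, means, and standard deviations are hypothetical values supplied by hand, and the function names are assumptions.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def cluster_probabilities(x, weights, means, stds):
    """Probability that x belongs to each cluster of a 1-D mixture.

    Assumes x is close enough to at least one cluster that the
    weighted densities do not all underflow to zero.
    """
    weighted = [
        w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)
    ]
    total = sum(weighted)
    return [value / total for value in weighted]
```

For a point at 0.0 that lies midway between clusters centered at -1.0 and 1.0 with equal spread, mixture weights of 0.8 and 0.2 give membership probabilities of about 0.8 and 0.2. The more common cluster wins even though the point is equally distant from both centers.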

After you create the model, it is simple to determine the cluster that a point most likely belongs to. The examples demonstrate how to find the highest probability cluster.

This example makes a model over the same data set. This model requires the numDistributions option, which represents the number of clusters.

This query executes the GMM model.

The results are very different from the output of a k-means model because they are probabilities. In this case, the output means that the probability of (0,0,0) being the second class is very high (greater than 99.9 percent), while the probability of it being either of the other classes is essentially zero.

To find the most likely class, use the vector_argmax() function.
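Conceptually, an argmax picks the position of the largest value in the probability vector. The Python sketch below shows the idea with a hypothetical helper named `argmax` and made-up probabilities. Python positions start at 0, which is not necessarily the convention that vector_argmax() follows.

```python
def argmax(values):
    """Return the 0-based position of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical probabilities for three clusters.
probabilities = [0.0004, 0.9995, 0.0001]
most_likely = argmax(probabilities)  # position 1, the second cluster
```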

You can write a query to see how accurate this model is. But first, you must figure out the association between classes and labels.

This model misclassified 20 rows, which is far better than the 7,042 rows the k-means model misclassified. This is a direct result of the additional complexity of the GMM.

GMMs can handle much more complex situations than k-means models, but this comes at the cost of more time to train and execute the model. How much longer a GMM takes than a k-means model depends on the number of clusters.

For more information, see Gaussian Mixture.

Dimensionality reduction algorithms reduce the number of input features while keeping as many of the meaningful properties of the data as possible. Models built after a dimensionality reduction step are quicker to train and are often of higher quality.

A common first step in an analysis is to use dimensionality reduction to simplify the data. These examples demonstrate how to use dimensionality reduction as a preprocessing step before using other model types.

Principal component analysis (PCA) is an unsupervised algorithm that operates only on the inputs and has no knowledge of how the data will be used. It is also a linear dimensionality reduction algorithm, meaning that the new features it generates are linear combinations of the existing features.
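As a small illustration of the idea, the following Python sketch computes the variance captured by each of the two principal components of a two-dimensional data set, using the closed-form eigenvalues of the 2x2 sample covariance matrix. It is a simplified stand-in for a general PCA, and the function name `pca_2d_variances` is hypothetical.

```python
import math


def pca_2d_variances(points):
    """Return the variances along the two principal components,
    largest first, for a list of (x, y) points (needs 2+ points)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    mid = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mid + spread, mid - spread
```

For points that lie exactly on a line, such as `(0, 0)`, `(1, 1)`, and `(2, 2)`, the second variance is zero. All of the variation is captured by the first new feature, so the second one can be dropped without losing information.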

PCA generates as many new features as there are input features, so by itself it does not reduce the number of dimensions. However, PCA creates new features that try to maximize variance and sorts them by the amount of variance they contain. The system catalog tables provide information on how much variance each new feature contains. You can use this information to decide how many trailing new features to drop.
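One simple rule for deciding how many features to keep is to retain leading features until their cumulative share of the variance reaches a chosen threshold. The Python sketch below illustrates that rule. The function name `components_to_keep`, the threshold, and the importance values in the usage note are hypothetical, and the sketch assumes the values are already sorted from largest to smallest.

```python
def components_to_keep(importances, threshold):
    """Return how many leading features are needed so that their
    cumulative share of the total importance reaches the threshold."""
    total = sum(importances)
    running = 0.0
    for count, value in enumerate(importances, start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(importances)
```

With hypothetical importance values of `[0.6, 0.33, 0.07]` and a threshold of 0.9, the function returns 2, so the last feature can be dropped. With evenly weighted values such as `[0.4, 0.3, 0.3]`, it returns 3, which signals that no feature can be dropped safely at that threshold.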

This PCA walkthrough uses a new data set. Start with a regression problem that is trying to find the best-fit polynomial for f(x1, x2, x3) = y. The walkthrough starts with a Polynomial Regression model to see how well it does using three input features, and then it uses PCA to reduce the number of variables without losing accuracy.

This example shows the Polynomial Regression model.

The system catalog table shows that the model fits the data perfectly.

In this instance, PCA can reduce the three independent variables to two.

The first step is to build a PCA model over the input features.

The model does not include y because PCA operates over only the input features.

Examine the machine_learning_models and principal_component_analysis_models system catalog tables.

The importance value indicates that over 93 percent of the signal is in the first two PCA output features, which means it is possible to have a robust model even if you remove the third feature.

This example builds the Polynomial Regression model on the PCA output features. The example executes the pca1 PCA function with the PCA input features followed by the number of the output feature to return. This number starts at 1, so this example uses PCA features 1 and 2, but not the last feature.

Notice that the example still references all three input features, but the model now has only two independent variables. As a result, the model trains 15 percent faster than the version with three independent variables. It is also a much simpler model: the polynomial regression has six terms instead of the 10 terms it had with three independent variables.
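The term counts of 10 and 6 are consistent with a full polynomial of degree 2, including the constant term. The degree is an inference from those counts, not something stated by the model definition. Under that assumption, the number of terms for n variables and degree d is the binomial coefficient C(n + d, d), which the following Python sketch checks with a hypothetical helper.

```python
import math


def polynomial_term_count(num_variables, degree):
    """Number of monomials of total degree <= degree in num_variables
    variables, including the constant term."""
    return math.comb(num_variables + degree, degree)


# Assumption: a degree-2 polynomial, inferred from the counts in the text.
three_variable_terms = polynomial_term_count(3, 2)  # 10
two_variable_terms = polynomial_term_count(2, 2)    # 6
```

The gap widens quickly as the degree grows, which is why dropping even one input feature can simplify a polynomial model considerably.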

This query examines how well the model fits the data.

The model is a great fit with more than 99 percent accuracy.

In contrast, here is an example where PCA is not a good idea. This example goes back to the 3-dimensional clusters data set used in the Clustering Models examples.

This example creates the PCA model and checks the importance value of the PCA output features in the machine_learning_models and principal_component_analysis_models system catalog tables.

In this case, all three PCA output features are fairly evenly weighted. After removing the last feature, the model covers only about 70 percent of the signal. This strongly indicates that a two-feature model would be inaccurate.

While PCA is unsupervised, linear discriminant analysis (LDA) is a supervised dimensionality reduction algorithm. LDA works only with numeric classification algorithms, but the classification can be binary or multi-class. Because LDA knows the class of each row, it can take into account how the data will be used, which makes it operate differently from PCA.

Creating an LDA model is similar to creating a PCA model, except that LDA also requires a class or label.
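The label matters because LDA looks for directions along which the classes are well separated, which is commonly measured as the ratio of between-class variation to within-class variation. The Python sketch below computes that ratio for a single feature. It is not a full LDA, which searches over linear combinations of all features. The function name `separation_score` and the data in the usage note are hypothetical.

```python
from collections import defaultdict


def separation_score(values, labels):
    """Ratio of between-class to within-class sum of squares for one
    feature. Assumes at least one class has some internal spread."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for members in groups.values():
        class_mean = sum(members) / len(members)
        # Spread of the class means around the overall mean.
        between += len(members) * (class_mean - overall_mean) ** 2
        # Spread of the points around their own class mean.
        within += sum((value - class_mean) ** 2 for value in members)
    return between / within
```

With labels `["A", "A", "B", "B"]`, the feature `[0.0, 1.0, 10.0, 11.0]` scores 100 because the classes sit far apart, while the feature `[0.0, 10.0, 1.0, 11.0]` scores 0.01 because the classes overlap almost completely. An unsupervised method cannot make this distinction, since both features have the same overall variance.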

This walkthrough uses the 3-dimensional cluster data that the PCA model struggled to work with. The example uses LDA to reduce the features of the cluster data by using the label already contained in the data set.

As with PCA, there is an importance column in the system catalog tables.

This output strongly indicates that only the first LDA output feature matters, which means that it is possible to make a single feature classification model.

This example takes the LDA-reduced feature as input and uses it to create a neural network model with the FEEDFORWARD NETWORK model type. This model uses the cross_entropy_loss option to enable multi-class classification.

Execute the model for a new point to see if it returns a reasonable value.

In this case, the query asks for a prediction for the input 0, 0, 0. This input first passes through the LDA model because that is what the neural network model was built on.

The query asks the LDA model for its first component, and the query executes the neural network model with that value. The highest probability class it returns is the third class, defined as CENTER in the query.

This model is nearly 82 percent accurate despite being simplified from three features to one. While the accuracy is worse, there is a huge reduction in the time required to train this model.