Clustering Models
Clustering models resemble classification models, but they use unsupervised learning, which means they do not use class labels. Instead, the algorithm identifies groupings on its own by finding points that lie close to each other and far from the points in other clusters. These models output an integer cluster label, but the numbering is arbitrary. You must specify the number of clusters up front.
K-Means Clustering
K-means is by far the best-known clustering algorithm because it is simple and fast. K-means performs particularly well if you can scale your features so that the clusters are roughly circular and equal in size. These examples demonstrate the k-means model. The data set is deliberately not tuned for the model, so the examples show both the shortcomings of k-means and how well it can perform even with suboptimal data. The examples use a data set of three-dimensional points.
SQL
k value, which represents the number of clusters.
SQL
SQL
0 is the CENTER class, cluster 1 is LEFT, and cluster 2 is RIGHT.
With that information, you can compute the overall accuracy.
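As a rough illustration of the accuracy calculation, the following Python sketch maps each cluster number to its label and counts the matches. It is not the SQL used in this tutorial. The `accuracy` helper and the sample rows are made up for illustration, and only the cluster-to-label mapping comes from the text above.

```python
# Cluster-to-label mapping as described in the text above.
CLUSTER_TO_LABEL = {0: "CENTER", 1: "LEFT", 2: "RIGHT"}

def accuracy(rows):
    """rows: iterable of (predicted_cluster_id, true_label) pairs."""
    rows = list(rows)
    correct = sum(1 for cluster, label in rows
                  if CLUSTER_TO_LABEL[cluster] == label)
    return correct / len(rows)

# Made-up rows for illustration only; not results from the example data set.
sample = [(0, "CENTER"), (1, "LEFT"), (2, "RIGHT"), (1, "RIGHT")]
overall = accuracy(sample)
print(overall)
```

Because cluster numbering is arbitrary, a mapping like this one has to be re-derived whenever the model is retrained.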
SQL
CENTER, but there were some wrong classifications for the LEFT and RIGHT labels, although they were rare.
For details, see K-Means Clustering.
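To make the algorithm concrete, here is a minimal Python sketch of the textbook k-means loop (Lloyd's algorithm) on three-dimensional points. It is an illustration under stated assumptions, not the database's implementation. The `kmeans` function, the first-k-points seeding, and the six sample points are all hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Textbook Lloyd's algorithm; returns (centroids, labels)."""
    # Assumption: seed the centroids with the first k points.
    # This is deterministic but not robust; real systems seed more carefully.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(axis) / len(members)
                                for axis in zip(*members)]
    return centroids, labels

# Made-up points forming three well-separated groups along the x axis.
points = [(-5.0, 0.1, 0.0), (0.0, 0.2, 0.1), (5.0, -0.1, 0.0),
          (-5.2, -0.1, 0.1), (0.1, 0.0, -0.1), (5.1, 0.0, 0.2)]
centroids, labels = kmeans(points, k=3)
print(labels)
```

The label each group receives depends only on the seeding order, which is why the integer labels carry no meaning of their own.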
Gaussian Mixture Models
Gaussian mixture models (GMMs) also perform clustering, but they use a significantly more complex algorithm. GMMs can handle several things that k-means cannot:
- GMMs can handle clusters that are not circular, i.e., clusters that have different variances in different directions.
- GMMs can handle clusters that have an arbitrary rotation, i.e., clusters whose features have nonzero covariances.
- GMMs can handle clusters that are not equally likely. For example, a point located exactly between two clusters more likely belongs to the cluster that is more common in the training data.
- GMMs can show the probability of a new point belonging to each cluster rather than outputting a single cluster value.
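The last two points can be sketched in a few lines of plain Python. The example below is a one-dimensional illustration, not the SQL used in this tutorial. The `responsibilities` helper and the mixture parameters are hypothetical. It computes the probability that a point belongs to each component of an already-fitted mixture by weighting each Gaussian density by how common its cluster is.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibilities(x, components):
    """components: list of (weight, mean, std) tuples.
    Returns the posterior probability of each component given x."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Hypothetical mixture: equal spread, but the first cluster is
# four times as common as the second.
components = [(0.8, -1.0, 1.0), (0.2, 1.0, 1.0)]

# x = 0 is equidistant from both means, so only the weights differ.
probs = responsibilities(0.0, components)
print(probs)
```

K-means would treat the midpoint as a tie, whereas the mixture assigns it to the more common cluster with probability of about 0.8.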
numDistributions option, which represents the number of clusters.
SQL
SQL
(0,0,0) being the second class is very high (greater than 99.9 percent), while the probability of it being the other classes is essentially zero.
To find the most likely class, use the VECTOR_ARGMAX function, which returns a 1-based class index.
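The 1-based convention is easy to get wrong when post-processing results in a 0-based language. The short Python sketch below shows the idea. It is not the implementation of VECTOR_ARGMAX. The `argmax_one_based` helper, the probability vector, and the class names are made up, and the tie handling of the real function is not specified here.

```python
def argmax_one_based(values):
    """Return the 1-based position of the largest value.
    Assumption: the first maximum wins on ties."""
    best = max(range(len(values)), key=lambda i: values[i])
    return best + 1

# Made-up probability vector and hypothetical class names.
probabilities = [0.0004, 0.9993, 0.0003]
class_names = ["first", "second", "third"]

index = argmax_one_based(probabilities)
# Subtract 1 when using the result to index a 0-based Python list.
name = class_names[index - 1]
print(index, name)
```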
SQL
SQL
Dimensionality Reduction
Dimensionality reduction algorithms reduce the number of input features while keeping as many of the meaningful properties of the data as possible. Models are often quicker to build and of higher quality after a dimensionality reduction model reduces the number of input features. A common first step in an analysis is to use dimensionality reduction to simplify the data. These examples demonstrate how to use dimensionality reduction as a preprocessing step before using other model types.
Principal Component Analysis
Principal component analysis (PCA) is an unsupervised algorithm that operates only on the inputs and does not understand what the data is being used for. It is also a linear dimensionality reduction algorithm, meaning that the new features it generates are linear combinations of the existing features. PCA generates as many new features as there are input features, so by itself it does not reduce the number of dimensions. However, PCA creates new features that maximize variance and sorts them by the amount of variance they contain. The system catalog tables show how much variance each new feature contains. You can use this information to decide how many trailing new features to drop.
This PCA tutorial uses a new data set. Start with a regression problem that is trying to find the best-fit polynomial for f(x1, x2, x3) = y. The tutorial starts with a Polynomial Regression model to see how well it does using three input features, and then it uses PCA to reduce the number of variables without losing accuracy.
This example shows the Polynomial Regression model.
SQL
SQL
SQL
y because PCA operates over only the input features.
Examine the machine_learning_models and principal_component_analysis_models system catalog tables.
SQL
importance value indicates that over 93 percent of the signal is in the first two PCA output features, which means it is possible to have a robust model even if you remove the third feature.
This example builds a Polynomial Regression model on the PCA output features. The example executes the pca1 PCA function with the PCA input features followed by the index of the output feature to return. This index starts at 1, so this example uses PCA features 1 and 2, but not the last feature.
SQL
SQL
importance value of the PCA output features in the machine_learning_models and principal_component_analysis_models system catalog tables.
SQL
Linear Discriminant Analysis
While PCA is unsupervised, linear discriminant analysis (LDA) is a supervised dimensionality reduction algorithm. LDA works only with numeric classification algorithms, but the classification can be binary or multi-class. Because it is supervised, LDA understands how the data will be used, so it operates differently from PCA. Creating an LDA model is similar to creating a PCA model, except that LDA also requires a class or label. This tutorial uses the three-dimensional cluster data that the PCA model struggled to work with. The example uses LDA to reduce the features of the cluster data by using the label already contained in the data set.
SQL
importance column in the system catalog tables.
SQL
FEEDFORWARD NETWORK model. Note that this model uses the cross_entropy_loss option to enable multi-class classification.
SQL
SQL
0, 0, 0. This input first passes through the LDA model because that is what the neural network model was built on.
The query asks the LDA model for its first component, and then runs the neural network on that value. The neural network outputs a vector of class probabilities; VECTOR_ARGMAX returns the 1-based index of the highest probability.
In this example, the highest probability class is the third class, which the query defines as CENTER.
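To give a sense of what a "first component" from a supervised reduction looks like, the Python sketch below computes the textbook two-class Fisher discriminant direction for two-dimensional points and projects each point onto it. It is a simplified illustration, not the database's LDA implementation. It uses two classes and two features rather than the three of each in this tutorial, omits the neural network step, and all names and sample points are hypothetical.

```python
def mean(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, m):
    """Within-class scatter entries (sxx, sxy, syy) for one class."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    return sxx, sxy, syy

def fisher_direction(class_a, class_b):
    """Return w = Sw^-1 (mean_b - mean_a) for two classes of 2-D points."""
    ma, mb = mean(class_a), mean(class_b)
    axx, axy, ayy = scatter(class_a, ma)
    bxx, bxy, byy = scatter(class_b, mb)
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy * sxy
    dx, dy = mb[0] - ma[0], mb[1] - ma[1]
    # Inverse of the symmetric 2x2 matrix applied to the mean difference.
    return [(syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det]

def project(point, w):
    """Reduce a 2-D point to a single discriminant score."""
    return point[0] * w[0] + point[1] * w[1]

# Made-up labeled points for two classes.
left = [(-2.0, 0.5), (-2.5, -0.5), (-1.5, 0.0), (-2.0, -1.0)]
right = [(2.0, 0.0), (2.5, 1.0), (1.5, -0.5), (2.0, 0.5)]

w = fisher_direction(left, right)
left_scores = [project(p, w) for p in left]
right_scores = [project(p, w) for p in right]
print(max(left_scores), min(right_scores))
```

Unlike PCA, the direction here is chosen using the labels, so a single projected value is enough to separate the two sample classes.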
SQL

