Clustering Analysis and Dimensionality Reduction
This tutorial uses examples to explain the clustering and dimensionality reduction capabilities in OcientML.

Clustering Models

Clustering models bear similarities to classification models, but they use unsupervised learning, which means they do not use any class labels. Instead, the algorithm tries to identify groupings on its own by finding clusters of data points that are close to each other and far away from other clusters. These models generate an integer label, but its ordering is arbitrary. The Ocient clustering models require you to specify the number of clusters upfront.

K-Means Clustering

K-means is by far the most well-known clustering algorithm because it is simple and fast. K-means performs particularly well if you can scale your features so that clusters are roughly circular and equal in size.

These examples demonstrate the k-means model. The examples do not use a data set that is primed for optimal performance of the model, so they show both the shortcomings of the k-means model and how it can still perform relatively well with suboptimal data.

The examples use a data set of three-dimensional points.

SELECT x, y, z FROM mldemo.clusters_3d LIMIT 10;

x                     y                     z
----------------------------------------------------------------
2.332818550827427     2.1009728159912044    0.7199553405333495
0.27417054580238553   0.29545777905624226   0.6797340230215586
2.8608875606526794    2.8532647291218884    0.7296054608643955
4.9539478834915345    4.987358010702081     0.818608017693332
0.7043958454569171    0.7209720558364705    0.7879961167668906
4.949298606341526     4.956911476243746     0.6015700102502752
2.245935331812275     2.158347482865157     1.0786578947921444
1.3725012757407247    1.5652243270484643    2.5951429274728794
4.516051945094492     4.560705560578962     0.1569348000058415
5.140105808358035     4.9583101530224       1.0064835641873153

Fetched 10 rows

This example shows a k-means model created over this data. You can specify multiple options, but the only required one is the k value, which represents the number of clusters.

CREATE MLMODEL kmeans1 TYPE kmeans ON ( SELECT x, y, z FROM mldemo.clusters_3d ) OPTIONS('k' -> '3');

Modified 0 rows

This data set has labels, but they are hidden from the model to allow for unsupervised learning. These example queries compare how well the unsupervised k-means clustering performs against the actual correct classes and labels. Before you can do that, you must figure out which cluster numbers correspond to which labels.

SELECT count(*), cluster_num
FROM ( SELECT kmeans1(x, y, z) AS cluster_num FROM mldemo.clusters_3d WHERE label = 'left' )
GROUP BY cluster_num;

count(*)   cluster_num
----------------------
3165       0
146329     1

Fetched 2 rows

SELECT count(*), cluster_num
FROM ( SELECT kmeans1(x, y, z) AS cluster_num FROM mldemo.clusters_3d WHERE label = 'right' )
GROUP BY cluster_num;

count(*)   cluster_num
----------------------
3877       0
145761     2

Fetched 2 rows

SELECT count(*), cluster_num
FROM ( SELECT kmeans1(x, y, z) AS cluster_num FROM mldemo.clusters_3d WHERE label = 'center' )
GROUP BY cluster_num;

count(*)   cluster_num
----------------------
700868     0

Fetched 1 row

The results indicate that cluster 0 is the center class, cluster 1 is left, and cluster 2 is right. With that information, you can compute the overall accuracy.

SELECT count(*) / 1000000.0
FROM mldemo.clusters_3d
WHERE ( kmeans1(x, y, z) = 0 AND label = 'center' )
   OR ( kmeans1(x, y, z) = 1 AND label = 'left' )
   OR ( kmeans1(x, y, z) = 2 AND label = 'right' );

count(*) / 1000000.0
--------------------
0.992958

Fetched 1 row

The accuracy is very good. This calculation indicates that the model never miscategorized center, but there were some incorrect classifications for the left and right labels, although they were rare.
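As a side note, you can also cross-tabulate labels and cluster assignments in a single pass instead of running one query per label. This is a minimal sketch that reuses only the kmeans1 model function and the mldemo.clusters_3d table from the examples above; its per-label counts should match the three queries shown earlier.

SELECT label, cluster_num, count(*)
FROM ( SELECT label, kmeans1(x, y, z) AS cluster_num FROM mldemo.clusters_3d )
GROUP BY label, cluster_num;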
For details, see Clustering and Dimension Reduction Models.

Gaussian Mixture Models

Gaussian mixture models (GMMs) also perform clustering, but they use a significantly more complex algorithm. GMMs can handle several things that k-means cannot:

- GMMs can handle clusters that are not circular, i.e., clusters that have different variances in different directions.
- GMMs can handle clusters that have an arbitrary rotation, i.e., clusters that have covariances.
- GMMs can handle the fact that all clusters might not be equally likely, i.e., if a point is located right between two clusters, it is more likely to belong to the cluster that is more common in the training data.
- GMMs can show the probability of a new point belonging to each cluster rather than outputting a single cluster value.

GMMs operate by finding a weighted mixture of k multivariate Gaussians that is most likely to represent the population from which the data was sampled. Each Gaussian in the mixture has a mean vector, which represents the center of each cluster. After you create the model, it is simple to determine the cluster that a point most likely belongs to. The examples demonstrate how to find the highest-probability cluster.

This example makes a model over the same data set. This model requires the numdistributions option, which represents the number of clusters.

CREATE MLMODEL gmm TYPE gaussian_mixture_model ON ( SELECT x, y, z FROM mldemo.clusters_3d ) OPTIONS('metrics' -> 'true', 'numdistributions' -> '3');

Modified 0 rows

This query executes the GMM model.

SELECT gmm(0, 0, 0) AS class_probabilities;

class_probabilities
--------------------------------------------------------------------
[[2.0445296213545875e-4, 0.999582295498657, 2.1325153920763012e-4]]

Fetched 1 row

The results are very different from executing a k-means model because these are probabilities. In this case, the output means the probability of (0, 0, 0) belonging to the second class is very high (greater than 99.9 percent), while the probability of it belonging to either of the other classes is essentially zero.

To find the most likely class, use the vector_argmax() function.

SELECT vector_argmax(gmm(0, 0, 0)) AS class_probabilities;

class_probabilities
-------------------
1

Fetched 1 row

You can write a query to see how accurate this model is. But first, you must figure out the association between classes and labels.

SELECT count(*), cluster_num
FROM ( SELECT vector_argmax(gmm(x, y, z)) AS cluster_num FROM mldemo.clusters_3d WHERE label = 'left' )
GROUP BY cluster_num;

count(*)   cluster_num
----------------------
149483     2
2          0
9          1

Fetched 3 rows

SELECT count(*), cluster_num
FROM ( SELECT vector_argmax(gmm(x, y, z)) AS cluster_num FROM mldemo.clusters_3d WHERE label = 'center' )
GROUP BY cluster_num;

count(*)   cluster_num
----------------------
700868     1

Fetched 1 row

SELECT count(*), cluster_num
FROM ( SELECT vector_argmax(gmm(x, y, z)) AS cluster_num FROM mldemo.clusters_3d WHERE label = 'right' )
GROUP BY cluster_num;

count(*)   cluster_num
----------------------
6          1
3          2
149629     0

Fetched 3 rows

This model misclassified 20 rows, which is far better than the 7,042 rows the k-means model misclassified. This is a direct result of the additional complexity of the GMM. GMM models can handle much more complex situations than k-means models, but this comes at the cost of more time spent training and executing the model. How much more time a GMM model takes compared to a k-means model depends on the number of clusters.
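To put a single accuracy number on this result, you can mirror the k-means accuracy query using the class-to-label mapping found above (left = 2, center = 1, right = 0). This is a minimal sketch built on that mapping; with 20 misclassified rows out of 1,000,000, it should return approximately 0.99998.

SELECT count(*) / 1000000.0
FROM mldemo.clusters_3d
WHERE ( vector_argmax(gmm(x, y, z)) = 1 AND label = 'center' )
   OR ( vector_argmax(gmm(x, y, z)) = 2 AND label = 'left' )
   OR ( vector_argmax(gmm(x, y, z)) = 0 AND label = 'right' );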
For details, see Clustering and Dimension Reduction Models.

Dimensionality Reduction

Dimensionality reduction algorithms reduce the number of input features while keeping as much of the meaningful properties of the data as possible. After a dimensionality reduction model reduces the number of input features, models are quicker to build and often of higher quality. A common first step in an analysis is to use dimensionality reduction to simplify the data. These examples demonstrate how to use dimensionality reduction as a preprocessing step before using other model types.

Principal Component Analysis

Principal component analysis (PCA) is an unsupervised algorithm that operates only on the inputs and does not understand what the data is being used for. It is also a linear dimensionality reduction algorithm, meaning that the new features it generates are linear combinations of existing features. PCA generates as many new features as there are input features, so by itself it does not reduce the number of dimensions. However, PCA creates new features that try to maximize variance and sorts the new features by the amount of variance they contain. The system catalog tables provide information on how much variance is contained in the new features. You can use this information to determine how many trailing new features to drop.

This PCA tutorial uses a new data set. Start with a regression problem that tries to find the best-fit polynomial for f(x1, x2, x3) = y. The walkthrough starts with a polynomial regression model (see Regression Models) to see how well it does using three input features, and then it uses PCA to reduce the number of variables without losing accuracy.

This example shows the polynomial regression model.

CREATE MLMODEL poly TYPE polynomial_regression ON ( SELECT x1, x2, x3, y FROM mldemo.pca_poly ) OPTIONS('order' -> '2', 'metrics' -> 'true');

Modified 0 rows

The system catalog table shows that the model fits the data perfectly.

SELECT coefficient_of_determination
FROM sys.machine_learning_models a, sys.polynomial_regression_models b
WHERE a.id = b.machine_learning_model_id AND name = 'poly';

coefficient_of_determination
----------------------------
1.0

Fetched 1 row

In this instance, PCA can reduce the three independent variables to two. The first step is to build a PCA model over the input features.

CREATE MLMODEL pca1 TYPE principal_component_analysis ON ( SELECT x1, x2, x3 FROM mldemo.pca_poly );

Modified 0 rows

The model does not include y because PCA operates over only the input features. Examine the machine_learning_models and principal_component_analysis_models system catalog tables.

SELECT importance
FROM sys.machine_learning_models a, sys.principal_component_analysis_models b
WHERE a.id = b.machine_learning_model_id AND name = 'pca1';

importance
---------------------------------------------------------------
[0.6022962127628679, 0.33286202821081917, 0.06484175902631285]

Fetched 1 row

The importance values indicate that over 93 percent of the signal is in the first two PCA output features, which means it is possible to have a robust model even if you remove the third feature.
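If you want to look at the reduced features themselves before building a model on them, you can execute the PCA model function directly. This is a minimal sketch that relies only on the pca1 function signature used in this tutorial; the pc1 and pc2 column aliases are illustrative.

SELECT pca1(x1, x2, x3, 1) AS pc1, pca1(x1, x2, x3, 2) AS pc2
FROM mldemo.pca_poly
LIMIT 5;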
This example uses the polynomial regression model to access the PCA output features. The example executes the pca1 PCA function with the PCA input features followed by the number of the output feature to return. This numbering starts at 1, so this example uses PCA features 1 and 2, but not the last feature.

CREATE MLMODEL poly2 TYPE polynomial_regression ON ( SELECT pca1(x1, x2, x3, 1), pca1(x1, x2, x3, 2), y FROM mldemo.pca_poly ) OPTIONS('order' -> '2', 'metrics' -> 'true');

Modified 0 rows

Notice that the example still references all three input features, but this model now has only two independent variables. As a result, the model trains 15 percent faster than the version with three independent variables. It is also a much simpler model: the polynomial regression has six terms instead of the 10 it had with three independent variables.

This query examines how well the model fits the data.

SELECT coefficient_of_determination
FROM sys.machine_learning_models a, sys.polynomial_regression_models b
WHERE a.id = b.machine_learning_model_id AND name = 'poly2';

coefficient_of_determination
----------------------------
0.9952839146818417

Fetched 1 row

The model is a great fit, with more than 99 percent accuracy.
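To make predictions with the reduced model, pass the PCA output features into the regression model function, the same way the LDA example later in this tutorial composes model functions. This is a sketch that assumes a polynomial regression model executes as a function just like the other models shown here; the predicted_y alias is illustrative.

SELECT poly2(pca1(x1, x2, x3, 1), pca1(x1, x2, x3, 2)) AS predicted_y, y
FROM mldemo.pca_poly
LIMIT 5;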
In contrast, here is an example where PCA is not a good idea. This example goes back to the three-dimensional clusters data set used in the clustering examples earlier in this tutorial. This example creates the PCA model and checks the importance values of the PCA output features in the machine_learning_models and principal_component_analysis_models system catalog tables.

CREATE MLMODEL pca2 TYPE principal_component_analysis ON ( SELECT x, y, z FROM mldemo.clusters_3d );

Modified 0 rows

SELECT importance
FROM sys.machine_learning_models a, sys.principal_component_analysis_models b
WHERE a.id = b.machine_learning_model_id AND name = 'pca2';

importance
---------------------------------------------------------------
[0.3688781607553818, 0.3333390832624229, 0.2977827559821952]

Fetched 1 row

In this case, all three PCA output features are fairly evenly weighted. By removing the last feature, the model would cover only about 70 percent of the signal. This strongly indicates that proceeding with a two-feature model would be problematic and inaccurate.

Linear Discriminant Analysis

While PCA is unsupervised, linear discriminant analysis (LDA) is a supervised dimensionality reduction algorithm. LDA works only with numeric classification algorithms, but the classification can be binary or multi-class. Hence, LDA understands how the data is going to be used, which makes it operate differently.

Creating an LDA model is mostly similar to PCA. One difference is that LDA also requires a class or label. This tutorial uses the three-dimensional cluster data that the PCA model struggled with. The example uses LDA to reduce the features of the cluster data by using the label already contained in the data set.

CREATE MLMODEL lda1 TYPE linear_discriminant_analysis ON ( SELECT x, y, z, label FROM mldemo.clusters_3d );

Modified 0 rows

As with PCA, there is an importance column in the system catalog tables.

SELECT importance
FROM sys.machine_learning_models a, sys.linear_discriminant_analysis_models b
WHERE a.id = b.machine_learning_model_id AND name = 'lda1';

importance
----------------------------------------------------------------------
[1.0000000000000133, 1.4051695910079908e-20, 1.3342566035320652e-14]

Fetched 1 row

This output strongly indicates that only the first LDA output feature matters, which means it is possible to make a single-feature classification model.

This example takes the LDA-reduced feature as input and uses it to create a neural network model, using the feedforward network model type. Note that this model uses the cross-entropy loss option to enable multi-class classification.

CREATE MLMODEL three_d_clusters_with_1_feature TYPE feedforward_network ON
( SELECT lda1(x, y, z, 1),
         CASE WHEN label = 'left' THEN { { 1.0, 0.0, 0.0 } }
              WHEN label = 'right' THEN { { 0.0, 1.0, 0.0 } }
              ELSE { { 0.0, 0.0, 1.0 } }
         END AS target
  FROM mldemo.clusters_3d )
OPTIONS( 'metrics' -> 'true', 'hiddenlayers' -> '2', 'hiddenlayersize' -> '4', 'outputs' -> '3', 'lossfunction' -> 'cross_entropy_loss', 'usesoftmax' -> 'true' );

Modified 0 rows

Execute the model for a new point to see if it returns a reasonable value.

SELECT three_d_clusters_with_1_feature(lda1(0, 0, 0, 1)) AS predicted;

predicted
----------------------------------------------------------------------
[[0.007386760165583984, 0.0022085315302674473, 0.9904047083041486]]

Fetched 1 row

In this case, the query asks for a prediction for the input (0, 0, 0). This input first passes through the LDA model because that is what the neural network model was built on. The query asks the LDA model for its first component, and then executes the neural network model with that value. The highest-probability class it returns is the third class, which the query defined as center.

SELECT count(*) / 1000000.0
FROM mldemo.clusters_3d
WHERE ( vector_argmax( three_d_clusters_with_1_feature(lda1(x, y, z, 1)) ) = 0 AND label = 'left' )
   OR ( vector_argmax( three_d_clusters_with_1_feature(lda1(x, y, z, 1)) ) = 1 AND label = 'right' )
   OR ( vector_argmax( three_d_clusters_with_1_feature(lda1(x, y, z, 1)) ) = 2 AND label = 'center' );

count(*) / 1000000.0
--------------------
0.818108

Fetched 1 row

This model is nearly 82 percent accurate despite being simplified from three features to one. While the accuracy is worse, the time required to train this model is greatly reduced.

Related Links

- Machine Learning Model Functions
- Clustering and Dimension Reduction Models