Classification Analysis
Classification machine learning algorithms are available in OcientML. There are many popular classification algorithms, which work in slightly different ways and have different strengths and weaknesses. These walkthroughs and examples cover some basic rules for using classification algorithms and show how you can take advantage of these models in Ocient. It can be very challenging to anticipate which classification model is best suited for a data set. Often, it is best to experiment with multiple models and compare the results.

Logistic Regression

Despite its name, logistic regression is a classification algorithm, not a regression algorithm. Logistic regression is generally a good first step for trying out a classification model before moving on to more advanced options.

Logistic regression shares many restrictions with the support vector machine model. Both models require that all features are numeric. However, you can still use non-numeric data if you convert it using one-hot encoding. The class labels can be any data type.

Use one-hot encoding to convert non-numeric data into numeric form. For example, if a feature is a color (such as red, green, or blue), then one-hot encoding can create three columns named is_red, is_green, and is_blue. For each row, set the column for the matching color to 1 and the other columns to 0.

This example uses the lr_input table, which represents the academic performance of students.

create table mldemo.lr_input as (
  select hours_studied, grade_in_prev_course,
    case when rand() > (1.0 - effective_score) then true else false end as got_top_marks
  from (
    select hours_studied, grade_in_prev_course,
      (2 * grade_in_prev_course + hours_studied) / 10.0 as effective_score
    from (
      select mod(a.c1, 10) as hours_studied, mod(b.c1, 5) as grade_in_prev_course
      from sys.dummy100 a, sys.dummy100 b
    )
  )
);

There are two features: the number of hours spent studying (hours_studied) and the grade in the previous course (grade_in_prev_course). Both input features are numeric. There are also two classes: true and false. Class values do not need to be Boolean; they can be any distinct values.

Students who have better grades in the previous course and spend more hours studying are more likely to have top marks in the course. However, this table also includes a randomness factor, which means that rows with the same features can have different classifications. This is normal in real-world data.

This example creates the logistic regression model on the table, including the metrics option. This option causes the model to calculate the percentage of samples that are correctly classified. The Ocient system puts this information into the system catalog tables.

create mlmodel lr1 type logistic regression on (
  select * from mldemo.lr_input
) options('metrics' -> 'true');

This query selects from the machine learning models and logistic regression models system catalog tables.

select name, on_select, num_arguments, zero_case, one_case, correctly_classified
from sys.machine_learning_models a, sys.logistic_regression_models b
where a.id = b.machine_learning_model_id and name = 'lr1';

name  on_select                      num_arguments  zero_case                one_case                  correctly_classified
lr1   select * from mldemo.lr_input  2              cast('true' as boolean)  cast('false' as boolean)  0.8397

Fetched 1 row.

The model captures the two values for the two target classes. The correctly_classified column shows that the model classifies almost 84 percent of the training data correctly. This level of accuracy is good, especially for a model trained on data with significant randomness.
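As a quick sanity check, you can reproduce this metric yourself by executing the model as a function against the training table, just as the later examples in this section do. This is a minimal sketch that assumes the table still contains its original 10,000 rows (sys.dummy100 crossed with sys.dummy100):

-- Recompute training accuracy by comparing each stored label
-- against the model's prediction for that row's features.
select count(*) / 10000.0 as accuracy
from mldemo.lr_input
where got_top_marks = lr1(hours_studied, grade_in_prev_course);

The result should agree with the correctly_classified value in the system catalog.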
This model can predict which classification is most likely if you provide the number of hours studied and the grade in the previous course. First, review the average hours spent studying for each classification.

select avg(hours_studied) from mldemo.lr_input where got_top_marks;

(sum(hours_studied))/(nullif(double(count(hours_studied)), (0)))
5.231264244536801

Fetched 1 row.

select avg(hours_studied) from mldemo.lr_input where not got_top_marks;

(sum(hours_studied))/(nullif(double(count(hours_studied)), (0)))
2.3534041715859897

Fetched 1 row.

As expected, the rows where got_top_marks is true have a higher average value for hours_studied. The same should be true for grade_in_prev_course.

select avg(grade_in_prev_course) from mldemo.lr_input where got_top_marks;

(sum(grade_in_prev_course))/(nullif(double(count(grade_in_prev_course)), (0)))
2.3535326451266925

Fetched 1 row.

select avg(grade_in_prev_course) from mldemo.lr_input where not got_top_marks;

(sum(grade_in_prev_course))/(nullif(double(count(grade_in_prev_course)), (0)))
0.9622195985832349

Fetched 1 row.

You can also query for the rows that are anomalies. In most cases, these are students who achieved top marks despite minimal study hours and low grades in their previous classes.

select * from mldemo.lr_input
where got_top_marks != lr1(hours_studied, grade_in_prev_course)
limit 10;

hours_studied  grade_in_prev_course  got_top_marks
1              1                     true
1              1                     true
1              4                     false
1              1                     true
1              2                     true
1              0                     true
1              2                     true
1              1                     true
1              2                     true
1              1                     true

Fetched 10 rows.

For details about using this model type, see Classification Models.

Support Vector Machine

Support vector machine (SVM) tries to find a hyperplane that divides data into two classes with these objectives:
- The hyperplane has the maximum margin, meaning the greatest distance between the two classes.
- The model correctly classifies the highest percentage of data points.

By default, SVM models in Ocient try to balance these two objectives. You can increase the priority of either objective by using the regularizationcoefficient option. For data sets with two features, the SVM hyperplane is a straight line on a Cartesian graph; for three features, the hyperplane is a plane.

This example builds an SVM model on the same classification data set of student performance used in the logistic regression section.

create mlmodel svm1 type support vector machine on (
  select * from mldemo.lr_input
) options('metrics' -> 'true');

Modified 0 rows.

The machine learning models and support vector machine models system catalog tables contain almost the same information for SVM as for logistic regression.

select name, on_select, num_arguments, negative_case, positive_case, correctly_classified
from sys.machine_learning_models a, sys.support_vector_machine_models b
where a.id = b.machine_learning_model_id and name = 'svm1';

name  on_select                      num_arguments  negative_case             positive_case            correctly_classified
svm1  select * from mldemo.lr_input  2              cast('false' as boolean)  cast('true' as boolean)  0.8397

Fetched 1 row.

The system catalog tables indicate the same percentage of correctly classified rows as the logistic regression model (about 84 percent).
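Because both trained models are executable as functions, you can also compare their predictions side by side. This is a minimal sketch: the feature values (8 hours studied, a grade of 4 in the previous course) are an arbitrary choice, and the exact predictions depend on the trained models.

-- Execute both classifiers on the same hypothetical student.
select lr1(8, 4)  as lr_prediction,
       svm1(8, 4) as svm_prediction;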
The SVM model can often increase the accuracy further by using a kernel. A kernel is a transformation that makes data linearly separable. SVM in Ocient allows you to specify any kernel of your choice.

For example, suppose that a circle can separate two classes of data, dividing data points between those inside and outside the circle. The SVM model cannot find a line that divides the two classes very well, because this problem is not linearly separable. However, you can transform the data into three dimensions (three features) to make it linearly separable. This is the idea of a kernel.

You specify the kernel using the options function1, function2, function3, and so on. Each function defines the formula for the kernel to create a new feature. Functions can use features from the original data set, represented by x1, x2, x3, and so on.

To illustrate, revisit the circle example. Suppose that data inside a unit circle centered at the point (2, 2) belongs to the class inside, and all other data belongs to the class outside. The data that is separable by a circle in two dimensions is linearly separable in three dimensions with the features x1, x2, and x1^2 + x2^2. These statements create the data and the model.

create table mldemo.circle as (
  select x, y,
    case when sqrt((x - 2) ^ 2 + (y - 2) ^ 2) <= 1 then 'inside' else 'outside' end as label
  from (
    select rand() * 10 as x, rand() * 10 as y from sys.dummy10000
  )
);

Modified 10000 rows.

create mlmodel svm2 type support vector machine on (
  select x, y, label from mldemo.circle
) options(
  'metrics' -> 'true',
  'function1' -> 'x1',
  'function2' -> 'x2',
  'function3' -> 'x1^2 + x2^2'
);

Modified 0 rows.

The example includes three functions, meaning that the kernel increases the data from two to three dimensions. The kernel functions refer to the original feature values x1 and x2. If you examine the correctly_classified value in the machine learning models and support vector machine models system catalog tables, you can see a model that classifies nearly everything correctly.

select name, negative_case, positive_case, correctly_classified
from sys.machine_learning_models a, sys.support_vector_machine_models b
where a.id = b.machine_learning_model_id and name = 'svm2';

name  negative_case            positive_case           correctly_classified
svm2  cast('outside' as char)  cast('inside' as char)  0.9913

Fetched 1 row.

One advantage of SVM kernels is that they are automatic. The model saves all the kernel information and transforms the data as needed. Hence, you can execute this model to predict a new point without redefining the kernel: you pass in the two features the model was built over, and a prediction comes out. The sys.machine_learning_model_options table is where the SVM model saves kernel information. This table stores all the options that you specify; some models also add keys to this dictionary to save information they need at execution time.

select svm2(2, 2);

coalesce(('inside'), (null))
inside

Fetched 1 row.
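If you want to see exactly what was saved for a model, you can query the options table directly. This is a hedged sketch: it assumes sys.machine_learning_model_options joins to the models table through a machine_learning_model_id column, following the same pattern as the other catalog tables on this page; this page does not show that table's columns.

-- Inspect every stored option (including kernel functions) for svm2.
select b.*
from sys.machine_learning_models a, sys.machine_learning_model_options b
where a.id = b.machine_learning_model_id
  and a.name = 'svm2';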
For details about using this model type, see Classification Models.

Binary Classification with Neural Networks

A neural network using the feedforward network model type can perform binary classification. To use this model for classification, you must either convert the labels to 0 and 1 (using the log loss option) or convert the labels to -1 and 1 (using the hinge loss option). Both options can achieve the same result.

This model can handle more complex cases, but for linearly separable data, using a different model might be easier. In other words, a neural network model is useful when data is difficult to separate linearly, even using a kernel.

This example shows a feedforward network model using the log loss option. Note the case statement that converts the labels to 0 or 1.

create mlmodel logloss type feedforward network on (
  select x, y,
    case when label = 'inside' then 0 else 1 end as target
  from mldemo.circle
) options(
  'metrics' -> 'true',
  'hiddenlayers' -> '2',
  'hiddenlayersize' -> '2',
  'outputs' -> '1',
  'lossfunction' -> 'log loss'
);

Modified 0 rows.

This query finds that the classification accuracy is about 97 percent.

select count(*) / 10000.0
from (
  select x, y, label,
    case when logloss(x, y) = 0 then 'inside' else 'outside' end as predicted
  from mldemo.circle
)
where label = predicted;

(count(*))/(10000.0)
0.9698

Fetched 1 row.

To use the hinge loss option, use -1 and 1 in the case statement.

create mlmodel hingeloss type feedforward network on (
  select x, y,
    case when label = 'inside' then -1 else 1 end as target
  from mldemo.circle
) options(
  'metrics' -> 'true',
  'hiddenlayers' -> '2',
  'hiddenlayersize' -> '2',
  'outputs' -> '1',
  'lossfunction' -> 'hinge loss'
);

Modified 0 rows.

The accuracy of the hinge loss version is exactly the same as the log loss version.

select count(*) / 10000.0
from (
  select x, y, label,
    case when hingeloss(x, y) = -1 then 'inside' else 'outside' end as predicted
  from mldemo.circle
)
where label = predicted;

(count(*))/(10000.0)
0.9698

Fetched 1 row.

For details about using this model type, see Other Models.

K-Nearest Neighbors

K-nearest neighbors (KNN) can only use numeric features, but KNN handles multi-class classification in the Ocient system: the algorithm can handle an arbitrary number of target classes. One drawback of KNN is that it does no work at training time other than taking a snapshot of the training data. The algorithm operates on the data when you execute the model, so it can be a slow model for large data sets.

To demonstrate KNN functionality, this example uses a similar circle centered at (2, 2), with some modifications to use more classes. Data within a unit circle of (2, 2) is classified as inside, data within a radius of 2 is classified as middle, and any other data is classified as outside.

create table mldemo.circles as (
  select x, y,
    case
      when sqrt((x - 2) ^ 2 + (y - 2) ^ 2) <= 1 then 'inside'
      when sqrt((x - 2) ^ 2 + (y - 2) ^ 2) <= 2 then 'middle'
      else 'outside'
    end as label
  from (
    select rand() * 10 as x, rand() * 10 as y from sys.dummy10000
  )
);

Modified 10000 rows.

KNN works by finding the k nearest points to the one whose classification the model is trying to predict. k is an integer value that you must specify during model creation. The model calculates distances using normal Euclidean distances; you can override the definition of distance by specifying a formula using the distance option. After the model identifies the points, it computes a score for each class and returns the class with the highest score. The weight option controls how much each of the k points influences the score. By default, the weight value is based on the inverse of the distance (1.0/(d + 0.1)). In other words, a point closer to the point you are trying to predict has a larger influence on the output than a point farther away. Any point outside of the nearest k points has no influence.

This example uses three classes.

create mlmodel knn1 type knn on (
  select x, y, label from mldemo.circles
) options('k' -> '3');

Modified 0 rows.

These queries test some values to see if the classifications are correct.

select knn1(2, 2);

inside

Fetched 1 row.

select knn1(0.5, 2);

middle

Fetched 1 row.

select knn1(5, 5);

outside

Fetched 1 row.
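To quantify this beyond spot checks, you can score the model against the whole training table, mirroring the accuracy queries used for the earlier models. This is a minimal sketch; because KNN does its work at execution time, expect this query to run noticeably slower than the equivalent checks for the other model types.

-- Fraction of the 10,000 training rows that knn1 classifies correctly.
select count(*) / 10000.0 as accuracy
from mldemo.circles
where label = knn1(x, y);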
KNN models perform their work when you execute the model, and the amount of work increases as the input data grows. One solution is data reduction, which means finding a smaller subset of rows to build the KNN model over, enough to ensure that the accuracy of the model is still high. This method front-loads the work when you create the model so that it is much faster to execute on new data.

To use the data reduction feature, set the datareduction option to true. By default, the reduction stops when the database finds a model that is 90 percent accurate over the training data or when the reduced model grows to 1,000 rows of data. You can override these defaults by setting the maxreducedrows or targetaccuracy options. Additionally, if you are using data reduction, you can ask the database to collect metrics on the percentage of rows correctly classified by the model, because this is much faster to compute on the reduced data set.

This example builds another KNN model over the same data and uses data reduction to find a model that is at least 95 percent accurate.

create mlmodel knn2 type knn on (
  select x, y, label from mldemo.circles
) options(
  'k' -> '3',
  'datareduction' -> 'true',
  'metrics' -> 'true',
  'targetaccuracy' -> '0.95'
);

Modified 0 rows.

Query the machine learning models and k-nearest neighbor models system catalog tables to see the snapshot table used for the model and its accuracy.

select table_name, correctly_classified
from sys.machine_learning_models a, sys.k_nearest_neighbor_models b
where a.id = b.machine_learning_model_id and name = 'knn2';

table_name                 correctly_classified
temp.knn86947930347148676  0.9541

Fetched 1 row.

The model with data reduction correctly classifies about 95 percent of the training data. To see the size of the KNN model snapshot, query it by its table name.

select count(*) from temp.knn86947930347148676;

count(*)
300

Fetched 1 row.

The snapshot is only 300 rows. Without data reduction, the row count would be larger by orders of magnitude. For details about using this model type, see Classification Models.

Multi-Class Classification with Neural Networks

The feedforward network model type can also perform multi-class classification when the features are numeric. Multi-class classification works by using vectors, where each position in the vector holds the score for one class. The argmax of each output vector, that is, the index of its largest score, determines the class.

Much of this example is similar to the earlier binary classification example with neural networks, with these modifications to the model options:
- Set the lossfunction option to cross entropy loss.
- Set the outputs option to the number of classes for the model to use (in this example, 3).
- Set the usesoftmax option to true, which applies a softmax function to the final vector. When the loss function is cross entropy loss, the usesoftmax option defaults to true.

create mlmodel crossentropy type feedforward network on (
  select x, y,
    case
      when label = 'inside' then {{ 1.0, 0.0, 0.0 }}
      when label = 'middle' then {{ 0.0, 1.0, 0.0 }}
      else {{ 0.0, 0.0, 1.0 }}
    end as target
  from mldemo.circles
) options(
  'metrics' -> 'true',
  'hiddenlayers' -> '2',
  'hiddenlayersize' -> '2',
  'outputs' -> '3',
  'lossfunction' -> 'cross entropy loss',
  'usesoftmax' -> 'true'
);

Modified 0 rows.

Executing this model returns a vector.

select crossentropy(5, 5) as predicted;

predicted
[[1.3436428911280516e-7, 1.811726873154417e-7, 0.9999996844630236]]

Fetched 1 row.

To get the classification, use the vector_argmax function to retrieve the index of the largest value. Matrixes index the first row and first column with 1; however, the vector_argmax function starts the count at 0.

select vector_argmax(crossentropy(5, 5)) as predicted;

predicted
2

Fetched 1 row.
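Because vector_argmax returns an index rather than a label, it can be convenient to map the index back to a class name. This minimal sketch assumes the same class ordering as the target vectors defined at model creation (0 = inside, 1 = middle, 2 = outside):

-- Translate the argmax index back into the original label vocabulary.
select x, y,
  case vector_argmax(crossentropy(x, y))
    when 0 then 'inside'
    when 1 then 'middle'
    else 'outside'
  end as predicted
from mldemo.circles
limit 10;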
This query finds the overall accuracy of the model.

select count(*) / 10000.0
from mldemo.circles
where (label = 'inside' and vector_argmax(crossentropy(x, y)) = 0)
   or (label = 'middle' and vector_argmax(crossentropy(x, y)) = 1)
   or (label = 'outside' and vector_argmax(crossentropy(x, y)) = 2);

(count(*))/(10000.0)
0.9298

Fetched 1 row.

The model is about 93 percent accurate. You can improve the accuracy by modifying the model options (see Other Models), including hiddenlayers, hiddenlayersize, lossfunction, and numsamples. Increasing these options gives the model more resources and can improve accuracy, but it can also increase the time needed to train the model. For details about using this model type, see Other Models.

Multi-Class Classification with Support Vector Machines and Logistic Regression

Both logistic regression and SVM are binary classifiers, but you can use either one to build a multi-class classifier that handles any number of classes. OcientML can perform this automatically with either model. This example shows the create statement for the SVM model, which uses the same mldemo.circles data as the k-nearest neighbors and multi-class neural network examples.

create mlmodel circ1 type support vector machine on (
  select x, y, label from mldemo.circles
) options(
  'metrics' -> 'true',
  'function1' -> 'x1',
  'function2' -> 'x2',
  'function3' -> 'x1^2 + x2^2'
);

Modified 0 rows.

This statement uses the same kernel as the earlier binary classification example with SVM. This query checks the metrics in the machine learning models and support vector machine models system catalog tables to see the accuracy.

select correctly_classified
from sys.machine_learning_models a, sys.support_vector_machine_models b
where a.id = b.machine_learning_model_id and name = 'circ1';

correctly_classified
0.9579

Fetched 1 row.

This model is slightly more accurate than the neural network classifier. This example tries logistic regression.

create mlmodel circ2 type logistic regression on (
  select x, y, label from mldemo.circles
) options('metrics' -> 'true');

Modified 0 rows.

select correctly_classified
from sys.machine_learning_models a, sys.logistic_regression_models b
where a.id = b.machine_learning_model_id and name = 'circ2';

correctly_classified
0.9028

Fetched 1 row.

Logistic regression shows lower accuracy than the neural network model.
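With three multi-class models now built over the same data, you can compare their predictions on a single point. This is a minimal sketch; the point (3, 2) is an arbitrary choice that sits on the boundary between the inside and middle classes, and the predictions depend on the trained models.

-- circ1 and circ2 return class labels; crossentropy returns a vector,
-- so vector_argmax converts it to a class index (0, 1, or 2).
select circ1(3, 2) as svm_prediction,
       circ2(3, 2) as lr_prediction,
       vector_argmax(crossentropy(3, 2)) as nn_class_index;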
The remaining classification models can handle non-numeric data.

Naive Bayes

The first of these models is naive Bayes. This model can handle binary or multi-class classification and features of any data type. By default, naive Bayes treats each feature as a discrete value. For example, if a feature takes the values 1, 2, 3, and 4, the model does not give any numeric significance to them. In this case, the model cannot predict the class for a sample with a feature value of 3.5, because that value never appears as a feature in the training data.

However, the naive Bayes model can define certain input features as continuous features; these should be numeric values. In this case, the naive Bayes model finds the best-fit normal distribution for that feature and uses it for the probability calculations.

Naive Bayes models essentially use the probability rules of Bayes' theorem to compute the most likely class based on the training data. The rules are only correct if the features are independent (uncorrelated); however, in practice, the models tend to work well with real-world data.

This example uses this simple table to demonstrate naive Bayes models.

select * from mldemo.cars;

color   type    origin    stolen
yellow  sports  domestic  false
yellow  sports  imported  true
yellow  suv     imported  false
red     sports  domestic  false
yellow  suv     domestic  false
red     suv     imported  false
red     sports  domestic  true
yellow  suv     imported  true
red     sports  imported  true
red     sports  domestic  true

Fetched 10 rows.

color, type, and origin are the input features, which are all strings. true and false are the class labels. In this case, all of the input features are discrete. This example builds the naive Bayes model.

create mlmodel nb1 type naive bayes on (
  select color, type, origin, stolen from mldemo.cars
) options('metrics' -> 'true');

Modified 0 rows.

This query examines the information in the machine learning models and naive bayes models system catalog tables.

select name, on_select, num_arguments, result_probability_table, feature_result_matrix_table, correctly_classified
from sys.machine_learning_models a, sys.naive_bayes_models b
where a.id = b.machine_learning_model_id and name = 'nb1';

name  on_select                                                 num_arguments  result_probability_table   feature_result_matrix_table  correctly_classified
nb1   select color, "type", origin, stolen from user.nb_input   3              temp.rpt83581840700868656  temp.frm83581860860782808    0.8

Fetched 1 row.

The correctly_classified value indicates that the accuracy is 80 percent, meaning just two rows are classified incorrectly. The two table names in the temp schema hold all the data necessary for executing the model as a function. This query calls the model to classify new data.

select nb1('red', 'suv', 'domestic') as prediction;

prediction
false

Fetched 1 row.

Here is another example with a different data set to show how continuous features work with naive Bayes models.

select * from mldemo.golf;

humidity  play
70        false
75        true
86        true
95        false
80        true
85        false
96        true
90        true
91        false
90        false
70        true
70        true
80        true
65        true

Fetched 14 rows.

In this case, the only input feature is a continuous feature. However, a model can have multiple continuous features, as well as a mix of continuous and discrete features. To build a naive Bayes model over this data, you must specify in the model which input features are continuous.

create mlmodel nb2 type naive bayes on (
  select * from mldemo.golf
) options('continuousfeatures' -> '1', 'metrics' -> 'true');

Modified 0 rows.

The continuousfeatures option is a comma-separated list (no spaces) of the input column indexes that are continuous. Indexes start at 1. This query checks the correctly_classified value in the machine learning models and naive bayes models system catalog tables.

select correctly_classified
from sys.machine_learning_models a, sys.naive_bayes_models b
where a.id = b.machine_learning_model_id and name = 'nb2';

correctly_classified
0.7142857142857143

Fetched 1 row.

Naive Bayes is a powerful classification algorithm, but it operates as a black box, making it hard to assess how it makes classification decisions. For details, see Classification Models.
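One practical consequence of the continuous feature is that the model can classify humidity values that never appear in the training data, because it evaluates the fitted normal distribution rather than looking up a discrete value. This is a minimal sketch with an arbitrary humidity of 77, which is not in the training table:

-- A discrete model could not score this unseen value;
-- the continuous model can.
select nb2(77) as prediction;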
Decision Trees

Unlike naive Bayes, the decision tree model allows you to see how the model works and makes its decisions. A decision tree branches based on the value of a single feature, and the leaves of the tree are the output classes. In SQL syntax, this model uses multiple nested case statements. You can examine the whole nested case statement that defines the model, although it might be very long.

The decision tree model in Ocient has limited support for continuous features: the model finds the average value of the feature and allows a two-way branch based on whether the value is lower or higher than the average.

Creating a decision tree model is exactly like creating a naive Bayes model, except with a different type. Decision trees also use the continuousfeatures option, which functions the same way. These decision tree examples use the same data sets as the naive Bayes examples.

create mlmodel dt1 type decision tree on (
  select color, "type", origin, stolen from mldemo.cars
) options('metrics' -> 'true');

Modified 0 rows.

This query examines the machine learning models and decision tree models system catalog tables.

select correctly_classified
from sys.machine_learning_models a, sys.decision_tree_models b
where a.id = b.machine_learning_model_id and name = 'dt1';

correctly_classified
0.8

Fetched 1 row.

This model has the same 80 percent accuracy as the naive Bayes model. This query examines the case statement that defines the model.

select case_statement
from sys.machine_learning_models a, sys.decision_tree_models b
where a.id = b.machine_learning_model_id and name = 'dt1';

case_statement
case when x3 = cast('domestic' as char) then
  case when x1 = cast('red' as char) then
    case when x2 = cast('sports' as char) then cast('true' as boolean)
    else cast('false' as boolean) end
  when x1 = cast('yellow' as char) then cast('false' as boolean)
  else cast('false' as boolean) end
when x3 = cast('imported' as char) then
  case when x2 = cast('suv' as char) then
    case when x1 = cast('red' as char) then cast('false' as boolean)
    when x1 = cast('yellow' as char) then cast('true' as boolean)
    else cast('false' as boolean) end
  when x2 = cast('sports' as char) then cast('true' as boolean)
  else cast('false' as boolean) end
else cast('false' as boolean) end

Fetched 1 row.

Like the naive Bayes model, the decision tree model can predict new data.

select dt1('red', 'suv', 'domestic') as prediction;

prediction
false

Fetched 1 row.

This example builds a model over the golfing data.

create mlmodel dt2 type decision tree on (
  select * from mldemo.golf
) options('continuousfeatures' -> '1', 'metrics' -> 'true');

Modified 0 rows.

Review the machine learning models and decision tree models system catalog tables to see the accuracy of this model.

select correctly_classified
from sys.machine_learning_models a, sys.decision_tree_models b
where a.id = b.machine_learning_model_id and name = 'dt2';

correctly_classified
0.7142857142857143

Fetched 1 row.

Again, the decision tree shows the same accuracy as the naive Bayes model.

Related Links
- Machine Learning Model Functions
- Classification Models