SQL Reference
Machine Learning Model Functions
Classification Models
{{ocientml}} supports classification models, which group large data sets into preset categories or subpopulations with the help of pre-classified training data sets. Machine learning classification models leverage various algorithms to classify future data sets into their respective and relevant categories.

To create a model, use the CREATE MLMODEL syntax. For details, see docid\ dsqzzzggetdfwh5qasvhg

Model option names are case-sensitive.

K-Nearest Neighbors Classification

Model Type: k_nearest_neighbors

K-nearest neighbors (KNN) is a classification algorithm. The first n - 1 inputs are the features, which must be numeric. The last input column is a label, which can be any data type. There is no training step for KNN. Instead, when you create the model, the model saves a copy of all input data to a table, so that when the model is executed in a later SQL statement, a snapshot of the data the model is supposed to use is available. You can override both the weight function and the distance function.

Model Options

Required

k — This option must be a positive integer that specifies how many closest points to use for classifying a new point.

Optional

datareduction — If you set this option to true, you enable the data reduction feature, which finds a smaller subset of rows to build the KNN model over while ensuring that the accuracy of the model remains high. By default, the execution stops when the database finds a model that is 90 percent accurate over the training data or when the reduced model grows to 1,000 rows of data. You can override these defaults by setting the maxreducedrows or targetaccuracy options (see the sketch at the end of this section).

distance — If you specify this option, the value must be a function in SQL syntax for calculating the distance between a point used for classification and points in the training data set. This function should use the variables x1, x2, … for the 1st, 2nd, … features in the training data set, and p1, p2, … for the features in the point for classification. If you do not specify this option, the option defaults to the Euclidean distance.

maxreducedrows — If you specify this option, then you must set the datareduction option to true. Specify the maximum number of rows for data reduction as a positive integer. The default value is 1,000 rows of data.

metrics — If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. The default value is false.

normalize — If you set this option to true, the model normalizes the data before the start of training. Normalization shifts the mean to 0 and scales the data so the standard deviation is 1.

targetaccuracy — If you specify this option, then you must set the datareduction option to true. Specify the model accuracy over the training data as a double value in the range (0, 1]. The default value is 90 percent accuracy, or 0.90.

weight — If you specify this option, the value must be a function in SQL syntax for calculating the weight of a neighbor. The function should use the variable d for distance. By default, the weight is 1.0 / (d + 0.1), which avoids division by zero on exact matches while still allowing other neighbors to have some influence.

loadbalance — If you set this option, the database appends the USING LOAD BALANCE SHUFFLE = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified option value (true or false). The default value is unspecified; in this case, the database does not add this clause.

queryinternalparallelism — If you set this option, the database appends the USING PARALLELISM = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.

skipdroptable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.

Execute the Model

Create a KNN model with 8 closest points for classification and the distance function power(x1 - p1, 2) + power(x2 - p2, 2) + power(x3 - p3, 2).

CREATE MLMODEL my_model TYPE k_nearest_neighbors
ON (SELECT x1, x2, x3, y1 FROM public.my_table)
OPTIONS('k' -> '8',
        'distance' -> 'power(x1 - p1, 2) + power(x2 - p2, 2) + power(x3 - p3, 2)');

After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.

When you execute the model, the model takes n - 1 features as input and returns a label. The model chooses the label from the class with the highest score, scoring classes by summing the weights of the nearest k points in the training data.

SELECT my_model(x1, x2, x3) FROM my_table;

After you execute a model, you can find the details about the execution results in the sys.k_nearest_neighbors_models system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog
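The following sketch combines the data reduction options described above. It assumes the same k_nearest_neighbors type name and OPTIONS arrow syntax as the example above; the table, columns, and option values are purely illustrative, not recommendations.

CREATE MLMODEL knn_reduced TYPE k_nearest_neighbors
ON (SELECT x1, x2, x3, y1 FROM public.my_table)
OPTIONS('k' -> '8',
        'datareduction' -> 'true',
        'targetaccuracy' -> '0.95',
        'maxreducedrows' -> '500');

With these settings, data reduction stops when the reduced model reaches 95 percent accuracy over the training data or grows to 500 rows, whichever comes first.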
Naive Bayes Classification

Model Type: naive_bayes

Naive Bayes is a classification algorithm. The input is n - 1 feature columns, and the last column is a label column. All columns can be any data type. The label column must be discrete. The feature columns can be discrete or continuous; when you use continuous feature columns, you must specify which columns are continuous (see the options).

Naive Bayes works by assuming that all features are equally important to the classification and that there is no correlation between features. With those assumptions, the algorithm computes all frequency information and saves it in three tables that are created using SQL SELECT statements.

Model Options

Optional

metrics — If you set this option to true, the model calculates the percentage of samples correctly classified by the model and saves this information in a catalog table. This option defaults to false.

continuousfeatures — If you specify this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start at 1.

loadbalance — If you set this option, the database appends the USING LOAD BALANCE SHUFFLE = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified option value (true or false). The default value is unspecified; in this case, the database does not add this clause.

queryinternalparallelism — If you set this option, the database appends the USING PARALLELISM = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.

skipdroptable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.

Execute the Model

Create a Naive Bayes model where features 1 and 3 are continuous.

CREATE MLMODEL my_model TYPE naive_bayes
ON (SELECT x1, x2, x3, y1 FROM public.my_table)
OPTIONS('continuousfeatures' -> '1,3');

After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.

When you execute the model, you specify n - 1 feature input arguments and the model returns the most likely class. The returned class is the class with the highest probability given the feature values; in other words, the class y with the highest value of P(y | x1, x2, …, xn).

SELECT my_model(col1, col2, col3) FROM my_table;

After you execute a model, you can find the details about the execution results in the sys.naive_bayes_models system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog
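The system catalog tables named above can be queried like ordinary tables. As a minimal sketch (only the table names come from this page; the available columns depend on your {{ocientml}} version, so SELECT * is used rather than assuming specific column names):

-- List all machine learning models and the options they were created with.
SELECT * FROM sys.machine_learning_models;
SELECT * FROM sys.machine_learning_model_options;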
Decision Tree

Model Type: decision_tree

The decision tree is a classification model. The first n - 1 input columns are features and can be any data type. All non-numeric features must be discrete and contain no more than the configured distinctcountlimit number of unique values; this limit is in place to prevent the internal model representation from growing too large. Numeric features are discrete by default and have the same limitation on the number of unique values, but they can be marked as continuous with the continuousfeatures option. For continuous features, the model builds the decision tree by dividing the values into two ranges instead of using discrete, unique values.

The last input column is the label and can be any data type. The label must also have no more than the distinctcountlimit number of unique values. When you create the model, specify all features first, and then specify the label as the last column in the result set.

You can use secondary indexes on discrete feature columns to greatly speed up the training of a decision tree model.

Model Options

Optional

continuousfeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start at 1. By default, the model considers no features as continuous.

distinctcountlimit — If you set this option, the value must be a positive integer. This value sets the limit for how many distinct values a non-continuous feature and the label can contain. This option defaults to 256.

doprune — If you set this option to true, the model uses pessimistic error pruning (PEP) to prune the tree after training. This option defaults to false.

featurearray — If you set this option to true, all features must be present in one array column, and all arrays in that column must have the same length. This option defaults to false.

featurearrayelements — If you set this option, the value must be a comma-separated list of the feature indexes. If you set the featurearray option to true, this list determines the elements of the arrays to use for training. By default, the model uses all elements.

maxcellstofetch — If you set this option, the value must be a positive integer. This value determines the memory threshold for switching from training with system memory to training with SQL queries in the database. In-memory training is generally faster, but is limited by the available SQL node memory. If the size of a training data subset exceeds this value, then the system performs training operations using SQL queries. This value defaults to 33,554,432 (32 * 1024 * 1024).

maxdepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of the decision tree (the maximum number of features to split on). The default is unspecified, which means there is no maximum depth.

maxrows — If you set this option, the value must be a positive integer. The decision tree trains using a subset of data with no more than the specified number of rows. If you set this option, do not set the nosnapshot option, so that the model does not select different subsets of data in different parts of the training.

maxthreads — If you set this option, the value must be a positive integer. This value indicates the maximum number of parallel threads to use while the model trains. The default value is 1.

metrics — If you set this option to true, the model also calculates the percentage of samples correctly classified by the model and saves this information in a catalog table. This option defaults to false.

nosnapshot — If you set this option to true, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training. This option defaults to false; in this case, the database creates and uses the intermediate table. Setting this option to true is useful when the training set is fixed. If the training set is a table with modifications, set this option to false, as the decision tree trainer uses different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, set this option to false because there is no guarantee that running that query twice generates the same 100 rows each time.

numsplits — If you set this option, the value must be an integer greater than 1. This value sets the maximum number of branches a continuous feature can have. The default value is 32.

rocnumsamples — If you set this option, you must also set the metrics option. This positive integer indicates the number of samples for the model to use for the area under the ROC curve. The default value is 10.

skiplimitcheck — If you set this option to true, the model skips the cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one specified by the distinctcountlimit option. This option defaults to false.

weighted — If you set this option, the model considers weights for labels. If you set this option to true, you must specify an additional column as a double in the training data for label weights; rows with the same labels must have the same weights. If you set this value to auto, the model calculates weights automatically by weighing each label according to the ratio of the count of the most frequent label to the count of the specified label; as a result, the most frequent label has a weight of 1.0, and the other label weights are higher. This option defaults to false, which means all labels have equal weight.

loadbalance — If you set this option, the database appends the USING LOAD BALANCE SHUFFLE = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified option value (true or false). The default value is unspecified; in this case, the database does not add this clause.

queryinternalparallelism — If you set this option, the database appends the USING PARALLELISM = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.

skipdroptable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.

Execute the Model

Create a decision tree model where features 1 and 3 are continuous.

CREATE MLMODEL my_model TYPE decision_tree
ON (SELECT c1, c2, c3, y1 FROM public.my_table)
OPTIONS('continuousfeatures' -> '1,3');

After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.

When you execute the model, you must specify the n - 1 features as parameters. The model returns the expected label.

SELECT my_model(col1, col2, col3) FROM my_table;

After you execute a model, you can find the details about the execution results in the sys.decision_tree_models system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog
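As a further sketch of the optional settings described above (the model, table, and column names are hypothetical, and the option values are illustrative rather than recommendations), a depth-limited, pruned tree with automatically weighted labels might be created as:

CREATE MLMODEL my_pruned_tree TYPE decision_tree
ON (SELECT c1, c2, c3, y1 FROM public.my_table)
OPTIONS('continuousfeatures' -> '2',
        'maxdepth' -> '4',
        'doprune' -> 'true',
        'weighted' -> 'auto',
        'metrics' -> 'true');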
Random Forest

Model Type: random_forest

Random forest is a classification model consisting of multiple decision trees. The result consists of the most common label among the tree results and an array of pairs of labels and their frequencies, sorted in descending order by frequency. The model breaks ties in various ways depending on the type of label. For strings, the model uses reverse lexicographic order (the reverse of the usual alphabetic order), so 'c' comes before 'a', for example. For Booleans, the model chooses true before false. For numeric types, the model chooses the largest number.

When you call the model to make classification predictions, you can optionally use soft voting by adding an extra Boolean argument to the statement, e.g., house(x, y, true). This extra argument must be a Boolean literal, either true (soft voting) or false (hard voting). If you do not specify this value, the default value is false (hard voting). In soft voting, each decision tree reports a list of possible results and their confidence factors. The random forest model adds the confidence factors from the decision trees, normalizes them to add up to 1.0, and sorts the results by descending confidence, then by descending values. In hard voting, the model does not account for confidence factors; each decision tree makes its own class prediction as a vote, and the model selects the prediction with the most votes from the trees.

Model Options

Required

numchildren — The number of child decision trees.

Optional

continuousfeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start at 1.

distinctcountlimit — If you set this option, the value must be a positive integer. This value limits how many distinct values a non-continuous feature and the label can contain. The default value is 256.

featurearray — If you set this option to true, all features must be present in one array column where all arrays are the same length. The default value is false.

fractionselected — The proportion of rows the model uses to train each decision tree. The value is a double that must be in the interval (0, 1]. You cannot set this option if you also set the rowsperchild option to a positive value. The default behavior is that the model uses all available rows.

inputsperchild — The number of features used to create each child decision tree. The default value is the number of features you specify for the forest, divided by 3 and rounded up.

maxdepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of each decision tree. The default value is 3.

maxthreads — The maximum number of parallel threads to use while the model trains decision trees. This value must be a positive integer. The default value is 16.

metrics — If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the random forest and saves this information in a catalog table. This option is always set to false for the child trees.

nosnapshot — If you set this option to true, the data source must not change. In this case, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training a random forest. Child decision trees always have this option set to true, so the database does not create a separate intermediate table for each decision tree. The default value is false. Setting this option to true is useful when the training set is fixed. If the training set is a table with modifications, set this option to false, as the decision tree trainer uses different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, set this option to false because there is no guarantee that running that query twice generates the same 100 rows each time.

requiredfeatures — A comma-separated list of integers as strings representing specific features, where the first feature has the value 1. The model uses these features in every decision tree in the forest. By default, each decision tree in the forest can train on any of the features.

rowsperchild — If you set this option to a positive integer, the number represents the number of rows (from a random sample) to use for each decision tree. If you set this option to 0, each child uses all available rows. The default value is 0. You cannot set this option to a positive value if you also set the fractionselected option.

skiplimitcheck — If you set this option to true, the model skips the cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one specified by the distinctcountlimit option. This option defaults to false.

loadbalance — If you set this option, the database appends the USING LOAD BALANCE SHUFFLE = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified option value (true or false). The default value is unspecified; in this case, the database does not add this clause.

queryinternalparallelism — If you set this option, the database appends the USING PARALLELISM = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.

skipdroptable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.

Execute the Model

Create a random forest model with four child decision trees and two features randomly chosen by the model for each tree. The training_table table contains the training data set. Collect metrics for the model execution by setting the metrics option to true.

CREATE MLMODEL test_model TYPE random_forest
ON (SELECT * FROM public.training_table)
OPTIONS('numchildren' -> '4',
        'inputsperchild' -> '2',
        'metrics' -> 'true');

After you create the model, you can see its details by querying the sys.machine_learning_models, sys.machine_learning_model_options, and sys.random_forest_models system catalog tables.

Execute this model using three columns a, b, and c from the large_table table, which contains the whole data set.

SELECT test_model(a, b, c) FROM large_table;

Optionally, you can execute the model using soft voting by including the extra Boolean argument.

SELECT test_model(a, b, c, true) FROM large_table;

The output of the model is a tuple containing the most common label as the first element. The second element is an array of tuples, where the first element is the label and the second is the frequency. The model sorts the array by frequency from highest to lowest, and the sum of the frequencies is 1. When two labels have the same frequency, the model breaks the tie according to lexicographic order. To retrieve the most common label, use [] with an index of 1, [1], for example, test_model(a,b,c)[1].

After you execute a model, you can find the results in the output of the model function execution. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog
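As a further sketch of the sampling options described above (the model name and option values are hypothetical and purely illustrative), a forest in which every tree must train on feature 1 and each tree trains on a random sample of 10,000 rows might be declared as:

CREATE MLMODEL sampled_forest TYPE random_forest
ON (SELECT * FROM public.training_table)
OPTIONS('numchildren' -> '8',
        'requiredfeatures' -> '1',
        'rowsperchild' -> '10000');

Because rowsperchild is set to a positive value here, the fractionselected option must be left unset.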
Logistic Regression

Model Type: logistic_regression

This model fits a logistic curve to the data across any number of classes greater than one. The first n - 1 inputs are features and must be numeric; features can be one-hot encoded. The last input column is the class, or label. You must have at least one non-null label in the result set for model creation.

The model best fits the logistic curve using a negative log-likelihood loss function. The model uses an algorithm that is a combination of particle swarm optimization (PSO), line search, and genetic algorithms to find the best-fit parameters. For faster, lower-quality models, try reducing the popsize, initialiterations, and subsequentiterations options. Conversely, for slower, higher-quality models, try increasing the values for these same options (see the sketch at the end of this section).

Model Options

Optional

metrics — If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in the sys.logistic_regression_models system catalog table. This option defaults to false.

popsize — If you specify this option, the value must be a positive integer. This value sets the population size for the particle swarm optimization (PSO) part of the algorithm. This option defaults to 100.

mininitparamvalue — If you specify this option, the value must be a floating-point number. This value sets the minimum for initial parameter values in the optimization algorithm. This option defaults to -10.

maxinitparamvalue — If you specify this option, the value must be a floating-point number. This value sets the maximum for initial parameter values in the optimization algorithm. This option defaults to 10.

initialiterations — If you specify this option, the value must be a positive integer. This value sets the number of PSO iterations for the first PSO pass. This option defaults to 500.

subsequentiterations — If you specify this option, the value must be a positive integer. This value sets the number of PSO iterations for subsequent iterations of the PSO algorithm. This option defaults to 100.

momentum — If you specify this option, the value must be a positive floating-point number. This parameter controls how much PSO iterations move away from the local best value to explore new territory. This option defaults to 0.1.

gravity — If you specify this option, the value must be a positive floating-point number. This parameter controls how much PSO iterations are drawn back towards the local best value. This option defaults to 0.01.

lossfuncnumsamples — If you specify this option, the value must be a positive integer. This parameter controls how many points are sampled when estimating the loss function. This option defaults to 1000.

numgaattempts — If you specify this option, the value must be a positive integer. This parameter controls how many genetic algorithm (GA) crossover possibilities the model tries. This option defaults to 10 million.

maxlinesearchiterations — If you specify this option, the value must be a positive integer. This parameter controls the maximum allowed number of iterations when the model runs the line search part of the algorithm. This option defaults to 200.

minlinesearchstepsize — If you specify this option, the value must be a positive floating-point number. This parameter controls the minimum step size the line search algorithm ever takes. This option defaults to 1e-5.

samplesperthread — If you specify this option, the value must be a positive integer. This parameter controls the target number of samples sent to each thread. Each thread independently computes its own model, and the per-thread models are combined at the end. This option defaults to 1 million.

loadbalance — If you set this option, the database appends the USING LOAD BALANCE SHUFFLE = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified option value (true or false). The default value is unspecified; in this case, the database does not add this clause.

queryinternalparallelism — If you set this option, the database appends the USING PARALLELISM = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.

skipdroptable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.

Execute the Model

Create a logistic regression model. Collect metrics for the model execution by setting the metrics option to true.

CREATE MLMODEL my_model TYPE logistic_regression
ON (SELECT x1, x2, x3, y1 FROM public.my_table)
OPTIONS('metrics' -> 'true');

After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.

When you execute this model after training, specify the features as input; the model returns the label, which can be any data type.

SELECT my_model(col1, col2, col3) FROM my_table;

After you execute a model, you can find the details about the execution results in the sys.logistic_regression_models system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog
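As a sketch of the speed/quality trade-off described above (the values are illustrative only, not tuning recommendations, and the table and column names are hypothetical), a faster, lower-quality fit might shrink the PSO population and iteration counts:

CREATE MLMODEL quick_model TYPE logistic_regression
ON (SELECT x1, x2, x3, y1 FROM public.my_table)
OPTIONS('popsize' -> '25',
        'initialiterations' -> '100',
        'subsequentiterations' -> '25',
        'metrics' -> 'true');

Raising the same three options instead trades longer training time for a higher-quality fit.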
Support Vector Machine

Model Type: support_vector_machine

A support vector machine (SVM) essentially finds a hypersurface (in two-dimensional space, a curve) that correctly splits the data into any number of classes greater than one and maximizes the margin around the hypersurface. By default, SVM finds a hyperplane to split the data (in two-dimensional space, a straight line). SVM uses a hinge loss function to balance the two objectives of finding a hyperplane with a wide margin while minimizing the number of incorrectly classified points.

The first n - 1 input columns are the features and must be numeric. The last column is the label and can be any arbitrary type. You must have at least one non-null label in the result set for model creation.

For faster, lower-quality models, reduce the popsize, initialiterations, and subsequentiterations options. Conversely, for slower, higher-quality models, increase the values for these same options.

Model Options

Optional

metrics — If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.

regularizationcoefficient — If you specify this option, the value must be a valid floating-point number. This option controls the balance between finding a wide margin and minimizing incorrectly classified points in the loss function. When this value is larger (and positive), it makes having a wide margin around the hypersurface more important relative to the incorrectly classified points. Because of how {{ocient}} implements SVM, the values for this parameter are likely different from the values used in other common SVM implementations. This option defaults to 1.0 / 1000000.0.

functionN — By default, SVM uses a linear kernel. If you use a different kernel, you must provide a list of functions that are summed together, just like with linear combination regression. You must specify the first function using a key named 'function1'; subsequent functions must use keys with subsequent values of N. You must specify functions in SQL syntax and use the variables x1, x2, …, xn to refer to the 1st, 2nd, …, nth independent variables, respectively. You can specify the default linear kernel as 'function1' -> 'x1', 'function2' -> 'x2', and so on. The model always adds a constant term equivalent to 'functionN' -> '1.0' that you do not need to specify explicitly. See the sketch at the end of this section.

popsize — If you set this option, the value must be a positive integer. This value sets the population size for the particle swarm optimization (PSO) part of the algorithm. This option defaults to 100.

mininitparamvalue — If you set this option, the value must be a floating-point number. This value sets the minimum for initial parameter values in the optimization algorithm. This option defaults to -10.

maxinitparamvalue — If you set this option, the value must be a floating-point number. This value sets the maximum for initial parameter values in the optimization algorithm. This option defaults to 10.

initialiterations — If you set this option, the value must be a positive integer. This value sets the number of PSO iterations for the first PSO pass. This option defaults to 500.

subsequentiterations — If you set this option, the value must be a positive integer. This value sets the number of PSO iterations for subsequent PSO passes after the initial one. This option defaults to 100.

momentum — If you set this option, the value must be a positive floating-point number. This parameter controls how much PSO iterations move away from the local best value to explore new territory. This option defaults to 0.1.

gravity — If you set this option, the value must be a positive floating-point number. This parameter controls how much PSO iterations are drawn back towards the local best value. This option defaults to 0.01.

lossfuncnumsamples — If you set this option, the value must be a positive integer. This parameter controls how many points the model samples when estimating the loss function. This option defaults to 1000.

numgaattempts — If you set this option, the value must be a positive integer. This parameter controls how many genetic algorithm (GA) crossover possibilities the model tries. This option defaults to 10 million.

maxlinesearchiterations — If you set this option, the value must be a positive integer. This parameter controls the maximum allowed number of iterations when running the line search part of the algorithm. This option defaults to 200.

minlinesearchstepsize — If you set this option, the value must be a positive floating-point number. This parameter controls the minimum step size the line search algorithm ever takes. This option defaults to 1e-5.

samplesperthread — If you set this option, the value must be a positive integer. This parameter controls the target number of samples the model sends to each thread. Each thread independently computes its own model, and the models are all combined at the end. This option defaults to 1 million.

loadbalance — If you set this option, the database appends the USING LOAD BALANCE SHUFFLE = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified option value (true or false). The default value is unspecified; in this case, the database does not add this clause.

queryinternalparallelism — If you set this option, the database appends the USING PARALLELISM = <value> clause to all intermediate SQL queries the model executes during training, where value is the specified positive integer value. The default value is unspecified; in this case, the database does not add this clause.

skipdroptable — If false, the database deletes any intermediate tables created during model training. If true, the database prevents the deletion of any intermediate tables created during model training. This option defaults to false.

Execute the Model

Create a support vector machine model.

CREATE MLMODEL my_model TYPE support_vector_machine
ON (SELECT c1, c2, c3, y1 FROM public.my_table);

After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.

When you execute the model, pass the n - 1 features as parameters. The model returns the expected label.

SELECT my_model(col1, col2, col3) FROM my_table;

After you execute a model, you can find the details about the execution results in the sys.support_vector_machine_models system catalog table. For details, see the description of the associated system catalog tables in the Machine Learning section in https://docs.ocient.com/system-catalog
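To illustrate the functionN option described above, the following sketch declares a quadratic kernel over two features. The model and table names are hypothetical, and the function bodies simply follow the x1, x2 variable convention documented for this option.

CREATE MLMODEL svm_quadratic TYPE support_vector_machine
ON (SELECT c1, c2, y1 FROM public.my_table)
OPTIONS('function1' -> 'x1',
        'function2' -> 'x2',
        'function3' -> 'x1 * x1',
        'function4' -> 'x2 * x2',
        'function5' -> 'x1 * x2');

The constant term is added automatically, so there is no need to supply a 'functionN' -> '1.0' entry.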
Related Links

docid\ tlugzlmibs6dstwsbeagw
docid\ pahgmmey3oefcapzzu50i
docid\ rqzgc5r0vigmehqdy3g5j