# Classification Models

Classification models involve understanding and grouping large data sets into preset categories or subpopulations. With the help of pre-classified training data sets, machine learning classification models leverage a wide range of algorithms to classify future data sets into respective and relevant categories.

Ocient-supported classification models include the following:

## Model Type: KMEANS

K-means is an unsupervised clustering algorithm. All of the columns in the input result set are features; there is no label, and all input columns must be numeric. The algorithm finds k centroid points and assigns each input point to its closest centroid. Distance calculations are Euclidean by default.

Because clusters have no labels, executing the trained model with the same number (and same order) of features as input returns an integer that identifies the cluster to which the point belongs.

k - This option must be a positive integer that specifies how many clusters the algorithm creates.

epsilon - If you specify this option, the value must be a valid positive floating point value. When the maximum distance that a centroid moves from one iteration of the algorithm to the next is less than this value, the algorithm terminates. This parameter defaults to 1e-8.
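The training loop described above can be sketched in Python. This is a conceptual illustration of k-means with the epsilon stopping rule, not Ocient's implementation; the function and variable names are hypothetical:

```python
import math
import random

def kmeans(points, k, epsilon=1e-8, max_iter=100, seed=0):
    # points: list of equal-length numeric tuples (the feature columns).
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Terminate when the maximum centroid movement is below epsilon.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < epsilon:
            break
    return centroids

def classify(centroids, p):
    # Executing the model returns the integer index of the closest cluster.
    return min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
```

Points near each other end up in the same cluster, and the returned value is only a cluster index, not a meaningful label.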

## Model Type: KNN

KNN is a classification algorithm in which the first N - 1 input columns are the features, which must be numeric. The last input column is the label, which can be any data type.

There is no training step for KNN. Instead, when you create the model, the model saves a copy of all input data to a table, so that when the model is executed in a later SQL statement, a snapshot of the data the model is supposed to use is available. You can override both the weight function and the distance function.

When you execute the model, you supply N - 1 features as input and the model returns a label. The model chooses the label of the class with the highest score. Classes are scored by summing the weights of the nearest k points in the training data.

k - This option must be a positive integer that specifies how many closest points to use for classifying a new point.

distance - If you specify this option, the value must be a function in SQL syntax for calculating the distance between a point that is used for classification and points in the training data set. This function should use the variables x1, x2, … for the 1st, 2nd, … features in the training data set, and p1, p2, … for the features in the point for classification. If you do not specify this option, the distance defaults to the Euclidean distance.

weight - If you specify this option, the value must be a function in SQL syntax for calculating the weight for a neighbor. The function should use the variable d for distance. By default, the weight is 1.0/(d+0.1), which avoids division by zero on exact matches (d = 0) while still giving closer neighbors more influence.
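The scoring procedure can be sketched in Python. This is a conceptual illustration of weighted KNN classification with the default weight function 1.0/(d+0.1) and Euclidean distance, not Ocient's implementation; the names are hypothetical:

```python
import math

def knn_predict(train, query, k, weight=lambda d: 1.0 / (d + 0.1)):
    # train: list of (features_tuple, label) rows saved at model creation.
    # query: the N - 1 feature values supplied at execution time.
    # Find the k training rows closest to the query (Euclidean distance).
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    # Score each class by summing the weights of its neighbors.
    scores = {}
    for feats, label in neighbors:
        d = math.dist(feats, query)
        scores[label] = scores.get(label, 0.0) + weight(d)
    # Return the class with the highest total score.
    return max(scores, key=scores.get)
```

Because there is no training step, the whole cost is paid at execution time: every prediction scans the saved copy of the training data.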

## Model Type: NAIVE BAYES

Naive Bayes is a classification algorithm. The input is N - 1 feature columns, and the last column is a label column. All columns can be any data type. The label column must be discrete. The feature columns can be discrete or continuous. When you use continuous feature columns, you must specify which columns are continuous (see options).

Naive Bayes works by assuming that all features are equally important in the classification and that there is no correlation between features. With those assumptions, the algorithm computes all frequency information and saves it in three tables that you can query using SQL SELECT statements.

When you execute the model, you specify N - 1 feature input arguments and the model returns the most likely class. The returned class is based on computing the class with the highest probability, given prior knowledge of the feature values. In other words, the class y has the highest value of P(y | x1, x2, …, xn).

metrics - If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.

continuousFeatures - If you specify this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start at 1.
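The frequency counting and class selection can be sketched in Python for discrete features. This is a conceptual illustration, not Ocient's implementation; the Laplace smoothing shown is an assumption (the options above do not specify how zero counts are handled), and all names are hypothetical:

```python
from collections import Counter, defaultdict

def train_nb(rows):
    # rows: list of (features_tuple, label). Discrete features only.
    # Store the frequency information the model needs: class counts and
    # per-(feature index, class) value counts.
    class_counts = Counter(label for _, label in rows)
    feat_counts = defaultdict(Counter)
    for feats, label in rows:
        for i, value in enumerate(feats):
            feat_counts[(i, label)][value] += 1
    return class_counts, feat_counts, len(rows)

def predict_nb(model, feats):
    class_counts, feat_counts, n = model
    best, best_p = None, -1.0
    for y, cy in class_counts.items():
        p = cy / n  # prior P(y)
        for i, value in enumerate(feats):
            counts = feat_counts[(i, y)]
            # P(x_i | y); +1 Laplace smoothing is an illustrative choice
            # to avoid zeroing out classes on unseen feature values.
            p *= (counts[value] + 1) / (cy + len(counts) + 1)
        if p > best_p:
            best, best_p = y, p
    # Return the class y maximizing P(y | x1, ..., xn), which under the
    # independence assumption is proportional to P(y) * prod_i P(x_i | y).
    return best
```

Continuous features would instead be modeled with a per-class distribution (commonly Gaussian), which is why continuous columns must be declared with the continuousFeatures option.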

## Model Type: DECISION TREE

Decision tree is a classification model. The first N - 1 input columns are features and can be any data type. The last input column is the label, which can also be any data type but must contain no more than the configured distinctCountLimit number of unique values.

All non-numeric features must be discrete and must also contain no more than distinctCountLimit unique values. This limit is in place to prevent the internal model representation from growing too large. Numeric features are discrete by default and have the same limitation on the number of unique values, but they can be marked as continuous with the continuousFeatures option. For continuous features, the model builds the decision tree by dividing the values into two ranges instead of using discrete, unique values.

When you create the model, you specify all features first, and then you specify the label as the last column in the result set.

You can use secondary indexes on discrete feature columns to greatly speed up training of a decision tree model.

When you execute the model, you must specify the N - 1 features as parameters. The model returns the expected label.

metrics - If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.

continuousFeatures - If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start at 1.

distinctCountLimit - If you set this option, the value must be a positive integer. This value sets the limit for how many distinct values a non-continuous feature and the label can contain. This option defaults to 256.
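The tree-building process for discrete features can be sketched in Python. This is a plain ID3-style illustration (splitting on the feature with the highest information gain), not Ocient's implementation; the two-range splitting used for continuous features is omitted, and all names are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, features):
    # rows: list of (features_tuple, label); features: usable feature indexes.
    labels = [label for _, label in rows]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label

    def gain(i):
        # Information gain of splitting on the discrete values of feature i.
        parts = {}
        for feats, label in rows:
            parts.setdefault(feats[i], []).append(label)
        return entropy(labels) - sum(
            len(p) / len(rows) * entropy(p) for p in parts.values())

    best = max(features, key=gain)
    node = {"feature": best, "children": {},
            "default": Counter(labels).most_common(1)[0][0]}
    rest = [f for f in features if f != best]
    # One child subtree per distinct value of the chosen feature; this is
    # why discrete columns are capped at distinctCountLimit unique values.
    for value in {feats[best] for feats, _ in rows}:
        subset = [r for r in rows if r[0][best] == value]
        node["children"][value] = build_tree(subset, rest)
    return node

def predict(node, feats):
    # Walk from the root to a leaf, falling back to the node's majority
    # label when a feature value was never seen in training.
    while isinstance(node, dict):
        node = node["children"].get(feats[node["feature"]], node["default"])
    return node
```

The per-value fan-out above also illustrates why the model benefits from secondary indexes on discrete feature columns: training repeatedly groups rows by each candidate feature's values.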