Classification Models
OcientML supports classification models, which group large data sets into preset categories or subpopulations. Using pre-classified training data sets, machine learning classification models leverage a wide range of algorithms to classify future data sets into the relevant categories.
OcientML classification includes these models.
Model Type: K NEAREST NEIGHBORS
K-nearest neighbors (KNN) is a classification algorithm. The first N - 1 input columns are the features, which must be numeric. The last input column is the label, which can be any data type.
There is no training step for KNN. Instead, when you create the model, the model saves a copy of all input data to a table, so that when the model is executed in a later SQL statement, a snapshot of the data the model is supposed to use is available. You can override both the weight function and the distance function.
When you execute the model, you pass the N - 1 features as input and the model returns a label. The model chooses the label from the class with the highest score, where each class is scored by summing the weights of its members among the nearest k points in the training data.
k - This option must be a positive integer that specifies how many closest points to use for classifying a new point.
distance - If you specify this option, the value must be a function in SQL syntax for calculating the distance between the point being classified and points in the training data set. This function should use the variables x1, x2, … for the 1st, 2nd, … features in the training data set, and p1, p2, … for the features of the point being classified. If you do not specify this option, the distance defaults to the Euclidean distance.
weight - If you specify this option, the value must be a function in SQL syntax for calculating the weight of a neighbor. The function should use the variable d for the distance. By default, the weight is 1.0/(d+0.1), which avoids division by zero on exact matches while still giving closer neighbors more influence.
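The scoring procedure described above can be sketched in Python. This is an illustrative sketch of the algorithm, not OcientML SQL syntax; the function name `knn_classify` and the list-of-tuples data layout are hypothetical, but the default Euclidean distance and default weight 1.0/(d + 0.1) match the descriptions above.

```python
import math
from collections import defaultdict

def knn_classify(training, point, k):
    """Classify `point` using the k nearest rows of `training`.

    `training` is a list of (features, label) pairs; `point` is a tuple
    of the N - 1 numeric features. Uses the default Euclidean distance
    and the default weight function 1.0 / (d + 0.1)."""
    # Distance from `point` to every training row (default: Euclidean).
    scored = []
    for features, label in training:
        d = math.sqrt(sum((x - p) ** 2 for x, p in zip(features, point)))
        scored.append((d, label))
    scored.sort(key=lambda t: t[0])

    # Sum the weights of the k nearest points, grouped by class.
    scores = defaultdict(float)
    for d, label in scored[:k]:
        scores[label] += 1.0 / (d + 0.1)   # default weight function

    # Return the class with the highest total weight.
    return max(scores, key=scores.get)
```

Note how a k larger than the size of one cluster still classifies correctly: distant points contribute only small weights, so nearby neighbors dominate the score.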
Model Type: NAIVE BAYES
Naive Bayes is a classification algorithm. The input is N - 1 feature columns, and the last column is a label column. All columns can be any data type, but the label column must be discrete. The feature columns can be discrete or continuous. When you use continuous feature columns, you must specify which columns are continuous (see the continuousFeatures option).
Naive Bayes works by assuming that all features are equally important in the classification and that there is no correlation between features. With those assumptions, the algorithm computes all frequency information and saves it in three tables that you create using SQL SELECT statements.
When you execute the model, you specify N - 1 feature input arguments and the model returns the most likely class. The returned class is based on computing the class with the highest probability, given prior knowledge of the feature values. In other words, the class y has the highest value of P(y | x1, x2, …, xn).
metrics - If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.
continuousFeatures - If you specify this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start at 1.
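The frequency-based classification described above can be sketched in Python for discrete features only (continuous features, which OcientML supports via the continuousFeatures option, are omitted for brevity). This is an illustrative sketch, not OcientML syntax; the function names and data layout are hypothetical, and the three catalog tables are represented here as in-memory counters.

```python
from collections import Counter, defaultdict

def naive_bayes_train(rows):
    """Collect the frequency information: overall class counts and,
    for each (feature index, class) pair, counts of each feature value."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)   # (i, y) -> Counter of values
    for *features, y in rows:
        class_counts[y] += 1
        for i, x in enumerate(features):
            feature_counts[(i, y)][x] += 1
    return class_counts, feature_counts

def naive_bayes_classify(model, point):
    """Return the class y maximizing P(y) * prod_i P(x_i | y),
    i.e. the class with the highest P(y | x1, ..., xn) under the
    feature-independence assumption."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for y, n in class_counts.items():
        p = n / total                       # prior P(y)
        for i, x in enumerate(point):
            p *= feature_counts[(i, y)][x] / n   # conditional P(x_i | y)
        if p > best_p:
            best, best_p = y, p
    return best
```

Because the features are assumed independent, the joint probability factors into per-feature frequencies, which is why simple count tables are all the model needs to store.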
Model Type: DECISION TREE
Decision tree is a classification model. The first N - 1 input columns are features and can be any data type. All non-numeric features must be discrete and must contain no more than the configured distinctCountLimit number of unique values. This limit is in place to prevent the internal model representation from growing too large. Numeric features are discrete by default and have the same limitation on the number of unique values, but they can be marked as continuous with the continuousFeatures option. For continuous features, the model builds the decision tree by dividing the values into two ranges instead of using discrete, unique values. The last input column is the label and can be any data type. The label must also have no more than the distinctCountLimit number of unique values.
When you create the model, you specify all features first, and then you specify the label as the last column in the result set.
You can use secondary indexes on discrete feature columns to greatly speed up training of a decision tree model.
When you execute the model, you must specify the N - 1 features as parameters. The model returns the expected label.
metrics - If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.
continuousFeatures - If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1.
distinctCountLimit - If you set this option, the value must be a positive integer. This value sets the limit for how many distinct values a non-continuous feature and the label can contain. This option defaults to 256.
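The difference between discrete and continuous splits described above can be sketched in Python. The source does not state which split-quality criterion OcientML uses, so weighted child entropy is assumed here purely for illustration; the function names are hypothetical. The key point is structural: a discrete feature gets one branch per unique value, while a continuous feature is divided into two ranges at a threshold.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of labels (0.0 when all labels agree)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def candidate_splits(values, labels, continuous):
    """Yield (split description, weighted child entropy) for one feature.

    Discrete features branch on each unique value; continuous features
    are divided into two ranges at a candidate threshold."""
    n = len(values)
    if continuous:
        # Two ranges: values <= t and values > t, for each candidate t.
        for t in sorted(set(values))[:-1]:
            left = [y for v, y in zip(values, labels) if v <= t]
            right = [y for v, y in zip(values, labels) if v > t]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            yield f"<= {t}", score
    else:
        # One branch per unique value; distinctCountLimit bounds how
        # many such branches a non-continuous feature may produce.
        groups = {}
        for v, y in zip(values, labels):
            groups.setdefault(v, []).append(y)
        score = sum(len(g) * entropy(g) for g in groups.values()) / n
        yield "one branch per unique value", score
```

A score of 0.0 means the split separates the classes perfectly; tree construction repeats this choice recursively on each branch.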
Model Type: LOGISTIC REGRESSION
Logistic regression is a binary classification algorithm. The model fits a logistic curve to the data such that when the curve value is greater than 0.5, the result is one class, and when the value is less than 0.5, the result is the other class.
The first N - 1 inputs are features and must be numeric. Features can be one-hot encoded. The last input column is the class or label. There must be exactly two non-NULL labels in the result set used to create the model. The model best fits the logistic curve using a negative log likelihood loss function. The model uses an algorithm that is a combination of particle swarm optimization, line search, and genetic algorithms to find the best fit parameters.
For faster, lower quality models, try reducing the popSize, initialIterations, and subsequentIterations options. Conversely, for slower, higher quality models, try increasing the values for these same options.
When you execute this model after training, you supply the N - 1 features as input, and the model returns the label. The label can be any data type.
metrics - If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.
popSize - If you specify this option, the value must be a positive integer. This value sets the population size for the particle swarm optimization (PSO) part of the algorithm. This option defaults to 100.
minInitParamValue - If you specify this option, the value must be a floating point number. This value sets the minimum for initial parameter values in the optimization algorithm. This option defaults to -10.
maxInitParamValue - If you specify this option, the value must be a floating point number. This value sets the maximum for initial parameter values in the optimization algorithm. This option defaults to 10.
initialIterations - If you specify this option, the value must be a positive integer. This value sets the number of PSO iterations for the first PSO pass. This option defaults to 500.
subsequentIterations - If you specify this option, the value must be a positive integer. This value sets the number of PSO iterations for subsequent iterations of the PSO algorithm. This option defaults to 100.
momentum - If you specify this option, the value must be a positive floating point number. This parameter controls how much PSO iterations move away from the local best value to explore new territory. This option defaults to 0.1.
gravity - If you specify this option, the value must be a positive floating point number. This parameter controls how much PSO iterations are drawn back towards the local best value. This option defaults to 0.01.
lossFuncNumSamples - If you specify this option, the value must be a positive integer. This parameter controls how many points are sampled when estimating the loss function. This option defaults to 1000.
numGAAttempts - If you specify this option, the value must be a positive integer. This parameter controls how many GA crossover possibilities the model tries. This option defaults to 10 million.
maxLineSearchIterations - If you specify this option, the value must be a positive integer. This parameter controls the maximum allowed number of iterations when the model runs the line search part of the algorithm. This option defaults to 200.
minLineSearchStepSize - If you specify this option, the value must be a positive floating point number. This parameter controls the minimum step size that the line search algorithm ever takes. This option defaults to 1e-5.
samplesPerThread - If you specify this option, the value must be a positive integer. This parameter controls the target number of samples that are sent to each thread. Each thread independently computes a logistic regression model, and the threads are all combined at the end. This option defaults to 1 million.
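The fitted curve and the loss it minimizes can be sketched in Python. This is an illustrative sketch of the mathematics described above, not OcientML syntax and not the PSO/line-search/genetic optimizer itself; the function names, the parameter layout (weights plus intercept), and the placeholder class labels are hypothetical.

```python
import math

def sigmoid(z):
    """The logistic curve, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(params, features, labels=("class0", "class1")):
    """Apply the fitted curve: values above 0.5 map to one class,
    values below 0.5 to the other."""
    w, b = params
    z = sum(wi * xi for wi, xi in zip(w, features)) + b
    return labels[1] if sigmoid(z) > 0.5 else labels[0]

def negative_log_likelihood(params, rows):
    """The loss minimized during training. `rows` holds (features, y)
    pairs with y in {0, 1}; lower loss means a better fit."""
    w, b = params
    nll = 0.0
    for features, y in rows:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + b)
        nll -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return nll
```

The optimizer's job (PSO, line search, and genetic crossover in OcientML's case) is simply to find the parameters that minimize this loss over the training rows.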
Model Type: SUPPORT VECTOR MACHINE
Support Vector Machine (SVM) is a binary classification algorithm. SVM essentially finds a hypersurface (the hypersurface is a curve in 2-dimensional space) that correctly splits the data into two classes and maximizes the margin around the hypersurface. By default, SVM finds a hyperplane to split the data (the hyperplane is a straight line in 2-dimensional space). SVM uses a hinge loss function to balance the two objectives of finding a hyperplane with a wide margin while minimizing the number of incorrectly classified points.
The first N - 1 input columns are the features and must be numeric. The last column is the label and can be any arbitrary type.
For faster, lower quality models, reduce the popSize, initialIterations, and subsequentIterations options. Conversely, for slower, higher quality models, increase the values for these same options.
When you execute the model, the N - 1 features must be passed as parameters. The model returns the expected label.
metrics - If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.
regularizationCoefficient - If you specify this option, the value must be a valid floating point number. This option controls the balance between finding a wide margin and minimizing incorrectly classified points in the loss function. When this value is larger (and positive), a wide margin around the hypersurface becomes more important relative to the incorrectly classified points. Because of how OcientML implements SVM, the values for this parameter are likely different from values used in other common SVM implementations. This option defaults to 1.0 / 1000000.0.
functionN - By default, SVM uses a linear kernel. If you use a different kernel, you must provide a list of functions that are summed together, just like with linear combination regression. You must specify the first function using a key named 'function1'. Subsequent functions must use keys with names that use subsequent values of N. You must specify functions in SQL syntax, and should use the variables x1, x2, …, xn to refer to the 1st, 2nd, and nth independent variables respectively. You can specify the default linear kernel as: 'function1' → 'x1', 'function2' → 'x2', and so on. The model always adds a constant term equivalent to 'functionN' → '1.0' that you do not need to specify explicitly.
popSize - If you set this option, the value must be a positive integer. This value sets the population size for the particle swarm optimization (PSO) part of the algorithm. This option defaults to 100.
minInitParamValue - If you set this option, the value must be a floating point number. Sets the minimum for initial parameter values in the optimization algorithm. This option defaults to -10.
maxInitParamValue - If you set this option, the value must be a floating point number. Sets the maximum for initial parameter values in the optimization algorithm. This option defaults to 10.
initialIterations - If you set this option, the value must be a positive integer. Sets the number of PSO iterations for the first PSO pass. This option defaults to 500.
subsequentIterations - If you set this option, the value must be a positive integer. Sets the number of PSO iterations for subsequent PSO iterations after the initial pass. This option defaults to 100.
momentum - If you set this option, the value must be a positive floating point number. This parameter controls how much PSO iterations move away from the local best value to explore new territory. This option defaults to 0.1.
gravity - If you set this option, the value must be a positive floating point number. This parameter controls how much PSO iterations are drawn back towards the local best value. This option defaults to 0.01.
lossFuncNumSamples - If you set this option, the value must be a positive integer. This parameter controls how many points the model samples when estimating the loss function. This option defaults to 1000.
numGAAttempts - If you set this option, the value must be a positive integer. This parameter controls how many GA crossover possibilities the model tries. This option defaults to 10 million.
maxLineSearchIterations - If you set this option, the value must be a positive integer. This parameter controls the maximum allowed number of iterations when running the line search part of the algorithm. This option defaults to 200.
minLineSearchStepSize - If you set this option, the value must be a positive floating point number. This parameter controls the minimum step size that the line search algorithm ever takes. This option defaults to 1e-5.
samplesPerThread - If you set this option, the value must be a positive integer. This parameter controls the target number of samples that the model sends to each thread. Each thread independently computes an SVM model, and the models are all combined at the end. This option defaults to 1 million.
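The hinge loss that balances the two objectives above can be sketched in Python for a linear kernel. This is an illustrative sketch of the standard formulation, not OcientML syntax; as noted above, OcientML's regularizationCoefficient values likely differ from other implementations, so the exact weighting here is an assumption, and the function names and placeholder labels are hypothetical.

```python
def svm_hinge_loss(w, b, rows, reg_coefficient=1.0 / 1000000.0):
    """Hinge loss for a linear SVM. `rows` holds (features, y) pairs
    with y in {-1, +1}.

    The ||w||^2 term favors a wide margin around the hyperplane; the
    hinge term penalizes points on the wrong side of (or inside) the
    margin. reg_coefficient balances the two objectives."""
    hinge = 0.0
    for features, y in rows:
        score = sum(wi * xi for wi, xi in zip(w, features)) + b
        hinge += max(0.0, 1.0 - y * score)   # zero for well-separated points
    margin_term = sum(wi * wi for wi in w)
    return reg_coefficient * margin_term + hinge

def svm_predict(w, b, features, labels=("neg", "pos")):
    """Classify by which side of the hyperplane the point falls on."""
    score = sum(wi * xi for wi, xi in zip(w, features)) + b
    return labels[1] if score >= 0 else labels[0]
```

When every training point sits outside the margin, the hinge term vanishes and only the margin-width term remains, which is why a larger regularizationCoefficient pushes the optimizer toward wider margins.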
Other Models