Classification Models
supports classification models that involve understanding and grouping large data sets into preset categories or subpopulations. With the help of pre-classified training data sets, machine learning classification models leverage various algorithms to classify future data sets into respective and relevant categories.
To create the model, use the CREATE MLMODEL syntax. For details, see CREATE MLMODEL.
Model option names are case-sensitive.
Model Type: K NEAREST NEIGHBORS
K-nearest neighbors (KNN) is a classification algorithm, where the first N - 1 inputs are the features, which must be numeric. The last input column is a label, which can be any data type.
There is no training step for KNN. Instead, when you create the model, the model saves a copy of all input data to a table, so that when the model is executed in a later SQL statement, a snapshot of the data the model is supposed to use is available. You can override both the weight function and the distance function.
k — This option must be a positive integer that specifies how many closest points to use for classifying a new point.
dataReduction — If you specify this option to true, you enable the data reduction feature that finds a smaller subset of rows to build the KNN model over, enough to ensure that the accuracy of the model is still high. By default, the execution stops when the database finds a model that is 90 percent accurate over the training data or when the reduced model increases to 1,000 rows of data. You can override these defaults by setting the maxReducedRows or targetAccuracy options.
distance — If you specify this option, the value must be a function in SQL syntax for calculating the distance between a point used for classification and points in the training data set. This function should use the variables x1, x2, … for the 1st, 2nd, … features in the training data set, and p1, p2, … for the features in the point for classification. If you do not specify this option, the option defaults to the Euclidian distance.
maxReducedRows — If you specify this option, then you must set the dataReduction option to true. Specify the maximum number of rows for data reduction as a positive integer. The default value is 1,000 rows of data.
metrics — If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. The default value is false.
normalize — If you set this option to true, the model normalizes the data before the start of training. Normalization means to shift the mean to 0, and scale the data so the standard deviation is 1.
targetAccuracy — If you specify this option, then you must set the dataReduction option to true. Specify the model accuracy over the training data with a double value in the range (0,1]. The default value is 90 percent accuracy, or 0.90.
weight — If you specify this option, the value must be a function in SQL syntax for calculating the weight of a neighbor. The function should use the variable d for distance. By default, the distance is set to 1.0/(d+0.1), thus avoiding division by zero on exact inputs and still allowing neighbors to have some influence.
Create a KNN model with 8 closest points for classification and distance function power(x1 - p1, 2) + power(x2 - p2, 2) + power(x3 - p3, 2).
After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model, the model executes with N - 1 features as input and returns a label. The model chooses the label from the class with the highest score. The model scores classes by summing the weights from the nearest k points in the training data.
After you execute a model, you can find the details about the execution results in the sys.k_nearest_neighbors_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in the System Catalog.
Model Type: NAIVE BAYES
Naive Bayes is a classification algorithm. The input is N - 1 feature columns, and the last column is a label column. All columns can be any data type. The label column must be discrete. The feature columns can be discrete or continuous. When you use continuous feature columns, you must specify which columns are continuous (see options).
Naive Bayes works by assuming that all features are equally important in the classification and that there is no correlation between features. With those assumptions, the algorithm computes all frequency information and saves it in three tables you create using SQL SELECT statements.
metrics — If you set this option to true, the model calculates the percentage of samples correctly classified by the model and saves this information in a catalog table. This option defaults to false.
continuousFeatures — If you specify this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start at 1.
Create a Naive Bayes model with feature indexes 1,3.
After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model, you specify N - 1 feature input arguments and the model returns the most likely class. The returned class is based on computing the class with the highest probability, given prior knowledge of the feature values. In other words, the class y has the highest value of P(y | x1, x2, …, xn).
After you execute a model, you can find the details about the execution results in the sys.naive_bayes_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Model Type: DECISION TREE
The decision tree is a classification model. The first N - 1 input columns are features and can be any data type. All non-numeric features must be discrete and contain no more than the configured distinctCountLimit number of unique values. This limit is in place to prevent the internal model representation from growing too large. Numeric features are discrete by default and have the same limitation on the number of unique values, but they can be marked as continuous with the continuousFeatures option. For continuous features, the model builds the decision tree by dividing the values into two ranges instead of using discrete, unique values. The last input column is the label and can be any data type. The label must also have no more than the distinctCountLimit number of unique values.
When you create the model, you specify all features first, and then specify the label as the last column in the result set.
You can use secondary indexes on discrete feature columns to greatly speed up the training of a decision tree model.
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1. In the default state, the model considers no features as continuous.
distinctCountLimit — If you set this option, the value must be a positive integer. This value sets the limit for how many distinct values a non-continuous feature and the label can contain. This option defaults to 256.
doPrune — If you set this option to true, the model uses Pessimistic Error Pruning (PEP) to prune the tree after training. This option defaults to false.
featureArray — If you set this option to true, all features must be present in the one array column and all arrays in that column have the same length. This option defaults to false.
featureArrayElements — If you set this option, the value must be a comma-separated list of the feature indexes. If you set the featureArray option to true, this list determines the elements of the arrays to use for training. By default, the model uses all elements.
maxDepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of the decision tree (the maximum number of features to split on). The default is unspecified, which means there is no maximum depth.
metrics — If you set this option to true, the model also calculates the percentage of samples correctly classified by the model and saves this information in a catalog table. This option defaults to false.
noSnapshot — If you set this option to true, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training. This option defaults to false, and the database creates and uses the intermediate table. Setting this option to true is useful when the training set is fixed. If the training set is a table with modifications, set this option to false as the decision tree trainer uses different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, then set this option to false because there is no guarantee that running that query twice generates the same 100 rows each time.
numSplits — If you set this option, the value must be an integer greater than 1. This value sets the maximum number of branches a continuous feature can have. The default value is 32.
ROCNumSamples — If you set the option, you must also set the metrics option. This positive integer indicates the number of samples for the model to use for the area under the ROC curve. The default value is 10.
skipLimitCheck — If you set this option to true, the model skips cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one that is specified by the distinctCountLimit option. This option defaults to false.
weighted — If you set this option, the model considers weights for labels. If you set this option value to true, you must specify an additional column as a double in the training data for label weights. Rows with the same labels must have the same weights. If you set this value to auto, the model calculates weights automatically by weighing each label according to the ratio of the count of the most frequent label to the count of the specified label. As a result, the most frequent label has a weight of 1.0, and the other label weights are higher. This option defaults to false, which means all labels have equal weight.
Create a decision tree model with feature indexes 1,3.
After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model, you must specify the N - 1 features as parameters. The model returns the expected label.
After you execute a model, you can find the details about the execution results in the sys.decision_tree_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Model Type: RANDOM FOREST
The random forest is a classification model. The model consists of decision trees and returns the most common label among the results from the trees. The result of the model consists of the most common label and an array of pairs of labels and their frequencies, sorted in descending order by frequency.
The model breaks ties in various ways depending on the type of label. For strings, the model uses lexicographic order (reverse of the usual alphabetic order), so C comes before A, for example. For Booleans, the model chooses true before false. For numeric types, the model chooses the largest number.
numChildren — Number of child decision trees.
continuousFeatures — If you set this option, the value must be a comma-separated list of the feature indexes that are continuous numeric variables. Indexes start with 1.
distinctCountLimit — If you set this option, the value must be a positive integer. This value limits how many distinct values a non-continuous feature and the label can contain. The default value is 256.
featureArray — If you set this option to true, all features must be present in one array column where all arrays are the same length. The default value is false.
fractionSelected — The proportion of rows the model uses to train each decision tree. The value is a double that must be in the interval (0, 1]. You cannot set this option if you also set the rowsPerChild option to a positive value. The default behavior is that the model uses all available rows.
inputsPerChild — Number of features used to create each child decision tree. The default value is the number of features you specify for the forest divided by 3 and rounded up.
maxDepth — If you set this option, the value must be a positive integer. This value sets the maximum allowable depth of the decision tree. The default value is 3.
maxThreads — The maximum number of parallel threads to use while the model trains decision trees. This value must be a positive integer. The default value is 16.
metrics — If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the model for the random forest and saves this information in a catalog table. This option is always set to false for the child trees.
noSnapshot — If you set this option to true, the data source must not change. In this case, the database does not create an intermediate table that stores the result of the specified SQL statement, which the model uses for training a random forest. Child decision trees always have this option set to true, so the database does not create a separate intermediate table for each decision tree. The default value is false. Setting this option to true is useful when the training set is fixed. If the training set is a table with modifications, set this option to false as the decision tree trainer uses different data sets in different parts of the tree. Likewise, if the training set consists of a query that returns 100 rows, then set this option to false because there is no guarantee that running that query twice generates the same 100 rows each time.
requiredFeatures — A comma-separated list of integers as strings representing specific features where the first feature has the value 1. The model uses these features in every decision tree in the forest. The default behavior is that the decision tree in the forest can train on any feature that is in the list.
rowsPerChild — If you set this option to a positive integer, the number represents the number of rows (from a random sample) to use for each decision tree. If you set this option to 0, each child uses all available rows. The default value is 0. You cannot set this option to a positive value if you also set the fractionSelected option.
skipLimitCheck — If you set this option to true, the model skips cardinality checks that throw errors when columns have too many values. The limit that this option checks is the same one that is specified by the distinctCountLimit option. This option defaults to false.
Create a random forest model with four child decision trees and two features randomly chosen by the model for each tree. The training_table table contains the training data set. Collect metrics for the model execution by setting the metrics option to true.
After you create the model, you can see its details by querying the sys.machine_learning_models, sys.machine_learning_model_options, and sys.random_forest_models system catalog tables.
Execute this model using three columns a, b, and c, from large_table table that contains the whole data set.
The output of the model is a tuple containing the most common label as the first element. The second element is an array of tuples, where the first element is the label and the second is the frequency. The array is sorted by frequency from highest to lowest, where the sum of frequencies is 1. When two labels have the same frequency, the model breaks the tie according to lexicographic order.
To retrieve the most common label, use [] with an index of 1, [1], for example, test_model(a,b,c)[1].
After you execute a model, you can find the results in the output of the model function execution.
For details, see the description of the associated system catalog tables in the Machine Learning section in System Catalog.
Model Type: LOGISTIC REGRESSION
This model fits a logistic curve to the data across any number of classes greater than one.
The first N - 1 inputs are features and must be numeric. Features can be one-hot encoded. The last input column is the class or label. You must have at least one non-NULL label in the result set for model creation. The model best fits the logistic curve using a negative log likelihood loss function. The model uses an algorithm that is a combination of particle swarm optimization, line search, and genetic algorithms to find the best-fit parameters.
For faster, lower-quality models, try reducing the popSize, initialIterations, and subsequentIterations options. Conversely, for slower, higher-quality models, try increasing the values for these same options.
metrics — If you set this option to true, the model calculates the percentage of samples that are correctly classified by the model and saves this information in the sys.logistic_regression_models system catalog table. This option defaults to false.
popSize — If you specify this option, the value must be a positive integer. This value sets the population size for the particle swarm optimization (PSO) part of the algorithm. This option defaults to 100.
minInitParamValue — If you specify this option, the value must be a floating point number. This value sets the minimum for initial parameter values in the optimization algorithm. This option defaults to -10.
maxInitParamValue — If you specify this option, the value must be a floating point number. This value sets the maximum for initial parameter values in the optimization algorithm. This option defaults to 10.
initialIterations — If you specify this option, the value must be a positive integer. This value sets the number of PSO iterations for the first PSO pass. This option defaults to 500.
subsequentIterations — If you specify this option, the value must be a positive integer. This value sets the number of PSO iterations for subsequent iterations of the PSO algorithm. This option defaults to 100.
momentum — If you specify this option, the value must be a positive floating point. This parameter controls how much PSO iterations move away from the local best value to explore new territory. This option defaults to 0.1.
gravity — If you specify this option, the value must be a positive floating point. This parameter controls how much PSO iterations are drawn back towards the local best value. This option defaults to 0.01.
lossFuncNumSamples — If you specify this option, the value must be a positive integer. This parameter controls how many points are sampled when estimating the loss function. This option defaults to 1000.
numGAAttempts — If you specify this option, the value must be a positive integer. This parameter controls how many GA crossover possibilities the model tries. This option defaults to 10 million.
maxLineSearchIterations — If you specify this option, the value must be a positive integer. This parameter controls the maximum allowed number of iterations when the model runs the line search part of the algorithm. This option defaults to 200.
minLineSearchStepSize — If you specify this option, the value must be a positive floating point. This parameter controls the minimum step size the line search algorithm ever takes. This option defaults to 1e-5.
samplesPerThread — If you specify this option, the value must be a positive integer. This parameter controls the target number of samples sent to each thread. Each thread independently computes a logistic regression model, and the threads are all combined at the end. This option defaults to 1 million.
Create a logistic regression model. Collect metrics for the model execution by setting the metrics option to true.
After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute this model after training, you must specify the features as input and the label as the output. The label can be any data type.
After you execute a model, you can find the details about the execution results in the sys.logistic_regression_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in the System Catalog.
Model Type: SUPPORT VECTOR MACHINE
Support Vector Machine (SVM) essentially finds a hypersurface (the hypersurface is a curve in 2-dimensional space) that correctly splits the data into any number of classes greater than one and maximizes the margin around the hypersurface. By default, SVM finds a hyperplane to split the data (the hyperplane is a straight line in 2-dimensional space). SVM uses a hinge loss function to balance the two objectives of finding a hyperplane with a wide margin while minimizing the number of incorrectly classified points.
The first N - 1 input columns are the features and must be numeric. The last column is the label and can be any arbitrary type. You must have at least one non-NULL label in the result set for model creation.
For faster, lower-quality models, reduce the popSize, initialIterations, and subsequentIterations options. Conversely, for slower, higher-quality models, increase the values for these same options.
metrics — If you set this option to true, the model also calculates the percentage of samples that are correctly classified by the model and saves this information in a catalog table. This option defaults to false.
regularizationCoefficient — If you specify this option, the value must be a valid floating point number. This option is used to control the balance of finding a wide margin and minimizing incorrectly classified points in the loss function. When this value is larger (and positive) it makes having a wide margin around the hypersurface more important relative to the incorrectly classified points. Because of how implements SVM, the values for this parameter are likely different than values used in other common SVM implementations. This option defaults to 1.0 / 1000000.0.
functionN — By default, SVM uses a linear kernel. If you use a different kernel, you must provide a list of functions that are summed together, just like with linear combination regression. You must specify the first function using a key named 'function1'. Subsequent functions must use keys with names that use subsequent values of N. You must specify functions in SQL syntax and use the variables x1, x2, …, xn to refer to the 1st, 2nd, and nth independent variables, respectively. You can specify the default linear kernel as: 'function1' → 'x1', 'function2' → 'x2', and so on. The model always adds a constant term equivalent to 'functionN' → '1.0' that you do not need to specify explicitly.
popSize — If you set this option, the value must be a positive integer. This value sets the population size for the particle swarm optimization (PSO) part of the algorithm. This option defaults to 100.
minInitParamValue — If you set this option, the value must be a floating point number. Sets the minimum for initial parameter values in the optimization algorithm. This option defaults to -10.
maxInitParamValue — If you set this option, the value must be a floating point number. Sets the maximum for initial parameter values in the optimization algorithm. This option defaults to 10.
initialIterations — If you set this option, the value must be a positive integer. Sets the number of PSO iterations for the first PSO pass. This option defaults to 500.
subsequentIterations — If you set this option, the value must be a positive integer. Sets the number of PSO iterations for subsequent PSO iterations after the initial pass. This option defaults to 100.
momentum — If you set this option, the value must be a positive floating point number. This parameter controls how much PSO iterations move away from the local best value to explore new territory. This option defaults to 0.1.
gravity — If you set this option, the value must be a positive floating point number. This parameter controls how much PSO iterations are drawn back towards the local best value. This option defaults to 0.01.
lossFuncNumSamples — If you set this option, the value must be a positive integer. This parameter controls how many points the model samples when estimating the loss function. This option defaults to 1000.
numGAAttempts — If you set this option, the value must be a positive integer. This parameter controls how many GA crossover possibilities the model tries. This option defaults to 10 million.
maxLineSearchIterations — If you set this option, the value must be a positive integer. This parameter controls the maximum allowed number of iterations when running the line search part of the algorithm. This option defaults to 200.
minLineSearchStepSize — If you set this option, the value must be a positive floating point number. This parameter controls the minimum step size the line search algorithm ever takes. This option defaults to 1e-5.
samplesPerThread — If you set this option, the value must be a positive integer number. This parameter controls the target number of samples the model sends to each thread. Each thread independently computes a logistic regression model, and the models are all combined at the end. This option defaults to 1 million.
Create a support vector machine model.
After you create the model, you can see its details by querying the sys.machine_learning_models and sys.machine_learning_model_options system catalog tables.
When you execute the model, the N - 1 features must be passed as parameters. The model returns the expected label.
After you execute a model, you can find the details about the execution results in the sys.support_vector_machine_models system catalog table.
For details, see the description of the associated system catalog tables in the Machine Learning section in the System Catalog.
Other Models