# Classification Analysis

Classification machine learning algorithms are available in the Ocient System. There are many popular classification algorithms, each working in slightly different ways with different strengths and weaknesses. These walkthroughs and examples cover some basic rules for using classification algorithms and how you can take advantage of these models in the Ocient System.

It can be very challenging to anticipate which classification model is best suited for a data set. Often, it is best to experiment with multiple different models to see what happens.

Despite its name, logistic regression is a classification algorithm, not a regression algorithm. Logistic regression is generally a good first step for trying out a classification model before moving on to more advanced options.

Logistic regression shares many restrictions with the support vector machine (SVM) model. Both models require that all features are numeric. However, you can still use non-numeric data if you convert it using one-hot encoding. The class labels can be any data type.

Use one-hot encoding to convert non-numeric data into numeric form.

For example, if a feature is a color (such as red, green, or blue), then one-hot encoding can make three columns named is_red, is_green, and is_blue. For each row, set one color to the value 1 and the other colors to 0.
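As a sketch (in Python, not Ocient syntax), the transformation looks like this, using the column names from the example:

```python
# One-hot encoding a color feature into is_red, is_green, is_blue.
COLORS = ["red", "green", "blue"]

def one_hot(color):
    """Return a dict of indicator columns for one color value."""
    return {f"is_{c}": 1 if color == c else 0 for c in COLORS}

rows = ["red", "blue", "green"]
encoded = [one_hot(c) for c in rows]
```

Exactly one indicator column is set to 1 in each encoded row.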

This example uses the lr_input table, which represents the academic performance of students.

There are two features: the number of hours spent studying (hours_studied) and the grade in the previous course (grade_in_prev_course). Both input features are numeric. There are also two classes: true and false. Class values do not need to be Boolean; they can be any distinct values.

Students who have better grades in the previous course and spend more hours studying are more likely to have top marks in the course. However, this table also includes a randomness factor, which means that rows with the same features can have different classifications. This is normal in real-world data.
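The lr_input data itself is not reproduced here, but a hypothetical generator like this Python sketch captures the idea: both features push the outcome toward top marks, and a noise term means that rows with identical features can receive different labels. The coefficients and threshold are illustrative assumptions, not the actual data definition.

```python
import random

random.seed(7)

def make_row():
    # Hypothetical generator: more study hours and a better previous grade
    # raise the chance of top marks, with a random component added.
    hours_studied = random.uniform(0, 10)
    grade_in_prev_course = random.uniform(50, 100)
    score = 0.5 * hours_studied + 0.1 * grade_in_prev_course + random.gauss(0, 2)
    got_top_marks = score > 10.0  # threshold chosen for illustration
    return hours_studied, grade_in_prev_course, got_top_marks

rows = [make_row() for _ in range(1000)]
```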

This example creates the logistic regression model on the table, including the metrics option. This option causes the model to calculate the percentage of samples that are correctly classified. The Ocient System puts this information into the system catalog tables.

This query selects from the machine_learning_models and logistic_regression_models system catalog tables.

The model captures the two values for the two target classes. The correctly_classified column shows that the model achieves almost 84 percent accuracy on the training data. This level of accuracy is good, especially for a model trained on data with significant randomness.

This model can predict which classification is most likely if you provide the number of hours studied and grades in the previous course.

First, review the average hours spent studying for each classification.

As expected, the rows where got_top_marks is true have a higher average value for hours_studied. The same should be true with grade_in_prev_course.

You can also query to find the rows that are anomalies. These are rows where students achieved top marks despite having minimal study hours and low grades in their previous classes.

For more information on using this model type, see Logistic Regression.

A support vector machine (SVM) tries to find a hyperplane that divides data into two classes with these objectives:

- The hyperplane has the maximum margin, meaning the greatest possible distance between the hyperplane and the two classes.
- The model correctly classifies the highest percentage of data points.

By default, SVM models in Ocient try to balance these two objectives. Users can increase the priority of either objective by using the regularizationCoefficient option.
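A minimal Python sketch of the standard soft-margin objective illustrates the tradeoff. The constant C below plays the role that regularizationCoefficient plays in Ocient, though the exact formulation Ocient uses is an assumption here.

```python
def svm_objective(w, b, points, labels, C=1.0):
    """Soft-margin SVM objective: margin term plus C-weighted hinge loss.

    Larger C prioritizes classifying points correctly; smaller C
    prioritizes a wide margin. Labels are -1 or 1.
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(max(0.0, 1 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
                for x, y in zip(points, labels))
    return margin_term + C * hinge
```

A point on the wrong side of the margin adds to the hinge term, so the optimizer trades margin width against classification errors.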

For data sets with two features, the SVM hyperplane represents a straight line on a Cartesian graph.

For three features, the hyperplane represents a plane.

This example uses the SVM model on the same classification data set of student performance as used in the logistic regression section.

The machine_learning_models and support_vector_machine_models system catalog tables for SVM contain almost the same information as logistic regression.

The system catalog tables indicate the same percentage of correctly classified rows as the logistic regression model (about 84 percent).

The SVM model can increase the accuracy further by using a kernel. A kernel is a transformation to make data linearly separable. SVM in Ocient allows you to specify any kernel of your choice.

For example, suppose that a circle can separate two classes of data, dividing data points between those inside and outside the circle. The SVM model cannot find a line that divides the two classes well; this problem is not linearly separable. However, you can transform the data into three dimensions (three features) to make it linearly separable. This is the idea of a kernel.

You specify the kernel using the options function1, function2, function3, etc. Each function defines the formula for the kernel to create new features. Functions can use features from the original data set, represented by x1, x2, x3, etc.

To explain this, revisit the circle example. Suppose that points inside a circle centered at (2, 2) belong to the class inside, and all other points belong to the class outside. Data that is separable by a circle in two dimensions is linearly separable in three dimensions with the features x1, x2, and x1^2 + x2^2.
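A short Python sketch verifies this claim: after lifting each point with the three kernel features, a single linear test reproduces the circle boundary (shown here for a unit circle, r = 1).

```python
def kernel_lift(x1, x2):
    """Lift a 2-D point to 3-D using the kernel features x1, x2, x1^2 + x2^2."""
    return (x1, x2, x1 * x1 + x2 * x2)

def inside_circle(x1, x2, cx=2.0, cy=2.0, r=1.0):
    """Ground truth: is the point inside the circle centered at (2, 2)?"""
    return (x1 - cx) ** 2 + (x2 - cy) ** 2 < r ** 2

def linear_separator(p):
    """Expanding (x1-2)^2 + (x2-2)^2 < r^2 gives z - 4*x1 - 4*x2 < r^2 - 8,
    which is linear in the lifted coordinates (x1, x2, z)."""
    x1, x2, z = p
    return z - 4 * x1 - 4 * x2 < 1.0 - 8.0  # r^2 - 8 with r = 1
```

The linear test in three dimensions agrees with the circle test in two dimensions for every point.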

This statement creates the data and the model.

The example includes three functions, meaning that the kernel increases the dimensionality from two to three. The kernel functions refer to the original feature values x1 and x2. If you examine the correctly_classified value in the machine_learning_models and support_vector_machine_models system catalog tables, you can see a model that classifies nearly everything correctly.

One advantage of SVM kernels is that they are automatic. The model saves all the kernel information and transforms the data as needed. Hence, you can execute this model to predict a new point without redefining the kernel. You can pass in the two original features, and the model returns a prediction.

The sys.machine_learning_model_options table is the location where the SVM model saves kernel information.

This table saves all options that you specify. Some models also add keys to this dictionary to save information needed when the model is executed.

For more information on using this model type, see Support Vector Machine.

A neural network using the FEEDFORWARD NETWORK model type can perform binary classification. To use this model for classification, you must convert labels either to 0 and 1 (using the log_loss option) or to -1 and 1 (using the hinge_loss option). Both options can achieve the same result.
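The two loss functions behind these options can be sketched in Python: log loss expects a predicted probability for labels in {0, 1}, while hinge loss expects a raw score for labels in {-1, 1}.

```python
import math

def log_loss(y, p):
    """Binary cross-entropy for a label y in {0, 1} and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, 1} and a raw model score."""
    return max(0.0, 1 - y * score)
```

Both losses are zero (or near zero) for confident correct predictions and grow as predictions move toward the wrong class.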

This model can handle more complex cases, but it might be easier to use a different model for linearly separable data. In other words, a neural network model is useful when data is difficult to linearly separate using a kernel.

This example shows a feedforward network model using the log_loss option. Note the CASE statement to convert labels to 0 or 1.

This query finds that the classification accuracy is about 97 percent.

To use the hinge_loss option, use -1 and 1 in the CASE statement.

The accuracy of the hinge_loss version is exactly the same as the log_loss version.

For more information on using this model type, see Feedforward Neural Network.

K-nearest neighbors (KNN) can only use numeric features, but it supports multi-class classification in the Ocient System. This algorithm can handle an arbitrary number of target classes.

One drawback of KNN is that it performs no work at training time other than taking a snapshot of the training data. The algorithm operates on the data when you execute the model, so it can be slow for large data sets.

To demonstrate KNN functionality, this example uses a similar circle centered at (2, 2) with some modifications to use more classes:

- Data within a unit circle centered at (2, 2) is classified as INSIDE.
- Data within a radius of 2 of the same center is classified as MIDDLE.
- Any other data is classified as OUTSIDE.
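A hypothetical Python labeling function for this scheme:

```python
import math

def classify_point(x1, x2, cx=2.0, cy=2.0):
    """Label a point by its distance from the center (2, 2)."""
    d = math.hypot(x1 - cx, x2 - cy)
    if d < 1.0:
        return "INSIDE"
    if d < 2.0:
        return "MIDDLE"
    return "OUTSIDE"
```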

KNN works by finding the k nearest points to the one the model is trying to classify. k is an integer value that you must specify during model creation.

The model calculates distances using the standard Euclidean distance. You can override the definition of distance by specifying a formula with the distance option.

After the model identifies the k nearest points, it computes a score for each class and returns the class with the highest score. The weight option controls how much each point contributes to the score. By default, the weight is the inverse of the distance (1.0/(d+0.1)). In other words, a point closer to the one you are trying to classify has a larger influence on the result than a point farther away. Any point outside of the nearest k points has no influence.
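Conceptually, the prediction step looks like this Python sketch, using the default inverse-distance weight:

```python
import math
from collections import defaultdict

def knn_classify(train, query, k=5):
    """Weighted KNN sketch. `train` is a list of ((x1, x2), label) pairs."""
    def dist(a, b):
        return math.dist(a, b)  # Euclidean distance

    # Keep only the k nearest training points.
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]

    # Score each class with the default weight 1.0 / (d + 0.1).
    scores = defaultdict(float)
    for point, label in nearest:
        scores[label] += 1.0 / (dist(point, query) + 0.1)
    return max(scores, key=scores.get)
```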

This example uses three classes.

These queries test some values to see if the classifications are correct.

KNN models perform work when you execute the model, and the amount of work increases as the input data increases.

One solution is data reduction: finding a smaller subset of rows to build the KNN model over while ensuring that the accuracy of the model remains high. This method front-loads the work when you create the model so that executing it on new data is much faster.

To use the data reduction feature, set the dataReduction option to true. By default, the reduction stops when the database finds a model that is 90 percent accurate over the training data or when the reduced model grows to 1,000 rows of data. You can override these defaults by setting the maxReducedRows or targetAccuracy options.
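Ocient's exact reduction algorithm is not documented here, but a classic condensed-nearest-neighbor-style sketch conveys the idea: keep only the rows the current subset misclassifies, stopping once the subset is accurate enough over the training data or hits the row limit.

```python
import math

def nearest_label(subset, point):
    """Label of the closest retained row (1-NN over the reduced subset)."""
    return min(subset, key=lambda row: math.dist(row[0], point))[1]

def reduce_data(train, target_accuracy=0.9, max_rows=1000):
    """Grow a small subset until it classifies the training data well enough.

    This is an illustrative assumption, not Ocient's actual algorithm.
    """
    subset = [train[0]]
    for point, label in train[1:]:
        accuracy = sum(nearest_label(subset, p) == l for p, l in train) / len(train)
        if accuracy >= target_accuracy or len(subset) >= max_rows:
            break
        if nearest_label(subset, point) != label:
            subset.append((point, label))
    return subset
```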

Additionally, if you are using data reduction, you can ask the database to collect metrics on the percentage of rows correctly classified by the model, because this is much faster to compute on the reduced data set.

This example builds another KNN model over the same data, but it also uses data reduction to find a model that is at least 95 percent accurate.

Query the machine_learning_models and k_nearest_neighbor_models system catalog tables to see the snapshot table used for the model and its accuracy.

The model with data reduction correctly classifies about 95 percent of the training data. To see the size of the KNN model snapshot, you can query it by its table name.

The snapshot is only 300 rows. Without data reduction, the row count would be orders of magnitude larger.

For more information on using this model type, see K Nearest Neighbor Classification.

The FEEDFORWARD NETWORK model type can also perform multi-class classification when the class labels are numeric. Multi-class classification works by using vectors, where each position in the vector holds the score for one class. The argmax of each output vector (the index of the largest score) determines the class.

Much of this example is similar to the previous Binary Classification with Neural Networks walkthrough with these modifications to the model options:

- Set the lossFunction option to cross_entropy.
- Set the outputs option to specify the number of classes for the model to use (in this example: 3).

When the loss function is cross_entropy, the useSoftMax option defaults to true.

Execute this model to return a vector.

To get the classification, use the VECTOR_ARGMAX function to retrieve the index of the largest value.

Matrices index the first row and first column with 1. However, the VECTOR_ARGMAX function starts the count at 0.
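In Python terms, the model output and VECTOR_ARGMAX behave like this sketch: softmax converts raw class scores to probabilities, and argmax counts positions from 0.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def vector_argmax(vec):
    """Index of the largest value, counting from 0 (like VECTOR_ARGMAX)."""
    return max(range(len(vec)), key=lambda i: vec[i])
```

Softmax does not change which position is largest, so the argmax of the probabilities matches the argmax of the raw scores.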

This query finds the overall accuracy of the model.

The model is about 93 percent accurate. You can improve the accuracy by modifying the Model Options, including:

- hiddenLayers
- hiddenLayerSize
- lossFuncNumSamples

Increasing these option values can improve accuracy, but might also increase the time needed to train the model.

For more information on using this model type, see Feedforward Neural Network.

Both logistic regression and SVM are binary classifiers, but you can use either to build a multi-class classifier that handles any number of classes. The Ocient System can perform this automatically with either model.

This example shows the CREATE statement for the SVM model, which uses the same data as the K Nearest Neighbors and Multi-Class Classification with Neural Networks examples.

This statement uses the same kernel as the example for performing binary classification on Support Vector Machines.

This query checks the metrics in the machine_learning_models and support_vector_machine_models system catalog tables to see the accuracy.

This model is slightly more accurate than the neural network classifier.

This example tries logistic regression.

Logistic regression shows lower accuracy than the neural network model.

These classification models can handle non-numeric data.

The first of these models is naive Bayes. This model can handle binary or multi-class classification and features of any data type.

By default, naive Bayes treats each feature as a discrete value. For example, if a feature takes the values 1, 2, 3, and 4, the model does not give any numeric significance to them. In this case, the model cannot predict the class for a sample with a feature value of 3.5, because that value never appears in the training data.

However, the naive Bayes model can treat certain input features as continuous features, which must be numeric. In this case, the model finds the best-fit normal distribution for each continuous feature and uses it in the probability calculations.

Naive Bayes models use the probability rules of Bayes' Theorem to compute the most likely class based on the training data. These rules are only exact if the features are independent (uncorrelated). However, in practice, these models tend to work well on real-world data.
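A toy Python sketch of discrete naive Bayes shows the calculation: multiply the class prior by per-feature likelihoods estimated from counts, then take the class with the highest product. The example rows are hypothetical, echoing the color, type, and origin features used in the table.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count feature values per class to estimate priors and likelihoods."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature_index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def predict(model, row):
    """Return the class with the highest prior * likelihood product."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best, best_score = None, 0.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # likelihood P(feature = value | class); 0 if never seen
            score *= feature_counts[(i, label)][value] / count
        if score > best_score:
            best, best_score = label, score
    return best
```

Production implementations add smoothing so that unseen feature values do not zero out a class; this sketch omits that for brevity.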

This example uses this simple table to demonstrate naive Bayes models.

color, type, and origin are the input features, which are all strings. true and false are the class labels. All of the input features in this case are discrete.

This example builds the naive Bayes model.

This query examines the information in the machine_learning_models and naive_bayes_models system catalog tables for naive Bayes models.

In the model, the correctly_classified value indicates 80 percent accuracy, meaning just two rows are classified incorrectly. The two tables in the temp schema hold all the data necessary for executing the model as a function.

This query calls the model to classify new data.

Here is another example with a different data set to show how continuous features work with naive Bayes models.

In this case, the only input feature is a continuous feature, but you can have multiple continuous features, and you can have a mix of continuous and discrete features.

To build a naive Bayes model over this data, you must set which input features are continuous in the model.

The continuousFeatures option is a comma-separated list (no spaces) of input column indexes that are continuous. Indexes start at 1.

This query checks the correctly_classified value in the machine_learning_models and naive_bayes_models system catalog tables.

Naive Bayes is a powerful classification algorithm, but it operates as a black box, making it hard to assess how it makes classification decisions.

For more information, see Naive Bayes Classification.

Unlike naive Bayes, the decision tree model allows you to see how the model works and makes its decisions.

A decision tree branches based on the value of a single feature, and the leaves of the tree are the output classes. In SQL, the model is expressed as multiple nested CASE statements. You can examine the whole nested CASE statement that defines the model, but it might be very long.

The decision tree model in Ocient has limited support for continuous features. The model finds the average value of the feature and allows a two-way branch based on whether the value is lower or higher than the average.
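As a sketch, the average-value split described above looks like this in Python (predicting the majority label on each side of the split is an illustrative assumption):

```python
def fit_split(values, labels):
    """Split a continuous feature at its average, as described above,
    and record the majority label on each side of the threshold."""
    threshold = sum(values) / len(values)
    low = [l for v, l in zip(values, labels) if v < threshold]
    high = [l for v, l in zip(values, labels) if v >= threshold]

    def majority(side):
        return max(set(side), key=side.count)

    return threshold, majority(low), majority(high)

def predict(model, value):
    """Two-way branch: below the average goes one way, at or above the other."""
    threshold, low_label, high_label = model
    return low_label if value < threshold else high_label
```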

Creating a decision tree model is exactly like creating a naive Bayes model, except that it uses a different model type. Decision trees also support the continuousFeatures option, which functions the same way.

These decision tree examples use the same data sets as the naive Bayes examples.

This query examines the machine_learning_models and decision_tree_models system catalog tables.

This model has the same 80 percent accuracy as the naive Bayes model.

This query examines the CASE statement that defines the model.

Like with the naive Bayes model, the decision tree model can predict new data.

This example builds a model over the golfing data.

Review the machine_learning_models and decision_tree_models system catalog tables to see the accuracy of this model.

Once again, the decision tree shows similar accuracy to the naive Bayes model.