Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.ocient.com/llms.txt

Use this file to discover all available pages before exploring further.

Classification machine learning algorithms are available in . There are many different popular classification algorithms, which work in slightly different ways with different strengths and weaknesses. These walkthroughs and examples cover some basic rules for using classification algorithms and how you can take advantage of these models in . It can be very challenging to anticipate which classification model is best suited for a data set. Often, it is best to experiment with multiple different models to see what happens.

Logistic Regression

Despite its name, logistic regression is a classification algorithm, not a regression algorithm. Logistic regression is generally a good first step for trying out a classification model before moving on to more advanced options. Logistic regression shares many restrictions with the support vector machines model. Both models require that all features are numeric. However, you can still use non-numeric data if you convert it using one-hot encoding. The class labels can be any data type.
Use one-hot encoding to convert non-numeric data into numeric form.For example, if a feature is a color (such as red, green, or blue), then one-hot encoding can make three columns named is_red, is_green, and is_blue. For each row, set one color to the value 1 and the other colors to 0.
This example uses the lr_input table, which represents the academic performance of students.
SQL
CREATE TABLE mldemo.lr_input AS (
    SELECT hours_studied,
        grade_in_prev_course,
        CASE
            WHEN RAND() > (1.0 - effective_score) THEN true
            ELSE false
        END AS got_top_marks
    FROM
       ( SELECT hours_studied,
                grade_in_pr ev_course,
                (2 * grade_in_prev_course + hours_studied) / 10.0 AS effective_score
            FROM
               ( SELECT mod(a.c1, 10) AS hours_studied,
                        mod(b.c1, 5) AS grade_in_prev_course
                    FROM sys.dummy100 a,
                        sys.dummy100 b
                )
        )
);
There are two features: the number of hours spent studying hours_studied and the grade in the previous course grade_in_prev_course. Both input features are numeric. There are also two classes: true and false. Class values do not need to be Boolean, and they can be any distinct values. Students who have better grades in the previous course and spend more hours studying are more likely to have top marks in the course. However, this table also includes a randomness factor, which means that rows with the same features can have different classifications. This is normal in real-world data. This example creates the logistic regression model on the table, including the metrics option. This option causes the model to calculate the percentage of samples that are correctly classified. The Ocient System puts this information into the system catalog tables.
SQL
CREATE MLMODEL lr1 TYPE LOGISTIC REGRESSION ON (
    SELECT *
    FROM mldemo.lr_input
) options('metrics'->'true');
This query selects the machine_learning_models and logistic_regression_models system catalog tables.
SQL
SELECT name,
    on_select,
    num_arguments,
    zero_case,
    one_case,
    correctly_classified
FROM sys.machine_learning_models a,
    sys.logistic_regression_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'lr1';

name                                         on_select                                    num_arguments zero_case                                    one_case                                     correctly_classified
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
lr1                                          select * from mldemo.lr_input                2             cast('true' as BOOLEAN)                      cast('false' as BOOLEAN)                     0.8397

Fetched 1 row
The model captures the two values for the two target classes. The correctly_classified column shows the model has almost 84 percent accuracy from working on the training data. This level of accuracy is pretty good, especially for a model using training data with significant randomness. This model can predict which classification is most likely if you provide the number of hours studied and grades in the previous course. First, review the average hours spent studying for each classification.
SQL
SELECT avg(hours_studied)
FROM mldemo.lr_input
WHERE got_top_marks;

(sum(hours_studied)) /(null_if(double(count(hours_studied)), (0)))
------------------------------------------------------------------
5.231264244536801

Fetched 1 row

SELECT avg(hours_studied)
FROM mldemo.lr_input
WHERE NOT got_top_marks;

(sum(hours_studied)) /(null_if(double(count(hours_studied)), (0)))
------------------------------------------------------------------
2.3534041715859897

Fetched 1 row
As expected, the rows where got_top_marks is true have a higher average value for hours_studied. The same should be true with grade_in_prev_course.
SQL
SELECT AVG(grade_in_prev_course)
FROM mldemo.lr_input
WHERE got_top_marks;

(SUM(grade_in_prev_course)) /(NULL_IF(DOUBLE(COUNT(grade_in_prev_course)), (0)))
--------------------------------------------------------------------------------
2.3535326451266925

Fetched 1 row

SELECT avg(grade_in_prev_course)
FROM mldemo.lr_input
WHERE NOT got_top_marks;

(sum(grade_in_prev_course)) /(NULL_IF(DOUBLE(COUNT(grade_in_prev_course)), (0)))
--------------------------------------------------------------------------------
0.9622195985832349

Fetched 1 row
You can also query to find the rows that are anomalies. In most cases, students achieved top marks in their class despite having minimal study hours and low grades in their previous classes.
SQL
SELECT *
FROM mldemo.lr_input
WHERE got_top_marks != lr1(hours_studied, grade_in_prev_course)
LIMIT 10;

hours_studied       grade_in_prev_course got_top_marks
-------------------------------------------------------
1                   1                    true
1                   1                    true
1                   4                    false
1                   1                    true
1                   2                    true
1                   0                    true
1                   2                    true
1                   1                    true
1                   2                    true
1                   1                    true

Fetched 10 rows
For details about using this model type, see Logistic Regression.

Support Vector Machine

Support vector machine (SVM) tries to find a hyperplane to divide data into two classes with these objectives:
  • The hyperplane has the maximum margin, meaning the most distance between the two classes.
  • The model correctly classifies the highest percentage of data points.
By default, SVM models in Ocient try to balance these two objectives. You can increase the priority of either objective by using the regularizationCoefficient option.
For data sets with two features, the SVM hyperplane represents a straight line on a Cartesian graph.For three features, the hyperplane represents a plane.
This example uses the SVM model on the same classification data set of student performance as used in the logistic regression section.
SQL
CREATE MLMODEL svm1 TYPE SUPPORT VECTOR MACHINE ON (
    SELECT *
    FROM mldemo.lr_input
) options('metrics'->'true');

Modified 0 rows
The machine_learning_models and support_vector_machine_models system catalog tables for SVM contain almost the same information as logistic regression.
SQL
SELECT name,
    on_select,
    num_arguments,
    negative_case,
    positive_case,
    correctly_classified
FROM sys.machine_learning_models a,
    sys.support_vector_machine_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'svm1';
name                                         on_select                                    num_arguments negative_case                                positive_case                                correctly_classified
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
svm1                                         select * from mldemo.lr_input                2             cast('false' as BOOLEAN)                     cast('true' as BOOLEAN)                      0.8397

Fetched 1 row
The system catalog tables indicate the same percentage of correctly classified rows as the logistic regression model (about 84 percent). The SVM model can increase the accuracy further by using a kernel. A kernel is a transformation to make data linearly separable. SVM in Ocient allows you to specify any kernel of your choice. For example, suppose that a circle can separate two classes of data, dividing data points between those inside and outside the circle. The SVM model cannot find a line to divide the two classes very well, or this problem is not linearly separable. However, you can transform the data into three dimensions (three features) to make it linearly separable. This is the idea of a kernel. You specify the kernel using the options function1, function2, function3, etc. Each function defines the formula for the kernel to create new features. Functions can use features from the original data set, represented by x1, x2, x3, etc. To explain this, revisit the circle example. Suppose that the circle centered at the point (2, 2) is the class inside. Otherwise, the circle is the class outside. The data that is separable by a circle in two dimensions is linearly separable in three dimensions with the features x1, x2, x1^2 + x2^2. This statement creates the data and the model.
SQL
CREATE TABLE mldemo.circle AS (
    SELECT x,
        y,
        CASE
            WHEN SQRT((x -2) ^ 2 + (y -2) ^ 2) <= 1 then 'INSIDE'
            ELSE 'OUTSIDE'
        END AS label
    FROM (
            SELECT RAND() * 10 as x,
                RAND() * 10 as y
            FROM sys.dummy10000
        )
);

Modified 10000 rows

CREATE MLMODEL svm2 type SUPPORT VECTOR MACHINE ON (
    SELECT x,
        y,
        label
    FROM mldemo.circle
) options(
    'metrics'->'true',
    'function1'->'x1',
    'function2'->'x2',
    'function3'->'x1^2 + x2^2'
);

Modified 0 rows
The example includes three functions, meaning that the kernel increases from two to three dimensions. The kernel functions refer to the original feature values x1 and x2. If you examine the correctly_classified value in the machine_learning_models and support_vector_machine_models system catalog tables, you can see a model that classifies nearly everything correctly.
SQL
SELECT NAME,
    negative_case,
    positive_case,
    correctly_classified
FROM sys.machine_learning_models a,
    sys.support_vector_machine_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'svm2';
name                                         negative_case                                positive_case                                correctly_classified
-------------------------------------------------------------------------------------------------------------------------------------------------------------
svm2                                         cast('OUTSIDE' as CHAR)                      cast('INSIDE' as CHAR)                       0.9913

Fetched 1 row
One advantage of SVM kernels is that they are automatic. The model saves all the kernel information and transforms the data as needed. Hence, you can execute this model to predict a new point without redefining the kernel. You can pass in the two features the model was built over, and a prediction comes out.
The sys.machine_learning_model_options table is the location where the SVM model saves kernel information.This table saves all the options that you specify. Some models also add additional keys to this dictionary to save certain information needed for when the model is executed.
SQL
SELECT svm2(2, 2);
COALESCE(('INSIDE'), (NULL));
---------------------------------------------
INSIDE

Fetched 1 row
For details about using this model type, see Support Vector Machine.

Binary Classification with Neural Networks

A neural network using the FEEDFORWARD NETWORK model type can perform binary classification. To use this model for classification, you must either convert labels to either 0 and 1 (using the log_loss option) or convert labels to -1 and 1 (using the hinge_loss option). Both options can achieve the same result. This model can handle more complex cases, but using a different model for linearly separable data might be easier. In other words, a neural network model is useful when data is difficult to separate linearly using a kernel. This example shows a feedforward network model using the log_loss option. Note the CASE statement to convert labels to 0 or 1.
SQL
CREATE MLMODEL logloss TYPE FEEDFORWARD NETWORK ON (
    SELECT x,
        y,
        CASE
            WHEN label = 'INSIDE' THEN 0
            ELSE 1
        END AS target
    FROM mldemo.circle
) options(
    'metrics'->'true',
    'hiddenLayers'->'2',
    'hiddenLayerSize'->'2',
    'outputs'->'1',
    'lossFunction'->'log_loss'
);

Modified 0 rows
This query finds that the classification accuracy is about 97 percent.
SQL
SELECT COUNT(*) / 10000.0
FROM (
        SELECT x,
            y,
            label,
            CASE
                WHEN logloss(x, y) = 0 then 'INSIDE'
                ELSE 'OUTSIDE'
            END AS predicted
        FROM mldemo.circle
    )
WHERE label = predicted;

(_count(*)_0)/((10000.0))
--------------------------
0.9698

Fetched 1 row
To use the hinge_loss option, use -1 and 1 in the CASE statement.
SQL
CREATE MLMODEL hingeloss TYPE FEEDFORWARD NETWORK ON (
    SELECT x,
        y,
        CASE
            WHEN label = 'INSIDE' THEN -1
            ELSE 1
        END AS target
    FROM mldemo.circle
) options(
    'metrics'->'true',
    'hiddenLayers'->'2',
    'hiddenLayerSize'->'2',
    'outputs'->'1',
    'lossFunction'->'hinge_loss'
);
Modified 0 rows
The accuracy of the hinge_loss version is exactly the same as the log_loss version.
SQL
SELECT count(*) / 10000.0
FROM (
        SELECT x,
            y,
            label,
            CASE
                WHEN hingeloss(x, y) = -1 THEN 'INSIDE'
                ELSE 'OUTSIDE'
            END AS predicted
        FROM mldemo.circle
    )
WHERE label = predicted;

(_count(*)_0)/((10000.0))
--------------------------
0.9698

Fetched 1 row
For details about using this model type, see Feedforward Neural Network.

K-Nearest Neighbors

K-nearest neighbors (KNN) can only use numeric features, but KNN handles multi-classification in the Ocient System. This algorithm can handle an arbitrary number of target classes. One drawback with KNN is that it does not work at training time other than to take a snapshot of the training data. The algorithm operates on the data when you execute the model, so it can be a slow model for large data sets. To demonstrate KNN functionality, this example uses a similar circle of size (2, 2) with some modifications to use more classes:
  • Data within a unit circle of (2, 2) is classified as INSIDE.
  • Data within a radius of 2 is classified as MIDDLE.
  • Any other data is classified as OUTSIDE.
SQL
CREATE TABLE mldemo.circles as (
    SELECT x, y,
        CASE
            WHEN SQRT((x -2) ^ 2 + (y -2) ^ 2) <= 1 THEN 'INSIDE'
            WHEN SQRT((x -2) ^ 2 + (y -2) ^ 2) <= 2 THEN 'MIDDLE'
            ELSE 'OUTSIDE'
        END AS label
    FROM (
            SELECT RAND() * 10 as x,
                RAND() * 10 as y
            FROM sys.dummy10000
        )
);

Modified 10000 rows
KNN works by finding the k nearest points to the one the model is trying to predict a classification for. k is an integer value that you must specify during model creation. The model calculates distances using normal Euclidean distances. You can override the definition of distance by specifying a formula using the distance option. After the model identifies the points, it computes a score for each class and returns the class with the highest score. The weight option can influence each score for the specified point. By default, the weight value is based on the inverse of the distance (1.0/(d+0.1)). In other words, a point closer to the point you are trying to predict has a larger influence on the output result than a point further away. Any point outside of the nearest k points has no influence. This example uses three classes.
SQL
CREATE MLMODEL knn1 TYPE K NEAREST NEIGHBORS ON (
    SELECT x,
        y,
        label
    FROM mldemo.circles
) options('k'->'3');

Modified 0 rows
These queries test some values to see if the classifications are correct.
SQL
SELECT knn1(2, 2);

---------------------------------------------
INSIDE

Fetched 1 row

SELECT knn1(0.5, 2);

---------------------------------------------
MIDDLE

Fetched 1 row

SELECT knn1(5,5);

---------------------------------------------
OUTSIDE

Fetched 1 row
KNN models perform work when you execute the model, and the amount of work increases as the input data increases. One solution to this is data reduction, which means finding a smaller subset of rows to build the KNN model over, enough to ensure that the accuracy of the model is still high. This method front-loads the work when you create the model so that it is much faster to execute on new data. To use the data reduction feature, set the dataReduction option to true. By default, the execution stops when the database finds a model that is 90 percent accurate over the training data or when the reduced model increases to 1,000 rows of data. You can override these defaults by setting the maxReducedRows or targetAccuracy options. Additionally, if you are using data reduction, you can ask the database to collect metrics on the percent of rows correctly classified by the model because this is much faster to do on the reduced data set. This example builds another KNN model over the same data and uses data reduction to find a model that is at least 95 percent accurate.
SQL
CREATE MLMODEL knn2 TYPE K NEAREST NEIGHBORS ON (
    SELECT x,
        y,
        label
    FROM mldemo.circles
) options(
    'k'->'3',
    'dataReduction'->'true',
    'metrics'->'true',
    'targetAccuracy'->'0.95'
);

Modified 0 rows
Query the machine_learning_models and k_nearest_neighbor_models system catalog tables to see the snapshot table used for the model and its accuracy.
SQL
SELECT table_name,
    correctly_classified
FROM sys.machine_learning_models a,
    sys.k_nearest_neighbor_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'knn2';

table_name                                   correctly_classified
-------------------------------------------------------------------
temp.knn86947930347148676                    0.9541

Fetched 1 row
The model with data reduction correctly classifies about 95 percent of the training data. To see the size of the KNN model snapshot, you can query it by its table name.
SQL
SELECT COUNT(*) FROM temp.knn86947930347148676;

count(*)
--------------------
300

Fetched 1 row
The snapshot is only 300 rows. Without data reduction, the row count would be much larger by orders of magnitude. For details about using this model type, see K Nearest Neighbor Classification.

Multi-Class Classification with Neural Networks

The FEEDFORWARD NETWORK model type can also perform multi-class classification when the feature classes are numeric. Multi-class classification works by using vectors, where each position in the vector indicates the score for that class. The argmax value of the score of each vector returns the class index. Much of this example is similar to the previous Binary Classification with Neural Networks tutorial, with these modifications to the model options:
  • Set the lossFunction option to cross_entropy.
  • Set the outputs option to specify the number of classes for the model to use (in this example: 3).
  • Set the useSoftMax option to true, which applies a softmax function to the final vector.
When the loss function is cross_entropy_loss , the option useSoftMax defaults to true.
SQL
CREATE MLMODEL crossentropy TYPE FEEDFORWARD NETWORK ON (
    SELECT x,
        y,
        CASE
            WHEN label = 'INSIDE' THEN { { 1.0,
            0.0,
            0.0 } }
            WHEN label = 'MIDDLE' THEN { { 0.0,
            1.0,
            0.0 } }
            ELSE { { 0.0,
            0.0,
            1.0 } }
        END AS TARGET
    FROM mldemo.circles
) options(
    'metrics'->'true',
    'hiddenLayers'->'2',
    'hiddenLayerSize'->'2',
    'outputs'->'3',
    'lossFunction'->'cross_entropy_loss',
    'useSoftMax'->'true'
);

Modified 0 rows
Execute this model to return a vector.
SQL
SELECT crossentropy(5,5) AS predicted;

predicted
--------------------------------------------------------------------------------
[[1.3436428911280516E-7, 1.811726873154417E-7, 0.9999996844630236]]

Fetched 1 row
To get the classification, use the VECTOR_ARGMAX function to retrieve the index of the largest value. This function returns the 1‑based index of the largest value in the vector.
SQL
SELECT VECTOR_ARGMAX(crossentropy(5,5)) AS predicted;

predicted
--------------------
3

Fetched 1 row
This query finds the overall accuracy of the model.
SQL
SELECT count(*) / 10000.0
FROM mldemo.circles
WHERE (
        label = 'INSIDE'
        AND VECTOR_ARGMAX(crossentropy(x, y)) = 1
    )
    OR (
        label = 'MIDDLE'
        AND VECTOR_ARGMAX(crossentropy(x, y)) = 2
    )
    OR (
        label = 'OUTSIDE'
        AND VECTOR_ARGMAX(crossentropy(x, y)) = 3
    );

(_count(*)_0)/((10000.0))
--------------------------
0.9298

Fetched 1 row
The model is about 93 percent effective. You can improve the accuracy by modifying the Model Options, including:
  • hiddenLayers
  • hiddenLayerSize
  • lossFuncNumSamples
Increasing the resources of these options can improve accuracy, but could increase the time needed to train the model. For details about using this model type, see Feedforward Neural Network.

Multi-Class Classification with Support Vector Machines and Logistic Regression

Both logistic regression and SVM are binary classifiers, but you can use either to build a multi-class classifier to handle any number of classes. OcientML can perform this automatically with either model. This example shows the CREATE statement for the SVM model, which uses the same data as the K Nearest Neighbors and Multi-Class Classification with Neural Networks examples.
SQL
CREATE MLMODEL circ1 TYPE SUPPORT VECTOR MACHINE ON (
    SELECT x,
        y,
        label
    FROM mldemo.circles
) options(
    'metrics'->'true',
    'function1'->'x1',
    'function2'->'x2',
    'function3'->'x1^2 + x2^2'
);

Modified 0 rows
This statement uses the same kernel as the example for performing binary classification on Support Vector Machines. This query checks the metrics in the machine_learning_models and support_vector_machine_models system catalog tables to see the accuracy.
SQL
SELECT correctly_classified
FROM sys.machine_learning_models a,
    sys.support_vector_machine_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'circ1';

correctly_classified
----------------------
0.9579

Fetched 1 row
This model is slightly more accurate than the neural network classifier. This example tries logistic regression.
SQL
CREATE mlmodel circ2 TYPE LOGISTIC REGRESSION ON (
    SELECT x,
        y,
        label
    FROM mldemo.circles
) options('metrics'->'true');

Modified 0 rows

SELECT correctly_classified
FROM sys.machine_learning_models a,
    sys.logistic_regression_models b
WHERE a.id = b.machine_learning_model_id
    AND NAME = 'circ2';

correctly_classified
----------------------
0.9028

Fetched 1 row
Logistic regression shows lower accuracy than the neural network model. These classification models can handle non-numeric data.

Naive Bayes

The first of these models is naive Bayes. This model can handle binary or multi-class classification and features of any data type. By default, naive Bayes treats each feature as a discrete value. For example, if you assign features with values of 1, 2, 3, and 4, the model does not give any numeric significance to them. In this case, this model cannot predict the class for the sample with a feature value of 3.5 because that value is not explicitly assigned as a feature. However, the naive Bayes model can define certain input features as continuous features. These should be numeric values. In this case, the naive Bayes model finds the best-fit normal distribution for that feature to do the probability calculations. Naive Bayes models essentially use the probability rules of Bayes’ Theorem to compute the most likely class based on the training data. The rules are only correct if the features are independent (uncorrelated). However, in practice, the models tend to work well with real-world data. This example uses this simple table to demonstrate naive Bayes models.
SQL
SELECT * FROM mldemo.cars;

color                                        type                                         origin                                       stolen
----------------------------------------------------------------------------------------------------------------------------------------------
yellow                                       sports                                       domestic                                     false
yellow                                       sports                                       imported                                     true
yellow                                       suv                                          imported                                     false
red                                          sports                                       domestic                                     false
yellow                                       suv                                          domestic                                     false
red                                          suv                                          imported                                     false
red                                          sports                                       domestic                                     true
yellow                                       suv                                          imported                                     true
red                                          sports                                       imported                                     true
red                                          sports                                       domestic                                     true

Fetched 10 rows
color, type, and origin are the input features, which are all strings. true and false are the class labels. In this case, all of the input features are discrete. This example builds the naive Bayes model.
SQL
CREATE MLMODEL nb1 TYPE NAIVE BAYES ON (
    SELECT color,
        type,
        origin,
        stolen
    FROM mldemo.cars
) options('metrics'->'true');

Modified 0 rows
This query examines the information in the machine_learning_models and naive_bayes_models system catalog tables for naive Bayes models.
SQL
SELECT NAME,
    on_select,
    num_arguments,
    result_probability_table,
    feature_result_matrix_table,
    correctly_classified
FROM sys.machine_learning_models a,
    sys.naive_bayes_models b
WHERE a.id = b.machine_le arning_model_id
    AND NAME = 'nb1';

name                                         on_select                                    num_arguments result_probability_table                     feature_result_matrix_table                  correctly_classified
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
nb1                                          select color, "type", origin, stolen from user.nb_input 3 temp.rpt83581840700868656                    temp.frm83581860860782808                    0.8

Fetched 1 row
In the model, the correctly_classified value indicates the accuracy is 80 percent, meaning just two rows are classified incorrectly. The two table names in the temp schema hold all the necessary data for executing the model as a function. This query calls the model to classify new data.
SQL
SELECT nb1('red', 'suv', 'domestic') AS prediction;

prediction
-----------
false

Fetched 1 row
Here is another example with a different data set to show how continuous features work with naive Bayes models.
SQL
SELECT * FROM mldemo.golf;

humidity            play
--------------------------
70                  false
75                  true
86                  true
95                  false
80                  true
85                  false
96                  true
90                  true
91                  false
90                  false
70                  true
70                  true
80                  true
65                  true

Fetched 14 rows
In this case, the only input feature is a continuous feature, but you can have multiple continuous features, and you can have a mix of continuous and discrete features. To build a naive Bayes model over this data, you must set which input features are continuous in the model.
SQL
CREATE MLMODEL nb2 TYPE NAIVE BAYES ON (
    SELECT *
    FROM mldemo.golf
) options('continuousFeatures'->'1', 'metrics'->'true');

Modified 0 rows
The continuousFeatures option is a comma-separated list (no spaces) of input column indexes that are continuous. Indexes start at 1. This query checks the correctly_classified value in the machine_learning_models and naive_bayes_models system catalog tables.
SQL
SELECT correctly_classified
FROM sys.machine_learning_models a,
    sys.naive_bayes_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'nb2';

correctly_classified
----------------------
0.7142857142857143

Fetched 1 row
Naive Bayes is a powerful classification algorithm, but it operates as a black box, making it hard to assess how it makes classification decisions. For details, see Naive Bayes Classification.

Decision Trees

Unlike naive Bayes, the decision tree model allows you to see how the model works and makes its decisions. A decision tree branches based on the value of a single feature, while the leaves of the tree are the output classes. For SQL syntax, this model uses multiple nested CASE statements. You can examine the whole nested CASE statement that defines the model, but it might be very long. The decision tree model in Ocient has limited support for continuous features. The model finds the average value of the feature and allows a two-way branch based on whether the value is lower or higher than the average. Creating a decision tree model is exactly like a naive Bayes model, except with a different type. Decision trees also use the continuousFeatures option, which works the same way. These decision tree examples use the same data sets as the naive Bayes examples.
SQL
CREATE MLMODEL dt1 TYPE DECISION TREE ON (
    SELECT color,
        "type",
        origin,
        stolen
    FROM mldemo.cars
) options('metrics'->'true');

Modified 0 rows
This query examines the machine_learning_models and decision_tree_models system catalog tables.
SQL
SELECT correctly_classified
FROM sys.machine_learning_models a,
    sys.decision_tree_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'dt1';

correctly_classified
----------------------
0.8

Fetched 1 row
This model has the same 80 percent accuracy as the naive Bayes model. This query examines the CASE statement that defines the model.
SQL
SELECT case_statement
FROM sys.machine_learning_models a,
    sys.decision_tree_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'dt1';

case_statement
---------------------------------------------
CASE
    WHEN x3 = cast('domestic' as CHAR) THEN CASE
        WHEN x1 = cast('red' as CHAR) THEN CASE
            WHEN x2 = cast('sports' as CHAR) THEN cast('true' as BOOLEAN)
            ELSE cast('false' as BOOLEAN)
        END
        WHEN x1 = cast('yellow' as CHAR) THEN cast('false' as BOOLEAN)
        ELSE cast('false' as BOOLEAN)
    END
    WHEN x3 = cast('imported' as CHAR) THEN CASE
        WHEN x2 = cast('suv' as CHAR) THEN CASE
            WHEN x1 = cast('red' as CHAR) THEN cast('false' as BOOLEAN)
            WHEN x1 = cast('yellow' as CHAR) THEN cast('true' as BOOLEAN)
            ELSE cast('false' as BOOLEAN)
        END
        WHEN x2 = cast('sports' as CHAR) THEN cast('true' as BOOLEAN)
        ELSE cast('false' as BOOLEAN)
    END
    ELSE cast('false' as BOOLEAN)
END

Fetched 1 row
Like with the naive Bayes model, the decision tree model can predict new data.
SQL
SELECT dt1('red', 'suv', 'domestic') AS prediction;

prediction
-----------
false

Fetched 1 row
This example builds a model over the golfing data.
SQL
CREATE mlmodel dt2 TYPE DECISION TREE ON (
    SELECT *
    FROM mldemo.golf
) options('continuousFeatures'->'1', 'metrics'->'true');

Modified 0 rows
Review the machine_learning_models and decision_tree_models system catalog tables to see the accuracy of this model.
SQL
SELECT correctly_classified
FROM sys.machine_learning_models a,
    sys.decision_tree_models b
WHERE a.id = b.machine_learning_model_id
    AND name = 'dt2';

correctly_classified
----------------------
0.7142857142857143

Fetched 1 row
Again, the decision tree shows similar accuracy to the naive Bayes model. Machine Learning Model Functions Classification Models
Last modified on May 27, 2026