Documentation Index
Fetch the complete documentation index at: https://docs.ocient.com/llms.txt
Use this file to discover all available pages before exploring further.
Classification machine learning algorithms are available in . There are many different popular classification algorithms, which work in slightly different ways with different strengths and weaknesses. These walkthroughs and examples cover some basic rules for using classification algorithms and how you can take advantage of these models in .
It can be very challenging to anticipate which classification model is best suited for a data set. Often, it is best to experiment with multiple different models to see what happens.
Logistic Regression
Despite its name, logistic regression is a classification algorithm, not a regression algorithm. Logistic regression is generally a good first step for trying out a classification model before moving on to more advanced options.
Logistic regression shares many restrictions with the support vector machines model. Both models require that all features are numeric. However, you can still use non-numeric data if you convert it using one-hot encoding. The class labels can be any data type.
Use one-hot encoding to convert non-numeric data into numeric form.For example, if a feature is a color (such as red, green, or blue), then one-hot encoding can make three columns named is_red, is_green, and is_blue. For each row, set one color to the value 1 and the other colors to 0.
This example uses the lr_input table, which represents the academic performance of students.
CREATE TABLE mldemo.lr_input AS (
SELECT hours_studied,
grade_in_prev_course,
CASE
WHEN RAND() > (1.0 - effective_score) THEN true
ELSE false
END AS got_top_marks
FROM
( SELECT hours_studied,
grade_in_pr ev_course,
(2 * grade_in_prev_course + hours_studied) / 10.0 AS effective_score
FROM
( SELECT mod(a.c1, 10) AS hours_studied,
mod(b.c1, 5) AS grade_in_prev_course
FROM sys.dummy100 a,
sys.dummy100 b
)
)
);
There are two features: the number of hours spent studying hours_studied and the grade in the previous course grade_in_prev_course. Both input features are numeric. There are also two classes: true and false. Class values do not need to be Boolean, and they can be any distinct values.
Students who have better grades in the previous course and spend more hours studying are more likely to have top marks in the course. However, this table also includes a randomness factor, which means that rows with the same features can have different classifications. This is normal in real-world data.
This example creates the logistic regression model on the table, including the metrics option. This option causes the model to calculate the percentage of samples that are correctly classified. The Ocient System puts this information into the system catalog tables.
CREATE MLMODEL lr1 TYPE LOGISTIC REGRESSION ON (
SELECT *
FROM mldemo.lr_input
) options('metrics'->'true');
This query selects the machine_learning_models and logistic_regression_models system catalog tables.
SELECT name,
on_select,
num_arguments,
zero_case,
one_case,
correctly_classified
FROM sys.machine_learning_models a,
sys.logistic_regression_models b
WHERE a.id = b.machine_learning_model_id
AND name = 'lr1';
name on_select num_arguments zero_case one_case correctly_classified
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
lr1 select * from mldemo.lr_input 2 cast('true' as BOOLEAN) cast('false' as BOOLEAN) 0.8397
Fetched 1 row
The model captures the two values for the two target classes. The correctly_classified column shows the model has almost 84 percent accuracy from working on the training data. This level of accuracy is pretty good, especially for a model using training data with significant randomness.
This model can predict which classification is most likely if you provide the number of hours studied and grades in the previous course.
First, review the average hours spent studying for each classification.
SELECT avg(hours_studied)
FROM mldemo.lr_input
WHERE got_top_marks;
(sum(hours_studied)) /(null_if(double(count(hours_studied)), (0)))
------------------------------------------------------------------
5.231264244536801
Fetched 1 row
SELECT avg(hours_studied)
FROM mldemo.lr_input
WHERE NOT got_top_marks;
(sum(hours_studied)) /(null_if(double(count(hours_studied)), (0)))
------------------------------------------------------------------
2.3534041715859897
Fetched 1 row
As expected, the rows where got_top_marks is true have a higher average value for hours_studied. The same should be true with grade_in_prev_course.
SELECT AVG(grade_in_prev_course)
FROM mldemo.lr_input
WHERE got_top_marks;
(SUM(grade_in_prev_course)) /(NULL_IF(DOUBLE(COUNT(grade_in_prev_course)), (0)))
--------------------------------------------------------------------------------
2.3535326451266925
Fetched 1 row
SELECT avg(grade_in_prev_course)
FROM mldemo.lr_input
WHERE NOT got_top_marks;
(sum(grade_in_prev_course)) /(NULL_IF(DOUBLE(COUNT(grade_in_prev_course)), (0)))
--------------------------------------------------------------------------------
0.9622195985832349
Fetched 1 row
You can also query to find the rows that are anomalies. In most cases, students achieved top marks in their class despite having minimal study hours and low grades in their previous classes.
SELECT *
FROM mldemo.lr_input
WHERE got_top_marks != lr1(hours_studied, grade_in_prev_course)
LIMIT 10;
hours_studied grade_in_prev_course got_top_marks
-------------------------------------------------------
1 1 true
1 1 true
1 4 false
1 1 true
1 2 true
1 0 true
1 2 true
1 1 true
1 2 true
1 1 true
Fetched 10 rows
For details about using this model type, see Logistic Regression.
Support Vector Machine
Support vector machine (SVM) tries to find a hyperplane to divide data into two classes with these objectives:
- The hyperplane has the maximum margin, meaning the most distance between the two classes.
- The model correctly classifies the highest percentage of data points.
By default, SVM models in Ocient try to balance these two objectives. You can increase the priority of either objective by using the regularizationCoefficient option.
For data sets with two features, the SVM hyperplane represents a straight line on a Cartesian graph.For three features, the hyperplane represents a plane.
This example uses the SVM model on the same classification data set of student performance as used in the logistic regression section.
CREATE MLMODEL svm1 TYPE SUPPORT VECTOR MACHINE ON (
SELECT *
FROM mldemo.lr_input
) options('metrics'->'true');
Modified 0 rows
The machine_learning_models and support_vector_machine_models system catalog tables for SVM contain almost the same information as logistic regression.
SELECT name,
on_select,
num_arguments,
negative_case,
positive_case,
correctly_classified
FROM sys.machine_learning_models a,
sys.support_vector_machine_models b
WHERE a.id = b.machine_learning_model_id
AND name = 'svm1';
name on_select num_arguments negative_case positive_case correctly_classified
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
svm1 select * from mldemo.lr_input 2 cast('false' as BOOLEAN) cast('true' as BOOLEAN) 0.8397
Fetched 1 row
The system catalog tables indicate the same percentage of correctly classified rows as the logistic regression model (about 84 percent).
The SVM model can increase the accuracy further by using a kernel. A kernel is a transformation to make data linearly separable. SVM in Ocient allows you to specify any kernel of your choice.
For example, suppose that a circle can separate two classes of data, dividing data points between those inside and outside the circle. The SVM model cannot find a line to divide the two classes very well, or this problem is not linearly separable. However, you can transform the data into three dimensions (three features) to make it linearly separable. This is the idea of a kernel.
You specify the kernel using the options function1, function2, function3, etc. Each function defines the formula for the kernel to create new features. Functions can use features from the original data set, represented by x1, x2, x3, etc.
To explain this, revisit the circle example. Suppose that the circle centered at the point (2, 2) is the class inside. Otherwise, the circle is the class outside. The data that is separable by a circle in two dimensions is linearly separable in three dimensions with the features x1, x2, x1^2 + x2^2.
This statement creates the data and the model.
CREATE TABLE mldemo.circle AS (
SELECT x,
y,
CASE
WHEN SQRT((x -2) ^ 2 + (y -2) ^ 2) <= 1 then 'INSIDE'
ELSE 'OUTSIDE'
END AS label
FROM (
SELECT RAND() * 10 as x,
RAND() * 10 as y
FROM sys.dummy10000
)
);
Modified 10000 rows
CREATE MLMODEL svm2 type SUPPORT VECTOR MACHINE ON (
SELECT x,
y,
label
FROM mldemo.circle
) options(
'metrics'->'true',
'function1'->'x1',
'function2'->'x2',
'function3'->'x1^2 + x2^2'
);
Modified 0 rows
The example includes three functions, meaning that the kernel increases from two to three dimensions. The kernel functions refer to the original feature values x1 and x2. If you examine the correctly_classified value in the machine_learning_models and support_vector_machine_models system catalog tables, you can see a model that classifies nearly everything correctly.
SELECT NAME,
negative_case,
positive_case,
correctly_classified
FROM sys.machine_learning_models a,
sys.support_vector_machine_models b
WHERE a.id = b.machine_learning_model_id
AND name = 'svm2';
name negative_case positive_case correctly_classified
-------------------------------------------------------------------------------------------------------------------------------------------------------------
svm2 cast('OUTSIDE' as CHAR) cast('INSIDE' as CHAR) 0.9913
Fetched 1 row
One advantage of SVM kernels is that they are automatic. The model saves all the kernel information and transforms the data as needed. Hence, you can execute this model to predict a new point without redefining the kernel. You can pass in the two features the model was built over, and a prediction comes out.
The sys.machine_learning_model_options table is the location where the SVM model saves kernel information.This table saves all the options that you specify. Some models also add additional keys to this dictionary to save certain information needed for when the model is executed.
SELECT svm2(2, 2);
COALESCE(('INSIDE'), (NULL));
---------------------------------------------
INSIDE
Fetched 1 row
For details about using this model type, see Support Vector Machine.
Binary Classification with Neural Networks
A neural network using the FEEDFORWARD NETWORK model type can perform binary classification. To use this model for classification, you must either convert labels to either 0 and 1 (using the log_loss option) or convert labels to -1 and 1 (using the hinge_loss option). Both options can achieve the same result.
This model can handle more complex cases, but using a different model for linearly separable data might be easier. In other words, a neural network model is useful when data is difficult to separate linearly using a kernel.
This example shows a feedforward network model using the log_loss option. Note the CASE statement to convert labels to 0 or 1.
CREATE MLMODEL logloss TYPE FEEDFORWARD NETWORK ON (
SELECT x,
y,
CASE
WHEN label = 'INSIDE' THEN 0
ELSE 1
END AS target
FROM mldemo.circle
) options(
'metrics'->'true',
'hiddenLayers'->'2',
'hiddenLayerSize'->'2',
'outputs'->'1',
'lossFunction'->'log_loss'
);
Modified 0 rows
This query finds that the classification accuracy is about 97 percent.
SELECT COUNT(*) / 10000.0
FROM (
SELECT x,
y,
label,
CASE
WHEN logloss(x, y) = 0 then 'INSIDE'
ELSE 'OUTSIDE'
END AS predicted
FROM mldemo.circle
)
WHERE label = predicted;
(_count(*)_0)/((10000.0))
--------------------------
0.9698
Fetched 1 row
To use the hinge_loss option, use -1 and 1 in the CASE statement.
CREATE MLMODEL hingeloss TYPE FEEDFORWARD NETWORK ON (
SELECT x,
y,
CASE
WHEN label = 'INSIDE' THEN -1
ELSE 1
END AS target
FROM mldemo.circle
) options(
'metrics'->'true',
'hiddenLayers'->'2',
'hiddenLayerSize'->'2',
'outputs'->'1',
'lossFunction'->'hinge_loss'
);
Modified 0 rows
The accuracy of the hinge_loss version is exactly the same as the log_loss version.
SELECT count(*) / 10000.0
FROM (
SELECT x,
y,
label,
CASE
WHEN hingeloss(x, y) = -1 THEN 'INSIDE'
ELSE 'OUTSIDE'
END AS predicted
FROM mldemo.circle
)
WHERE label = predicted;
(_count(*)_0)/((10000.0))
--------------------------
0.9698
Fetched 1 row
For details about using this model type, see Feedforward Neural Network.
K-Nearest Neighbors
K-nearest neighbors (KNN) can only use numeric features, but KNN handles multi-classification in the Ocient System. This algorithm can handle an arbitrary number of target classes.
One drawback with KNN is that it does not work at training time other than to take a snapshot of the training data. The algorithm operates on the data when you execute the model, so it can be a slow model for large data sets.
To demonstrate KNN functionality, this example uses a similar circle of size (2, 2) with some modifications to use more classes:
- Data within a unit circle of (2, 2) is classified as
INSIDE.
- Data within a radius of 2 is classified as
MIDDLE.
- Any other data is classified as
OUTSIDE.
CREATE TABLE mldemo.circles as (
SELECT x, y,
CASE
WHEN SQRT((x -2) ^ 2 + (y -2) ^ 2) <= 1 THEN 'INSIDE'
WHEN SQRT((x -2) ^ 2 + (y -2) ^ 2) <= 2 THEN 'MIDDLE'
ELSE 'OUTSIDE'
END AS label
FROM (
SELECT RAND() * 10 as x,
RAND() * 10 as y
FROM sys.dummy10000
)
);
Modified 10000 rows
KNN works by finding the k nearest points to the one the model is trying to predict a classification for. k is an integer value that you must specify during model creation.
The model calculates distances using normal Euclidean distances. You can override the definition of distance by specifying a formula using the distance option.
After the model identifies the points, it computes a score for each class and returns the class with the highest score. The weight option can influence each score for the specified point. By default, the weight value is based on the inverse of the distance (1.0/(d+0.1)). In other words, a point closer to the point you are trying to predict has a larger influence on the output result than a point further away. Any point outside of the nearest k points has no influence.
This example uses three classes.
CREATE MLMODEL knn1 TYPE K NEAREST NEIGHBORS ON (
SELECT x,
y,
label
FROM mldemo.circles
) options('k'->'3');
Modified 0 rows
These queries test some values to see if the classifications are correct.
SELECT knn1(2, 2);
---------------------------------------------
INSIDE
Fetched 1 row
SELECT knn1(0.5, 2);
---------------------------------------------
MIDDLE
Fetched 1 row
SELECT knn1(5,5);
---------------------------------------------
OUTSIDE
Fetched 1 row
KNN models perform work when you execute the model, and the amount of work increases as the input data increases.
One solution to this is data reduction, which means finding a smaller subset of rows to build the KNN model over, enough to ensure that the accuracy of the model is still high. This method front-loads the work when you create the model so that it is much faster to execute on new data.
To use the data reduction feature, set the dataReduction option to true. By default, the execution stops when the database finds a model that is 90 percent accurate over the training data or when the reduced model increases to 1,000 rows of data. You can override these defaults by setting the maxReducedRows or targetAccuracy options.
Additionally, if you are using data reduction, you can ask the database to collect metrics on the percent of rows correctly classified by the model because this is much faster to do on the reduced data set.
This example builds another KNN model over the same data and uses data reduction to find a model that is at least 95 percent accurate.
CREATE MLMODEL knn2 TYPE K NEAREST NEIGHBORS ON (
SELECT x,
y,
label
FROM mldemo.circles
) options(
'k'->'3',
'dataReduction'->'true',
'metrics'->'true',
'targetAccuracy'->'0.95'
);
Modified 0 rows
Query the machine_learning_models and k_nearest_neighbor_models system catalog tables to see the snapshot table used for the model and its accuracy.
SELECT table_name,
correctly_classified
FROM sys.machine_learning_models a,
sys.k_nearest_neighbor_models b
WHERE a.id = b.machine_learning_model_id
AND name = 'knn2';
table_name correctly_classified
-------------------------------------------------------------------
temp.knn86947930347148676 0.9541
Fetched 1 row
The model with data reduction correctly classifies about 95 percent of the training data. To see the size of the KNN model snapshot, you can query it by its table name.
SELECT COUNT(*) FROM temp.knn86947930347148676;
count(*)
--------------------
300
Fetched 1 row
The snapshot is only 300 rows. Without data reduction, the row count would be much larger by orders of magnitude.
For details about using this model type, see K Nearest Neighbor Classification.
Multi-Class Classification with Neural Networks
The FEEDFORWARD NETWORK model type can also perform multi-class classification when the feature classes are numeric. Multi-class classification works by using vectors, where each position in the vector indicates the score for that class. The argmax value of the score of each vector returns the class index.
Much of this example is similar to the previous Binary Classification with Neural Networks tutorial, with these modifications to the model options:
- Set the
lossFunction option to cross_entropy.
- Set the
outputs option to specify the number of classes for the model to use (in this example: 3).
- Set the
useSoftMax option to true, which applies a softmax function to the final vector.
When the loss function is cross_entropy_loss , the option useSoftMax defaults to true.
CREATE MLMODEL crossentropy TYPE FEEDFORWARD NETWORK ON (
SELECT x,
y,
CASE
WHEN label = 'INSIDE' THEN { { 1.0,
0.0,
0.0 } }
WHEN label = 'MIDDLE' THEN { { 0.0,
1.0,
0.0 } }
ELSE { { 0.0,
0.0,
1.0 } }
END AS TARGET
FROM mldemo.circles
) options(
'metrics'->'true',
'hiddenLayers'->'2',
'hiddenLayerSize'->'2',
'outputs'->'3',
'lossFunction'->'cross_entropy_loss',
'useSoftMax'->'true'
);
Modified 0 rows
Execute this model to return a vector.
SELECT crossentropy(5,5) AS predicted;
predicted
--------------------------------------------------------------------------------
[[1.3436428911280516E-7, 1.811726873154417E-7, 0.9999996844630236]]
Fetched 1 row
To get the classification, use the VECTOR_ARGMAX function to retrieve the index of the largest value. This function returns the 1‑based index of the largest value in the vector.
SELECT VECTOR_ARGMAX(crossentropy(5,5)) AS predicted;
predicted
--------------------
3
Fetched 1 row
This query finds the overall accuracy of the model.
SELECT count(*) / 10000.0
FROM mldemo.circles
WHERE (
label = 'INSIDE'
AND VECTOR_ARGMAX(crossentropy(x, y)) = 1
)
OR (
label = 'MIDDLE'
AND VECTOR_ARGMAX(crossentropy(x, y)) = 2
)
OR (
label = 'OUTSIDE'
AND VECTOR_ARGMAX(crossentropy(x, y)) = 3
);
(_count(*)_0)/((10000.0))
--------------------------
0.9298
Fetched 1 row
The model is about 93 percent effective. You can improve the accuracy by modifying the Model Options, including:
hiddenLayers
hiddenLayerSize
lossFuncNumSamples
Increasing the resources of these options can improve accuracy, but could increase the time needed to train the model.
For details about using this model type, see Feedforward Neural Network.
Multi-Class Classification with Support Vector Machines and Logistic Regression
Both logistic regression and SVM are binary classifiers, but you can use either to build a multi-class classifier to handle any number of classes. OcientML can perform this automatically with either model.
This example shows the CREATE statement for the SVM model, which uses the same data as the K Nearest Neighbors and Multi-Class Classification with Neural Networks examples.
CREATE MLMODEL circ1 TYPE SUPPORT VECTOR MACHINE ON (
SELECT x,
y,
label
FROM mldemo.circles
) options(
'metrics'->'true',
'function1'->'x1',
'function2'->'x2',
'function3'->'x1^2 + x2^2'
);
Modified 0 rows
This statement uses the same kernel as the example for performing binary classification on Support Vector Machines.
This query checks the metrics in the machine_learning_models and support_vector_machine_models system catalog tables to see the accuracy.
SELECT correctly_classified
FROM sys.machine_learning_models a,
sys.support_vector_machine_models b
WHERE a.id = b.machine_learning_model_id
AND name = 'circ1';
correctly_classified
----------------------
0.9579
Fetched 1 row
This model is slightly more accurate than the neural network classifier.
This example tries logistic regression.
CREATE mlmodel circ2 TYPE LOGISTIC REGRESSION ON (
SELECT x,
y,
label
FROM mldemo.circles
) options('metrics'->'true');
Modified 0 rows
SELECT correctly_classified
FROM sys.machine_learning_models a,
sys.logistic_regression_models b
WHERE a.id = b.machine_learning_model_id
AND NAME = 'circ2';
correctly_classified
----------------------
0.9028
Fetched 1 row
Logistic regression shows lower accuracy than the neural network model.
These classification models can handle non-numeric data.
Naive Bayes
The first of these models is naive Bayes. This model can handle binary or multi-class classification and features of any data type.
By default, naive Bayes treats each feature as a discrete value. For example, if you assign features with values of 1, 2, 3, and 4, the model does not give any numeric significance to them. In this case, this model cannot predict the class for the sample with a feature value of 3.5 because that value is not explicitly assigned as a feature.
However, the naive Bayes model can define certain input features as continuous features. These should be numeric values. In this case, the naive Bayes model finds the best-fit normal distribution for that feature to do the probability calculations.
Naive Bayes models essentially use the probability rules of Bayes’ Theorem to compute the most likely class based on the training data. The rules are only correct if the features are independent (uncorrelated). However, in practice, the models tend to work well with real-world data.
This example uses this simple table to demonstrate naive Bayes models.
SELECT * FROM mldemo.cars;
color type origin stolen
----------------------------------------------------------------------------------------------------------------------------------------------
yellow sports domestic false
yellow sports imported true
yellow suv imported false
red sports domestic false
yellow suv domestic false
red suv imported false
red sports domestic true
yellow suv imported true
red sports imported true
red sports domestic true
Fetched 10 rows
color, type, and origin are the input features, which are all strings. true and false are the class labels. In this case, all of the input features are discrete.
This example builds the naive Bayes model.
CREATE MLMODEL nb1 TYPE NAIVE BAYES ON (
SELECT color,
type,
origin,
stolen
FROM mldemo.cars
) options('metrics'->'true');
Modified 0 rows
This query examines the information in the machine_learning_models and naive_bayes_models system catalog tables for naive Bayes models.
SELECT NAME,
on_select,
num_arguments,
result_probability_table,
feature_result_matrix_table,
correctly_classified
FROM sys.machine_learning_models a,
sys.naive_bayes_models b
WHERE a.id = b.machine_le arning_model_id
AND NAME = 'nb1';
name on_select num_arguments result_probability_table feature_result_matrix_table correctly_classified
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
nb1 select color, "type", origin, stolen from user.nb_input 3 temp.rpt83581840700868656 temp.frm83581860860782808 0.8
Fetched 1 row
In the model, the correctly_classified value indicates the accuracy is 80 percent, meaning just two rows are classified incorrectly. The two table names in the temp schema hold all the necessary data for executing the model as a function.
This query calls the model to classify new data.
SELECT nb1('red', 'suv', 'domestic') AS prediction;
prediction
-----------
false
Fetched 1 row
Here is another example with a different data set to show how continuous features work with naive Bayes models.
SELECT * FROM mldemo.golf;
humidity play
--------------------------
70 false
75 true
86 true
95 false
80 true
85 false
96 true
90 true
91 false
90 false
70 true
70 true
80 true
65 true
Fetched 14 rows
In this case, the only input feature is a continuous feature, but you can have multiple continuous features, and you can have a mix of continuous and discrete features.
To build a naive Bayes model over this data, you must set which input features are continuous in the model.
CREATE MLMODEL nb2 TYPE NAIVE BAYES ON (
SELECT *
FROM mldemo.golf
) options('continuousFeatures'->'1', 'metrics'->'true');
Modified 0 rows
The continuousFeatures option is a comma-separated list (no spaces) of input column indexes that are continuous. Indexes start at 1.
This query checks the correctly_classified value in the machine_learning_models and naive_bayes_models system catalog tables.
SELECT correctly_classified
FROM sys.machine_learning_models a,
sys.naive_bayes_models b
WHERE a.id = b.machine_learning_model_id
AND name = 'nb2';
correctly_classified
----------------------
0.7142857142857143
Fetched 1 row
Naive Bayes is a powerful classification algorithm, but it operates as a black box, making it hard to assess how it makes classification decisions.
For details, see Naive Bayes Classification.
Decision Trees
Unlike naive Bayes, the decision tree model allows you to see how the model works and makes its decisions.
A decision tree branches based on the value of a single feature, while the leaves of the tree are the output classes. For SQL syntax, this model uses multiple nested CASE statements. You can examine the whole nested CASE statement that defines the model, but it might be very long.
The decision tree model in Ocient has limited support for continuous features. The model finds the average value of the feature and allows a two-way branch based on whether the value is lower or higher than the average.
Creating a decision tree model is exactly like a naive Bayes model, except with a different type. Decision trees also use the continuousFeatures option, which works the same way.
These decision tree examples use the same data sets as the naive Bayes examples.
CREATE MLMODEL dt1 TYPE DECISION TREE ON (
SELECT color,
"type",
origin,
stolen
FROM mldemo.cars
) options('metrics'->'true');
Modified 0 rows
This query examines the machine_learning_models and decision_tree_models system catalog tables.
SELECT correctly_classified
FROM sys.machine_learning_models a,
sys.decision_tree_models b
WHERE a.id = b.machine_learning_model_id
AND name = 'dt1';
correctly_classified
----------------------
0.8
Fetched 1 row
This model has the same 80 percent accuracy as the naive Bayes model.
This query examines the CASE statement that defines the model.
SELECT case_statement
FROM sys.machine_learning_models a,
sys.decision_tree_models b
WHERE a.id = b.machine_learning_model_id
AND name = 'dt1';
case_statement
---------------------------------------------
CASE
WHEN x3 = cast('domestic' as CHAR) THEN CASE
WHEN x1 = cast('red' as CHAR) THEN CASE
WHEN x2 = cast('sports' as CHAR) THEN cast('true' as BOOLEAN)
ELSE cast('false' as BOOLEAN)
END
WHEN x1 = cast('yellow' as CHAR) THEN cast('false' as BOOLEAN)
ELSE cast('false' as BOOLEAN)
END
WHEN x3 = cast('imported' as CHAR) THEN CASE
WHEN x2 = cast('suv' as CHAR) THEN CASE
WHEN x1 = cast('red' as CHAR) THEN cast('false' as BOOLEAN)
WHEN x1 = cast('yellow' as CHAR) THEN cast('true' as BOOLEAN)
ELSE cast('false' as BOOLEAN)
END
WHEN x2 = cast('sports' as CHAR) THEN cast('true' as BOOLEAN)
ELSE cast('false' as BOOLEAN)
END
ELSE cast('false' as BOOLEAN)
END
Fetched 1 row
Like with the naive Bayes model, the decision tree model can predict new data.
SELECT dt1('red', 'suv', 'domestic') AS prediction;
prediction
-----------
false
Fetched 1 row
This example builds a model over the golfing data.
CREATE mlmodel dt2 TYPE DECISION TREE ON (
SELECT *
FROM mldemo.golf
) options('continuousFeatures'->'1', 'metrics'->'true');
Modified 0 rows
Review the machine_learning_models and decision_tree_models system catalog tables to see the accuracy of this model.
SELECT correctly_classified
FROM sys.machine_learning_models a,
sys.decision_tree_models b
WHERE a.id = b.machine_learning_model_id
AND name = 'dt2';
correctly_classified
----------------------
0.7142857142857143
Fetched 1 row
Again, the decision tree shows similar accuracy to the naive Bayes model.
Machine Learning Model Functions
Classification Models