Machine Learning: CART

Decision tree nodes split the feature space into disjoint subspaces. Each leaf subspace represents a prediction, and multiple subspaces can end up with the same prediction. The idea is to keep splitting until we reach a leaf that is pure.

So we need to learn the depth of the tree, the value of K for a K-ary tree, the threshold value at each splitting node, and the value predicted at each leaf node. CART stands for Classification And Regression Tree.
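
As a rough sketch of what has to be stored, assuming the usual binary (K = 2) formulation: each internal node keeps a feature index and a threshold, and each leaf keeps a prediction. The names below are illustrative, not from any particular library:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal node: which feature to test and the threshold to split on.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # subtree for feature value <= threshold
    right: Optional["Node"] = None   # subtree for feature value >  threshold
    # Leaf node: the prediction (majority class, or mean target for regression).
    value: Optional[float] = None

    def is_leaf(self) -> bool:
        return self.value is not None
```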

For classification, minimizing the error starts with estimating the probability of each class in a region, typically as the class frequency (which can be smoothed or regularized). So the first question is how to split the root node: which feature to pick and at which threshold, generally the combination with the maximum impact on the whole training set. The quantity used to select the right feature is called information gain: the higher the information gain, the better the feature. Gaining information reduces uncertainty.
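
A minimal sketch of computing information gain for one candidate split, using entropy as the uncertainty measure (the class counts in the toy example are made up):

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class-count distribution."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Toy example: 10 instances (6 positive, 4 negative) split into two branches.
print(information_gain([6, 4], [[5, 1], [1, 3]]))  # larger gain => better split
```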

The most popular measures of uncertainty are the Gini index and entropy; CART uses the Gini index. For K = 2 classes, these measures are zero for a pure node and peak when the classes are split 50/50: entropy ranges from 0 to 1, while the Gini index and the misclassification error range from 0 to 0.5.
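
A quick numeric check of the three measures for a two-class node with positive-class fraction p, using the standard textbook definitions:

```python
import math

gini = lambda p: 1 - (p**2 + (1 - p)**2)         # peaks at p = 0.5 with value 0.5
misclass = lambda p: min(p, 1 - p)               # peaks at p = 0.5 with value 0.5
entropy = lambda p: 0.0 if p in (0, 1) else -(
    p * math.log2(p) + (1 - p) * math.log2(1 - p))  # peaks at p = 0.5 with value 1.0

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p={p:.2f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}  error={misclass(p):.3f}")
```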

The algorithm for CART is brute force: we try a split on each feature at each possible value, quantitatively evaluate the effectiveness of the split by the weighted sum of the children's uncertainty, and continue recursively until a leaf is reached. When entropy is the uncertainty measure, this weighted average over the branches is also known as the conditional entropy.
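
A sketch of one greedy split decision using the Gini index (swap in entropy and the weighted average becomes the conditional entropy); the feature values and labels below are made up:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Try every feature and every observed value as a threshold; keep the split
    with the lowest weighted impurity (equivalently, the highest gain)."""
    best = (None, None, float("inf"))   # (feature, threshold, weighted impurity)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left  = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] >  t]
            if not left or not right:
                continue
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if w < best[2]:
                best = (f, t, w)
    return best

X = [[2.7, 1.0], [1.3, 3.1], [3.6, 0.4], [0.5, 2.8], [3.1, 2.2]]
y = [1, 0, 1, 0, 1]
print(best_split(X, y))   # recurse on each side until the node is pure (a leaf)
```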

With splitting comes the question of bias and variance. We do not want to underfit while keeping the tree small, but high depth increases complexity and creates high variance. Max depth is a hyperparameter for CART. In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins; by contrast, the values of other parameters are derived via training. The other approach is to grow a deliberately overfit tree and then prune it.
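
For example, with scikit-learn (the dataset and numbers are chosen only for illustration) both strategies map onto constructor arguments: cap the depth up front, or grow an unrestricted tree and prune it back with cost-complexity pruning:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameter set before training: limit depth to trade variance for bias.
shallow = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Alternative: grow an unrestricted (overfit) tree, then prune it back.
# ccp_alpha controls the strength of cost-complexity pruning.
pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(shallow.get_depth(), pruned.get_depth())
```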

CART is known to be compact, so a small training set can be sufficient to build a model. Once learning is done, unlike k-NN, the training instances are not needed any further. It is also used as a building block in ensemble methods, most popularly in random forests. The drawback is that it is heuristic: finding the tree that minimizes the error over all possible subspaces is a known NP-hard problem, so to limit the computational cost we resort to greedy approaches.



Ensemble:




To increase the accuracy and reliability of a model we can combine two or more models at different levels of our analysis. We can also train and cross-validate different models and, based on the metrics we use to measure performance, choose the right model for a particular task. The nature of the data is another thing to consider: if the data is limited, that can also limit the use of certain models. Popular examples include Random Forest and XGBoost. In our lecture, we found that you can control the maximum-depth hyperparameter and then further analyse the newly created leaf nodes with a model other than CART.
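
A small sketch of the model-selection side with scikit-learn: cross-validate a single CART against a Random Forest ensemble and pick whichever scores better on the chosen metric (the dataset and scoring choice here are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "cart": DecisionTreeClassifier(max_depth=5),
    "random_forest": RandomForestClassifier(n_estimators=100),
}

# Cross-validate each candidate and compare the metric that matters for the task.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```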




Hyper-parameters:


We are currently looking at 4 to 5 different projects. One of them asks whether the risk factor of certain tumors and melanoma can be confined to a particular location. We can start with one continent as the root node and then split, or we can start with symptoms and then split further on location. Max depth can be a factor depending on how localized we want to make our prediction: should we stop at southern North America, or keep splitting until we reach Laredo, a border city of Texas? The same can be done for other continents.


As you can already see, determining proximity by location will be a challenging task, and the tree can start to get too complex with high branching. Overfitting will certainly be the main problem: for a certain location all symptoms may lead to the assumption that a patient belongs to the vulnerable group, although some patients from the same location with only part of the symptoms may not actually be vulnerable. (A similar idea applies to patients localized around a nuclear radiation site; not all of them will ultimately be cancer victims.)


Overfitting happens due to noise and a lack of training examples. An insufficient number of training records in a region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task. So, if Laredo is absent during training and we encounter a patient from Laredo in the test data, the model is now confused.


Techniques for countering this include tuning hyperparameters and estimating generalization error. The estimation approaches can be optimistic (use the training error as-is), pessimistic (add a complexity penalty to the training error), or based on reduced error pruning with a separate validation set. We can use Occam's razor to prefer the simpler model over a more complex one when both have the same generalization error.
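
A back-of-the-envelope sketch of the optimistic versus pessimistic estimates; the 0.5-error penalty per leaf is one common convention, and the counts below are made up:

```python
def optimistic_error(train_errors, n_train):
    """Optimistic estimate: generalization error equals training error."""
    return train_errors / n_train

def pessimistic_error(train_errors, n_leaves, n_train, penalty=0.5):
    """Pessimistic estimate: add a complexity penalty per leaf
    (0.5 errors per leaf is a common convention)."""
    return (train_errors + penalty * n_leaves) / n_train

# Made-up numbers: a 30-leaf tree with 10 training errors on 1000 records.
print(optimistic_error(10, 1000))        # 0.01
print(pessimistic_error(10, 30, 1000))   # 0.025
```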


A pre-pruning or early-stopping rule stops a tree from growing fully if all instances have the same label or the same feature values. More restrictive conditions can be used: stop if a further split does not reduce impurity, or if some threshold (e.g. on node size or impurity gain) is not reached. The other technique is post-pruning: let the tree grow fully and then prune it bottom-up to improve the generalization error, with the class label of the new leaf determined by the majority class of the instances in the pruned subtree. Some post-pruning methods need an independent data set: the “pruning set”.
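
One possible mapping of these ideas onto scikit-learn (not the only one): the stopping-rule hyperparameters act as pre-pruning, while cost-complexity pruning is the built-in post-pruning variant; reduced error pruning with a separate pruning set is not provided out of the box:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop early if a node is too small or a split gains too little purity.
pre_pruned = DecisionTreeClassifier(
    max_depth=4,                 # hard cap on depth
    min_samples_leaf=5,          # do not create tiny leaves
    min_impurity_decrease=0.01,  # split only if impurity drops by at least this much
).fit(X, y)

# Post-pruning: grow fully, then prune bottom-up with cost-complexity pruning.
path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())
```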


Reduced error pruning repeatedly prunes the node whose removal gives the largest gain (error reduction on the pruning set) until only negative-gain nodes remain; in other words, a subtree is pruned only if replacing it with a leaf does not increase the error, i.e. the node does not contain a subtree with lower error.
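
A sketch of a simplified, bottom-up variant of reduced error pruning on a hand-built toy tree (the tree, thresholds, and pruning set are all made up, and the largest-gain-first ordering described above is replaced by a single bottom-up pass for brevity):

```python
# Toy tree: internal nodes test one feature against a threshold; leaves predict a class.

def predict(node, x):
    if "value" in node:                       # leaf
        return node["value"]
    branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
    return predict(node[branch], x)

def errors(node, data):
    """Misclassified examples from the pruning set under this (sub)tree."""
    return sum(predict(node, x) != y for x, y in data)

def majority(data):
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count) if labels else 0

def reduced_error_prune(node, data):
    """Bottom-up: replace a subtree by a majority-class leaf whenever that
    does not increase the error on the pruning set."""
    if "value" in node:
        return node
    f, t = node["feature"], node["threshold"]
    left_data  = [(x, y) for x, y in data if x[f] <= t]
    right_data = [(x, y) for x, y in data if x[f] >  t]
    node["left"]  = reduced_error_prune(node["left"],  left_data)
    node["right"] = reduced_error_prune(node["right"], right_data)
    leaf = {"value": majority(data)}
    return leaf if errors(leaf, data) <= errors(node, data) else node

tree = {"feature": 0, "threshold": 2.0,
        "left":  {"value": 0},
        "right": {"feature": 1, "threshold": 1.0,
                  "left": {"value": 0}, "right": {"value": 1}}}
pruning_set = [([1.0, 0.5], 0), ([1.5, 2.0], 0),
               ([3.0, 0.5], 1), ([3.0, 2.0], 1), ([2.5, 1.5], 1)]
# The right subtree collapses to a leaf; the root split survives because it helps.
print(reduced_error_prune(tree, pruning_set))
```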




