Feature selection using decision trees
One of the key differentiators in any data science problem is the quality of feature selection and feature importance estimation. When a large number of variables is available to the model, feature selection becomes unavoidable, both because of computational constraints and because eliminating noisy variables leads to better predictions.
Also, for interpretability purposes, the data scientist inevitably arrives at the point where the usefulness of each variable needs to be measured.
This article describes some feature selection techniques inspired by decision tree induction algorithms.
The easiest one to understand, and probably the most straightforward, is to rank features by the sum of the improvements over all nodes in which the attribute appears as a splitter, weighted by the fraction of the training data in each node split. Surrogate splitters (used in the CART implementation as a way to deal with missing values in the data) are also considered while calculating variable importance, which means that even a variable that never splits a node may be assigned a large importance score. This allows the variable importance rankings to reveal variable masking and nonlinear correlation among the attributes. Importance scores may optionally be confined to splitters; comparing the splitters-only ranking with the full (splitters and surrogates) ranking is a useful diagnostic.
This kind of variable ranking is heavily influenced by the splitting criterion used and, in the presence of highly correlated features, is prone to producing different rankings when run multiple times.
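For a concrete (if simplified) illustration, scikit-learn's CART implementation exposes this kind of ranking through feature_importances_. Note that scikit-learn does not implement surrogate splits, so the sketch below corresponds only to the splitters-only variant; the dataset and settings are arbitrary choices for the example.

```python
# Splitters-only variant of the CART importance described above, sketched with
# scikit-learn (which does not implement surrogate splits). Dataset is arbitrary.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# feature_importances_ sums, per feature, the impurity improvement of every node
# that the feature splits, weighted by the fraction of training samples reaching
# that node, and normalises the result to sum to 1.
ranking = np.argsort(clf.feature_importances_)[::-1]
for idx in ranking[:5]:
    print(f"{data.feature_names[idx]}: {clf.feature_importances_[idx]:.3f}")
```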
Another fairly intuitive, somewhat naive approach to feature selection exploits the greedy nature of the tree-growing algorithm (C4.5). Since the best splits are made early while growing the decision tree, only the features that appear up to depth d (usually set to 3) in a tree built on a small bootstrapped subset (~10%) of the data are considered important.
This is repeated k times with a freshly bootstrapped sample each time; the hypothesis is that if a feature appearing in the deeper levels of one execution of C4.5 is relevant enough, it will eventually rise and appear in one of the top levels of the tree in some other execution. The union of the attributes collected from all runs is taken as the set of selected features.
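A rough sketch of this procedure is below, with scikit-learn's CART standing in for C4.5 (which scikit-learn does not implement); the subset fraction, depth d and number of repetitions k are just the illustrative values mentioned above.

```python
# Repeated shallow-tree heuristic: union of features appearing up to depth d in
# trees grown on small bootstrapped subsets. CART stands in for C4.5 here.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
k, d, frac = 20, 3, 0.10          # repetitions, depth cutoff, subset fraction

selected = set()
for _ in range(k):
    # Bootstrap ~10% of the rows (sampling with replacement).
    idx = rng.choice(len(X), size=int(frac * len(X)), replace=True)
    shallow = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X[idx], y[idx])
    # With max_depth=d, every split lies within the first d levels, so every
    # feature used by the tree qualifies.
    used = shallow.tree_.feature[shallow.tree_.feature >= 0]
    selected.update(int(f) for f in used)

print(sorted(selected))
```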
Another method is to use the Separability of Split Value (SSV) criterion for feature selection. The algorithm for feature selection from a single SSV tree works as follows:
1. T ← the SSV decision tree built for X, Y.
2. For each non-leaf node N of T , G(N) ← the classification error reduction of node N.
3. F ← the set of all the features of the input space.
4. i ← 0
5. While F ≠ ∅ do:
(a) For each feature f ∈ F not used by T define its rank R(f) ← i. Remove these features from F.
(b) Prune T by deleting all the final splits of nodes N for which G(N) is minimal.
(c) Prune T by deleting all the final splits of nodes N for which G(N) = 0.
(d) i ← i + 1
6. Return the list of features in decreasing order of R(f).
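The ranking loop above can be approximated in code. The sketch below is only an approximation: a scikit-learn CART tree stands in for a true SSV tree (the SSV split criterion itself is not implemented), G(N) is estimated as the reduction in majority-class misclassifications achieved by the split at N, and the dataset is an arbitrary placeholder.

```python
# Approximate SSV-style single-tree feature ranking: iteratively prune the
# weakest "final splits" and rank a feature by how long it survives in the tree.
# A CART tree stands in for an SSV tree; G(N) is the majority-class error
# reduction of the split at node N.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_
left, right = tree.children_left, tree.children_right

def misclassified(n):
    # Training samples in node n that do not belong to its majority class
    # (written so it works whether tree_.value holds counts or class fractions).
    v = tree.value[n][0]
    return tree.n_node_samples[n] * (1.0 - v.max() / v.sum())

internal = [n for n in range(tree.node_count) if left[n] != -1]
G = {n: misclassified(n) - misclassified(left[n]) - misclassified(right[n])
     for n in internal}

active = set(internal)                     # internal nodes of the pruned tree
remaining = set(range(X.shape[1]))         # features not yet ranked
rank, i = {}, 0
while remaining:
    used_now = {int(tree.feature[n]) for n in active}
    for f in remaining - used_now:         # 5(a): features no longer used by T
        rank[f] = i
    remaining &= used_now
    if not active:
        break
    # "Final splits": active nodes with no active children.
    final = [n for n in active if left[n] not in active and right[n] not in active]
    g_min = min(G[n] for n in final)
    active -= {n for n in final if G[n] == g_min}   # 5(b): prune minimal-G splits
    final = [n for n in active if left[n] not in active and right[n] not in active]
    active -= {n for n in final if G[n] == 0}       # 5(c): prune zero-gain splits
    i += 1                                          # 5(d)

ranking = sorted(rank, key=rank.get, reverse=True)  # step 6: decreasing R(f)
print(ranking[:10])
```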
Because the decision tree building algorithm selects splits locally, i.e. with respect to the splits selected in earlier stages, the features occurring in the tree tend to be complementary. In some cases, a full classification tree uses only a small part of the features. The method therefore cannot select an arbitrary number of features: the maximum is the number of features used by the tree, since the algorithm gives no information about the ranking of the remaining features. Also, good but masked features suffer under this kind of strict ranking criterion.
Moving on to slightly more complex techniques, ensembles of trees play an important role. A fairly intuitive algorithm for obtaining variable importance from Random Forests first grows a complete forest, each tree on a bootstrapped sample containing, on average, about 63% of the unique training examples. Then, to evaluate the importance of a variable v, the OOB (Out Of Bag) data is run down each decision tree of the ensemble, with a random assignment to one of the child nodes whenever a split is on v.
The prediction error of this noised-up ensemble is recorded, which gives the increase in prediction error caused by "noising up" the variable v, and this increase serves as the relative importance of the feature. One caveat, however, is that this sort of variable importance is dominated by the majority class. Also, the importance of highly correlated variables is overestimated by this technique.
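scikit-learn does not expose the per-tree OOB "noising up" procedure directly; the closely related permutation importance, which shuffles a variable on held-out data instead of randomising the split direction on OOB samples, measures the same kind of error increase and is sketched below as a stand-in.

```python
# Stand-in for the OOB "noising up" importance: permutation importance shuffles
# each variable on held-out data and records the resulting increase in error.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

# Larger mean importance = larger drop in accuracy when the variable is noised up.
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:5], result.importances_mean[ranking[:5]])
```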
Takeaway
Although the techniques described above don't compete with some of the best feature selection techniques developed to date, they still work well in practice, often yielding a set of extremely useful features that can significantly increase a model's predictive accuracy. They also help in gaining insight into the nature of the problem and ease interpretation through the variable ranking.
Interested in learning more about such niche techniques? Check out http://research.busigence.com