The Right and the Wrong Way to Do Cross-validation
You might wonder why we need cross-validation in the first place, so let's explain that first. The generalization performance of a machine learning model is its prediction capability on independent test data, and assessing it is of utmost importance.
Cross-validation is a model validation technique for assessing how well the results of a statistical analysis (for example, a classifier) will generalize. It is mainly used when our primary goal is prediction and we want to estimate how accurately the model will perform in practice.
For a model, three sets of data are generally used:
- The training set is used to fit the model.
- The validation set is used to estimate prediction error for model selection.
- The test set is used for assessing the generalization error of the final chosen model.
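For concreteness, here is a minimal sketch of such a three-way split, assuming scikit-learn and a hypothetical NumPy feature matrix `X` with labels `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 1000 features, binary labels
rng = np.random.RandomState(0)
X = rng.randn(100, 1000)
y = rng.randint(0, 2, size=100)

# Hold out 20% as the test set, then carve a validation set
# (25% of the remaining 80%, i.e. 20% overall) out of what is left.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)
```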
In cross-validation, the given dataset is divided into a training part and a test part: a subset is held out to “test” the model during the training phase (this is our validation set), which gives an idea of how the model will generalize to an independent dataset.
There are different cross-validation methods available, such as the hold-out method, leave-one-out, and k-fold cross-validation. I am not going into the details of these methods, as that would distract us from the topic of this article.
Cross-validation is often used together with feature selection. Let us take an example.
Say we have a dataset with 1000 features and 100 samples. A common strategy for feature selection with cross-validation would be as follows:
- Find the subset of 20 features that show the strongest correlation with the class labels.
- Using this subset of 20 features, build a multivariate classifier.
- Then use cross-validation to estimate the prediction error of the final model.
Is this the correct way? No. Let us explain why.
The problem is that the predictors have an unfair advantage, as they were chosen in step 1 on the basis of all the samples. As noted in The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman), leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left-out samples.
In our example, we chose the best 20 of the 1000 features using all 100 samples. We then used 10-fold cross-validation to divide the dataset into 10 subsets of 10 samples each and estimated the prediction error over these subsets using the preselected 20 features.
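Here is a sketch of this incorrect procedure, assuming scikit-learn, with `SelectKBest` standing in for the correlation-based filter and a logistic regression as an illustrative classifier. The labels are random, so an honest error estimate should sit near chance level:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: 100 samples, 1000 features, random binary labels
rng = np.random.RandomState(0)
X = rng.randn(100, 1000)
y = rng.randint(0, 2, size=100)

# WRONG: the 20 "best" features are chosen using ALL 100 samples,
# so every fold's held-out samples have already influenced the selection.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)

scores = cross_val_score(LogisticRegression(max_iter=1000), X_selected, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("Optimistically biased accuracy: %.2f" % scores.mean())
```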
Instead, the correct strategy would be the following:
- Divide the dataset into 10 subsets of 10 samples each, as in 10-fold cross-validation.
- For each group k = 1, 2, …, 10:
- Find the best 20 features using all of the samples except those in group k.
- Using these features, build a multivariate classifier on all samples except those in group k.
- Use this classifier to predict the error on group k.
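A sketch of this corrected procedure, again assuming scikit-learn: wrapping the selector and the classifier in a pipeline makes `cross_val_score` refit the feature selection on each fold's training samples only, which is exactly the loop described above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Same hypothetical data as before: random labels, chance level is ~0.5
rng = np.random.RandomState(0)
X = rng.randn(100, 1000)
y = rng.randint(0, 2, size=100)

# RIGHT: feature selection is part of the pipeline, so in each fold the
# 20 features are chosen from the 90 training samples only, and group k
# is used purely to estimate the error.
pipeline = make_pipeline(SelectKBest(f_classif, k=20),
                         LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("Unbiased accuracy estimate: %.2f" % scores.mean())
```

With random labels, this estimate should hover around 0.5, whereas the biased version above will typically report something noticeably higher.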
Takeaway
The difference between the two strategies is that the samples on which the classifier is to be evaluated (i.e. group k) must be left out during the feature selection step. This ensures that the predictors are not biased and that the estimated prediction error is honest.
This problem often goes unnoticed because the first strategy seems reasonable at a glance, but it should be taken into account carefully: the difference in the estimated error rate between the two strategies can be as large as 50 percent!
Interested in knowing more about such techniques? Check out http://research.busigence.com