In the past few articles, we stressed that the human brain is very good at recognizing things after seeing them only a few times. For example, if we show a picture of an apple to a three-year-old child and then ask him to classify pictures of apples, he can do the job with ease. He will not classify a tomato as an apple.
But when it comes to machine learning models, we have to train them on thousands of images of apples photographed at various angles, and there is still a one or two percent chance that the model will classify a tomato as an apple.
We normally use a large dataset to train a machine learning model to the desired accuracy. But sometimes we are short of data points. Consider a hypothetical situation where we are exploring the ocean for a rare species of shark. We want to build a classifier, installed in a submarine, that identifies different species in the ocean. Since the species is rare, we obviously will not have many images of the shark, so we want our machine to learn everything possible from every image. We cannot afford to set a few images aside.
If we use all the images for training, we will have no testing set left to check the accuracy of the model. And we cannot reuse the training set as the testing set, because that would be like cheating: the model has already seen those images, so the score would tell us nothing about its real accuracy. This difficulty can be resolved by a technique called cross-validation, or rotational validation.
The core idea is to split the dataset into a temporary training set and a temporary testing set, train the model on the temporary training set, and evaluate it on the temporary testing set. After noting the score, we split the data into a different temporary training and testing set, train the model again, and note the new score. After repeating this step a sufficient number of times, we take the average of all the scores, which denotes the performance of our model.
As an example, assume we have 100 images of the rare shark species. We split this dataset into, say, 4 parts. In the first round we use the last part as the testing set and the first three parts for training; we train the model, check the accuracy on the testing set, and note the score. We then select the 3rd part as the testing set and use the remaining parts for training. We repeat the same procedure until every part has served once as the testing set. The method we just described is called K-fold cross-validation.
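The 4-part procedure above can be sketched in a few lines with scikit-learn. The shark dataset is hypothetical, so this sketch stands in a small synthetic dataset of 100 samples and a logistic-regression classifier; the structure of the loop is the point, not the particular model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Stand-in for the 100 shark images: 100 samples, 5 features each.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Split the dataset into 4 parts (folds).
kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # Train on 3 parts, evaluate on the held-out part.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Every part served exactly once as the testing set; the average
# of the 4 scores denotes the performance of the model.
print("fold scores:", scores)
print("mean accuracy:", np.mean(scores))
```

Note that every image is used for training in three of the four rounds and for testing in exactly one, so nothing is wasted.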
There are two types of cross-validation:
- Exhaustive cross-validation
- Non-exhaustive cross-validation
In exhaustive cross-validation, the dataset is split into training and validation sets in every possible way, and the model is trained and evaluated on each combination of training and testing set.
Exhaustive cross-validation further divides into Leave-p-out cross-validation and Leave-one-out cross-validation.
In Leave-p-out cross-validation, the validation set consists of p observations from the original dataset and the remaining observations form the training set. In Leave-one-out cross-validation, the validation set consists of only one data point, i.e. p = 1.
Both techniques are computationally expensive, but Leave-one-out cross-validation requires less time than Leave-p-out: Leave-p-out requires training and validating the model C(n, p) times ("n choose p"), whereas Leave-one-out requires training and validating it only n times, where n is the total number of observations.
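The split counts above are easy to verify with scikit-learn's `LeaveOneOut` and `LeavePOut` splitters on a toy dataset (the 5-sample dataset here is purely illustrative):

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

# A tiny dataset of n = 5 observations.
X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 1, 1, 1])

loo = LeaveOneOut()   # leaves out 1 observation per split -> n splits
lpo = LeavePOut(p=2)  # leaves out 2 observations per split -> C(n, p) splits

n_loo = loo.get_n_splits(X)  # 5 splits
n_lpo = lpo.get_n_splits(X)  # C(5, 2) = 10 splits
print("Leave-one-out splits:", n_loo)
print("Leave-2-out splits:", n_lpo)
```

With n = 5 and p = 2 the difference is small, but C(n, p) grows combinatorially with n, which is why Leave-p-out quickly becomes impractical on real datasets.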
Non-exhaustive cross-validation methods do not compute all ways of splitting the original sample; they can be seen as approximations of Leave-p-out cross-validation.
Non-exhaustive cross-validation consists of K-fold cross-validation, holdout method and repeated random sub-sampling validation.
We have already gone through K-fold cross-validation. In the holdout method, we randomly assign data points to a training set and a testing set. The size of each set is arbitrary, though typically the training set is larger than the testing set. In typical cross-validation, the results of multiple model-testing runs are averaged together; in contrast, the holdout method, in isolation, involves a single run.
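A minimal sketch of the holdout method, again using scikit-learn and a synthetic stand-in dataset; the 75/25 split ratio is an arbitrary but common choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# One random split: 75% of the points go to training, 25% to testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A single train-and-evaluate run; no averaging over repeated splits.
model = LogisticRegression().fit(X_tr, y_tr)
score = model.score(X_te, y_te)
print("holdout accuracy:", score)
```

Because there is only one split, the score can vary noticeably depending on which points happen to land in the testing set, which is exactly the weakness the repeated methods address.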
In repeated random sub-sampling validation, also known as Monte Carlo cross-validation, multiple random training/testing splits are created. The model is trained and evaluated on each split, and the results are averaged over the splits. Unlike K-fold, the proportion of the training/validation split is independent of the number of partitions, and a data point may land in the testing set of several splits or in none at all.
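This amounts to repeating the holdout method many times and averaging, which scikit-learn provides via `ShuffleSplit`. In this sketch (same synthetic stand-in dataset as before) we choose 10 repetitions with a 20% testing fraction; note the two knobs are set independently:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# 10 independent random splits, each holding out 20% for testing.
# The test fraction stays 20% no matter how many splits we request.
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=ss)

print("per-split accuracies:", scores)
print("mean accuracy:", scores.mean())
```
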
When our dataset is small, we should use a cross-validation technique to estimate the accuracy of our model. The techniques we discussed here are Leave-p-out cross-validation, Leave-one-out cross-validation, K-fold cross-validation, the holdout method, and repeated random sub-sampling validation. Apart from these, other techniques include stratified k-fold cross-validation, time series cross-validation, and nested cross-validation.