# Optimising Your Training and Testing Set For ML Algorithms

March 12, 2022

In the last few articles we trained our models using a training set and checked whether they had learned anything using a testing set. But we didn’t discuss what the training and testing sets should look like, why they matter, or how the training process actually works. In this article we will discuss the training data and the testing data.

So let’s go through them one by one.

## Training Data:

Assume that you are learning mathematics in school. School is a place where you learn, train yourself, and update or enhance your skills. In this case you are training yourself to solve certain mathematical problems. You work through the solved examples in the textbook, go through different methods, and train yourself to recognise a problem and apply a suitable method to solve it. The solved examples are nothing but the training data. The problem is the input and the right answer is the output. Here you don’t guess the answer but try to find the right answer. Every input problem has a valid right answer, and you are supervised to find it.

The collection of all the samples we’re going to learn from, along with their labels (answers), is called a training set. A training set generally consists of many diverse examples. Using the training set, our model learns to guess the output. If the guessed output is right we move on to the next data point; otherwise we feed in the right output and continue the process. The flow below shows how models are trained.

We feed each training sample to the model and it tries to predict the right output, or label (I will use output and label interchangeably, as they mean the same thing). If the prediction is correct we move on to the next training sample; if it is wrong, we correct it with the true label and then move on. As training progresses, the model’s internal variables, which help it predict the right label, get updated with every right and wrong prediction.

Each time we run through a complete training set, we say that we have trained for one epoch. We usually run through many such epochs so the system sees every sample many times.
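The loop described above, epochs included, can be sketched with a tiny perceptron. The data and all names here are made up for illustration; this is a minimal sketch of the predict-compare-update cycle, not a production training routine.

```python
# A minimal sketch of the train-predict-correct loop described above,
# using a perceptron on toy data (all names and numbers are illustrative).

def train(samples, labels, epochs=10, lr=0.1):
    """Predict, compare with the true label, and nudge the internal
    variables (weights) whenever the guess is wrong."""
    w = [0.0] * len(samples[0])  # internal variables the model updates
    b = 0.0
    for epoch in range(epochs):          # one epoch = one full pass over the set
        for x, y in zip(samples, labels):
            guess = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            error = y - guess            # 0 if correct, +/-1 if wrong
            if error != 0:               # wrong guess: move toward the right label
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
    return w, b

# Toy, linearly separable data: label 1 when the first feature is positive.
X = [(2.0, 1.0), (1.5, 0.5), (-1.0, 0.3), (-2.0, 1.2)]
y = [1, 1, 0, 0]
w, b = train(X, y)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in X]
print(predictions)  # → [1, 1, 0, 0]
```

After a few epochs the updates stop, because every guess on the training set is already correct.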

After training the model it’s time to check if our model is predicting labels accurately. To do this we need a testing dataset.

## Testing Data:

Before we begin, here is a story. Once upon a time, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks, so researchers trained a neural network using standard supervised-learning techniques. They had 200 photos, split into 100 training photos and 100 testing photos. In both the training and testing sets, 50% of the photos contained camouflaged tanks among trees and the other 50% contained trees with no tank. The researchers trained the network on the first 100 photos, then ran it on the remaining 100; without further training, the neural network classified all of them correctly. Success confirmed!

The researchers handed the finished work to the Pentagon, which soon returned it, complaining that in its own tests the neural network did no better than chance at discriminating photos. It turned out that in the researchers’ dataset, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days instead of distinguishing camouflaged tanks from empty forest.

Hence, we have to be very careful while training machine learning models. As in the previous example, if our training dataset is not diverse, the model may latch onto spurious patterns and overfit. To avoid this, we need some measure other than performance on the training set to predict how well our system will do once we deploy it.

It would be great if there were some algorithm or formula that told us how good our model is. But there isn’t. So we have to do it the way scientists do: run experiments and see what actually happens in the real world. We must run experiments to see how well our systems perform.

To achieve this we give new, unseen data to our model and see how well it does. This unseen data is nothing but our testing set.

We never learn from test data. By now you know that the more data points in the training set, the higher the accuracy. Now you might think: okay, let’s train our model on the entire dataset, then split the dataset into a training and testing set, and evaluate the model on the testing set. This can indeed give you near-perfect accuracy, but it is what we call cheating: the model has already seen every test sample. If you deploy your model now, it will not give you good results.

It is the same as mugging up all the solutions to the mathematics problems, getting good marks in the final exams, and then failing all the entrance exams.

If we take the example of our school, testing is similar to the final exams, and the marks tell us how much we have understood. Of course, if we already know the questions and their solutions beforehand, we will perform well. But again, this is the same cheating as before, where you trained your model on the entire dataset.
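This “mugging up” effect is easy to demonstrate with a model that simply memorises its training data. The sketch below uses a 1-nearest-neighbour classifier on toy, deliberately patternless data (all names and numbers are illustrative): it scores perfectly on the samples it has seen and no better than chance on new ones.

```python
# Why evaluating on data the model has already seen is cheating:
# a 1-nearest-neighbour model memorises the training set, so it
# scores 100% on seen samples regardless of whether it generalises.

def predict_1nn(train_X, train_y, x):
    """Return the label of the closest memorised training point."""
    distances = [sum((a - b) ** 2 for a, b in zip(p, x)) for p in train_X]
    return train_y[distances.index(min(distances))]

def accuracy(train_X, train_y, X, y):
    hits = sum(predict_1nn(train_X, train_y, x) == yi for x, yi in zip(X, y))
    return hits / len(y)

# Toy data with no real pattern to generalise from.
seen_X = [(0.0,), (1.0,), (2.0,), (3.0,)]
seen_y = [0, 1, 0, 1]
unseen_X = [(0.4,), (1.6,), (2.4,), (3.4,)]
unseen_y = [0, 0, 1, 0]

print(accuracy(seen_X, seen_y, seen_X, seen_y))      # → 1.0 (perfect on seen data)
print(accuracy(seen_X, seen_y, unseen_X, unseen_y))  # → 0.5 (chance on unseen data)
```

The 100% score on seen data says nothing about real-world performance; only the score on unseen data does.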

For this reason we split the dataset into a training set and a testing set before training our model. We train the model on the training data and check its performance on the unseen dataset that replicates the real world, i.e. the testing set. We use the testing dataset only once, after training is over, and we must always ensure our model never sees the testing set during training.

The problem of accidentally learning from the test data has its own name: data leakage, also called data contamination. Always make sure the test data is kept separate and that it is used only once, after training has been completed.
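One cheap safeguard, assuming your samples carry unique identifiers, is to check that no identifier appears in both splits before training starts. The IDs below are made up for illustration:

```python
# Sketch of a leakage check: verify the two splits share no sample IDs.
train_ids = {101, 102, 103, 104}  # illustrative sample identifiers
test_ids = {201, 202, 203}

overlap = train_ids & test_ids    # set intersection of the two splits
assert not overlap, f"Data leakage: {overlap} appear in both sets"
print("No overlap between training and testing sets")
```

The check costs one line and catches the most blatant form of contamination before any training time is wasted.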

## Summary:

We often split our original data collection into two pieces: a training set and a testing set. The training set typically consists of 75–80% of the original dataset and the testing set of the remaining 20–25%. During splitting, samples are chosen randomly for each set. Most machine-learning libraries offer routines to perform this splitting process for us.
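For example, scikit-learn provides this as `train_test_split`. A minimal hand-rolled version of the same random split, with illustrative function and variable names, might look like this:

```python
import random

# A minimal sketch of the random train/test split described above;
# libraries such as scikit-learn offer this as train_test_split.
def split_dataset(samples, labels, test_fraction=0.25, seed=42):
    """Randomly assign each sample to the training or the testing set."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)       # random but reproducible order
    n_test = int(len(samples) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([samples[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

X = list(range(100))          # toy dataset of 100 samples
y = [x % 2 for x in X]        # toy labels
(train_X, train_y), (test_X, test_y) = split_dataset(X, y)
print(len(train_X), len(test_X))  # → 75 25
```

Shuffling before splitting matters: if the original dataset is ordered (say, all photos of one class first), a non-random split would put entire classes in only one of the two sets.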