Let’s assume that you have never seen a train in your entire life. Now we are in the perfect position to play a game called ‘Find the train’. I will show you one photo of a train, then I’ll show you different photos, and you have to guess which image contains the train. So let’s begin: here is the photo of the train.
In the image you can see that the long, snake-like vehicle is a train. Now below are two images. One has a train and the other doesn’t.
You must have guessed it right in a snap. But how did you do it? Notice that when I told you to focus on snake-like structures, you immediately filtered out other things in the image, like trees, houses etc. You focused on just one thing, the snake-like structure, and registered that our train has a similar shape. You discarded the features that were making the image beautiful, because they were not our concern. Our main aim was to spot the train, so you told your brain to register just that one thing and discard the other, useless features.
Alas, machines are not that smart. They need to be trained on millions of data points to perform the same task with 95% accuracy. Consider the dummy dataset below.
Column K is the output and the remaining columns are inputs. In a previous article we saw that too many features are nothing but a curse. We can also see from the table that many features are similar, and some may not be useful at all, so we should drop those features.
Not removing useless features is like feeding noise or garbage to our machine learning model. Removing them, when the right subset is chosen, reduces training time, helps avoid overfitting, and improves accuracy. This process of removing redundant, irrelevant, or otherwise useless features from a dataset is called feature selection.
In this article we will be discussing a few methods of feature selection. Feature selection methods fall into two families: supervised and unsupervised. Supervised methods are further divided into the intrinsic method, the filter method, and the wrapper method.
In supervised methods, the target variable is used to decide which redundant variables to remove. Inputs are selected specifically to increase the accuracy of the model or to reduce its complexity; the outcome is used to quantify the importance of the input variables.
In unsupervised methods, by contrast, only the input variables are considered. Under the supervised family we will go through the filter method and the wrapper method. Let’s go through them one by one.
A dataset contains various redundant features which must be removed. In the filter method we take a single column as input, check whether there is any relationship between that feature and the target, and calculate a score for it using a statistical measure. We repeat the process for each column, so every feature is assigned a score. After calculating the scores we rank the features, set a threshold value, and remove the features whose score falls below the threshold.
These methods are often univariate and consider features independently or with respect to some dependent variable.
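The score–rank–threshold loop described above can be sketched in a few lines of Python. The column names, the synthetic data, the choice of absolute Pearson correlation as the score, and the 0.3 threshold are all illustrative assumptions, not part of the article's dataset:

```python
# A minimal sketch of the filter method: score each feature against the
# target, rank the scores, and keep the features above a threshold.
# Data, column names, and the 0.3 threshold are made up for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.normal(size=n)
df = pd.DataFrame({
    "useful": 0.8 * target + rng.normal(scale=0.5, size=n),  # related to target
    "noise_1": rng.normal(size=n),                           # pure noise
    "noise_2": rng.normal(size=n),                           # pure noise
})

# Univariate scoring: each column is considered independently.
scores = df.apply(lambda col: abs(np.corrcoef(col, target)[0, 1]))
ranked = scores.sort_values(ascending=False)
print(ranked)

threshold = 0.3
selected = ranked[ranked > threshold].index.tolist()
print("Selected features:", selected)
```

Any of the statistical measures listed below can be swapped in as the scoring function; the rank-and-threshold step stays the same.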
Statistical measures used in filter methods include Pearson’s correlation, Spearman’s correlation, ANOVA, Kendall’s rank correlation, chi-squared, and mutual information. The tree diagram below shows which measure to use depending on the input variable and output variable.
This is how the tree diagram should be read: if the input variable is numerical and the output variable is also numerical, then the statistical measure used is Pearson’s correlation.
Pearson’s correlation is a measure of the strength of association between two variables. It quantifies the linear dependence between two continuous variables X and Y, and its value ranges from -1 to 1. This is the formula for Pearson’s correlation:

r = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / √( Σ(Xᵢ - X̄)² · Σ(Yᵢ - Ȳ)² )

Here X and Y are the variables, and X̄ and Ȳ are their respective means.
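The formula can be applied directly with NumPy and checked against the built-in `np.corrcoef`. The sample values below are made up purely for illustration:

```python
# Computing Pearson's correlation from the formula and checking it
# against NumPy's built-in correlation function. The data is made up.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_dev = X - X.mean()   # deviations from the mean of X
y_dev = Y - Y.mean()   # deviations from the mean of Y
r = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

print(round(r, 4))                         # manual formula -> 0.7746
print(round(np.corrcoef(X, Y)[0, 1], 4))   # NumPy gives the same value
```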
Consider the Boston housing dataset below.
Using Python you can calculate the correlation between the variables. Below is the correlation between the variables of the above dataset, displayed using a heatmap. Note that CAT. MEDV is the target variable and the others are input variables.
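A sketch of how such a correlation matrix is computed with pandas. The DataFrame below is a small made-up stand-in; with the real dataset, `df` would hold the Boston housing columns instead:

```python
# Computing a correlation matrix with pandas. The columns RM, LSTAT, and
# MEDV here are filled with synthetic data that merely mimics the kind of
# relationships seen in the Boston housing dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100
rm = rng.normal(6, 1, n)                 # stand-in for average rooms
lstat = -2 * rm + rng.normal(0, 1, n)    # negatively related to RM
medv = 5 * rm + rng.normal(0, 2, n)      # positively related to RM
df = pd.DataFrame({"RM": rm, "LSTAT": lstat, "MEDV": medv})

corr = df.corr()   # Pearson correlation between every pair of columns
print(corr.round(2))

# To display the matrix as a heatmap (as in the figure), seaborn can be used:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```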
From the above map we can see that MEDV has the strongest correlation with the target variable, i.e. 0.79, followed by RM. If we set the threshold at 0.45, then all the variables except MEDV, RM, and LSTAT are to be dropped.
Now if we look carefully, RM and MEDV themselves share a strong correlation. That means we can drop either RM or MEDV, as they affect the target variable in much the same way; they carry largely the same information.
So which variable is to be dropped?
Look again at the heatmap and find which of the two has the stronger correlation with the target variable. In this case MEDV correlates more strongly with the target variable CAT. MEDV, so we drop RM and train our model using only two variables, LSTAT and MEDV.
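Both steps — thresholding against the target, then dropping the weaker member of a strongly correlated pair — can be sketched as follows. The data is synthetic and the column names, the 0.45 target threshold, and the 0.8 pairwise threshold are illustrative assumptions:

```python
# A sketch of the selection logic above: keep features whose correlation
# with the target clears a threshold, then within the kept set drop the
# feature that is weaker against the target whenever two features are
# highly correlated with each other. Data and thresholds are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
medv = rng.normal(size=n)
rm = 0.9 * medv + rng.normal(scale=0.3, size=n)      # strongly tied to MEDV
lstat = -0.6 * medv + rng.normal(scale=0.7, size=n)  # moderately tied
noise = rng.normal(size=n)                           # irrelevant feature
target = medv + rng.normal(scale=0.4, size=n)        # stand-in for CAT. MEDV

df = pd.DataFrame({"MEDV": medv, "RM": rm, "LSTAT": lstat, "NOISE": noise})

# Step 1: keep features whose |correlation| with the target exceeds 0.45.
with_target = df.apply(lambda col: abs(np.corrcoef(col, target)[0, 1]))
kept = with_target[with_target > 0.45].index.tolist()

# Step 2: among the kept features, if a pair correlates above 0.8,
# drop the one with the weaker correlation against the target.
pairwise = df[kept].corr().abs()
for a in list(kept):
    for b in list(kept):
        if a != b and a in kept and b in kept and pairwise.loc[a, b] > 0.8:
            kept.remove(a if with_target[a] < with_target[b] else b)

print("Selected:", sorted(kept))
```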
This is how we use Pearson’s correlation in feature selection.
Let’s consider a dataset that contains many features. In the wrapper method we feed different combinations of these features to the machine learning algorithm and note the accuracy and error for each combination. The combinations that predict the correct output with maximum accuracy and minimum error are kept, and the rest are discarded.
Under the wrapper method we will discuss forward feature selection and backward feature selection.
Consider a feature subset F, which is initially empty, and a dataset containing, say, ‘n’ features. We feed the first feature from the feature space to the machine learning algorithm and note the error. If the error is below a threshold, we add the feature to the subset F; otherwise we drop it.
We do the same for all n features from the feature space, so the subset F contains only the features whose error was below the threshold. After that we take the feature with the lowest error and try combinations of it with other features from subset F. We feed each combination to the machine learning algorithm and look for the combinations that perform well; again we note the error and discard the combinations that give a higher error. In this way we remove unimportant features from the dataset. This is how forward feature selection works.
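The greedy loop described above can be sketched as follows. The synthetic regression dataset, the linear-regression model, and the use of cross-validated mean squared error as the evaluation measure are all illustrative assumptions:

```python
# A minimal sketch of forward feature selection: start with an empty
# subset, repeatedly add the single feature that most reduces the
# cross-validated error, and stop when no addition improves it.
# Dataset and model are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)

def cv_error(features):
    # 5-fold cross-validated MSE for the given feature subset (lower is better).
    scores = cross_val_score(LinearRegression(), X[:, features], y,
                             scoring="neg_mean_squared_error", cv=5)
    return -scores.mean()

selected, remaining = [], list(range(X.shape[1]))
best_err = float("inf")
while remaining:
    # Try adding each remaining feature and keep the best candidate.
    errs = {f: cv_error(selected + [f]) for f in remaining}
    f_best = min(errs, key=errs.get)
    if errs[f_best] >= best_err:   # no improvement -> stop
        break
    best_err = errs[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print("Selected feature indices:", selected)
```

Since only 2 of the 6 synthetic features are informative, the loop typically stops after picking those two.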
In backward feature selection we start with a feature subset F containing all the features in the dataset. We feed these features to the machine learning model and repeatedly remove the features that contribute least, keeping the subset that gives the best evaluation measure.
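Backward selection can be sketched with scikit-learn’s `SequentialFeatureSelector` using `direction="backward"`, which starts from all features and removes them one at a time. The dataset, the estimator, and the choice to keep exactly 2 features are illustrative assumptions:

```python
# Backward feature selection via scikit-learn's SequentialFeatureSelector:
# start from all 6 features and greedily remove them until 2 remain.
# Dataset, estimator, and the target subset size are illustrative.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)

selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=2,
                                     direction="backward", cv=5)
selector.fit(X, y)
print("Kept feature mask:", selector.get_support())
```

The same class with `direction="forward"` implements the forward variant described earlier.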
If we look at the differences between the filter method and the wrapper method: the former uses statistical measures to remove unwanted features, whereas the latter uses machine learning algorithms.
The filter method is faster than the wrapper method, and it is also less prone to overfitting.
Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]