Regularisation Methods for Regression

June 3, 2022

In the article where we discussed How to Fit Lines Using the Least Square Method, we used a simple residual error, which is nothing but the difference between the observed value and the predicted value, to fit the line. We minimise this error to get the optimal line.

When we have many data points this method works well, but when we have only a few data points it may lead us to overfit. To avoid overfitting we can use regularisation (shrinkage) methods.
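A minimal sketch of this plain least squares fit, assuming NumPy and a handful of made-up data points (the numbers carry no special meaning), looks like this:

import numpy as np

# Hypothetical observed data points, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Ordinary least squares fit of a straight line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Residual error: observed value minus predicted value.
predicted = slope * x + intercept
residuals = y - predicted

# Sum of squared residuals, the quantity least squares minimises.
print(slope, intercept, np.sum(residuals ** 2))

With many points like these the fit is stable; with only two or three points the same procedure can chase the noise, which is exactly the overfitting problem described above.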

img_1

img_2

In this article we will discuss Ridge Regression, Lasso Regression and ElasticNet Regression.

Ridge Regression:

Consider the above image, where we have just two points. In that image the sum of squared errors is zero. But when we compare it with the original data we can see that the line passing through the two points is not the optimal line.

In the least squares method we compute the sum of squared errors and then determine the coefficient values that minimise it. Ridge regression follows the same approach, except that it also penalises large coefficients, shrinking the ones that belong to unimportant variables. To understand this, consider the below equation:

Sum of [(Observed value) – (Predicted value)]² + [lambda * slope²]

The first part of the equation is the same as in the least squares method. The second part tries to shrink the coefficients of the variables so that the entire expression becomes as small as possible; this is the part that decides whether a variable is important or not, and it is called the shrinkage penalty. Without lambda to moderate it, this penalty would push all the coefficients towards zero and damage the model; lambda controls the amount of penalty applied to the variables.
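To make the equation concrete, here is a small sketch of the ridge cost for a one-feature model, written directly from the plain-text formula above; the two-point data set and the candidate slopes are made-up values for illustration only.

import numpy as np

def ridge_cost(slope, intercept, x, y, lam):
    # Sum of squared residuals plus the shrinkage penalty lambda * slope^2.
    predicted = slope * x + intercept
    residual_term = np.sum((y - predicted) ** 2)
    penalty_term = lam * slope ** 2
    return residual_term + penalty_term

# Hypothetical two-point data set, as in the overfitting example.
x = np.array([1.0, 2.0])
y = np.array([1.0, 3.0])

# The exact-fit line (slope 2, intercept -1) has zero residual error but a large penalty;
# a flatter line accepts a little residual error in exchange for a smaller penalty.
print(ridge_cost(2.0, -1.0, x, y, lam=1.0))  # 4.0
print(ridge_cost(1.5, 0.0, x, y, lam=1.0))   # 2.5

The flatter line wins once the penalty is included, which is precisely how the shrinkage term discourages steep, overfitted slopes.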

Again consider the second image. Using the least squares method the model overfits, resulting in the line shown below.

img_2

In this image we can see that the line passes perfectly through the two points. A slight change in the value of x causes a drastic change in the value of y, which means the line has a high slope. This model has low bias and high variance.

Now consider the below image. Here we can see that the line does not overfit and its slope is smaller than that of the previous line. Here the model has slightly higher bias but much lower variance.

img_3

As the value of lambda increases, the slope of the line decreases. The slope does not reach zero even at very high values of lambda; only as lambda approaches infinity does the slope approach zero.

img_4

So, as lambda increases, the value on the Y axis becomes less sensitive to the value on the X axis. Hence the bias increases slightly while the variance falls.
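This trend is easy to verify with scikit-learn, where the ridge penalty lambda is exposed as the alpha parameter; the one-feature data set below is made up for illustration.

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical one-feature training data.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 2.5, 4.5])

# In scikit-learn the lambda of the equation above is called `alpha`.
for lam in [0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=lam).fit(X, y)
    # The fitted slope shrinks towards zero as lambda grows, but never reaches it exactly.
    print(lam, model.coef_[0])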

Lasso Regression:

LASSO stands for Least Absolute Shrinkage and Selection Operator. In Ridge regression the useless features are penalised but never discarded, whereas in Lasso regression these variables are dropped entirely. Lasso regression can be seen as a combination of Ridge regression and the subset selection method [in the subset selection method we use various feature selection techniques to filter out the unwanted features].

The data set is required to satisfy the following assumptions, which are similar to those of simple linear regression:

  • All the predictor variables must be independent of each other.
  • There must be some kind of conditional dependence between the predictor and the predicted variables.
  • All the independent variables must be standardised (see the sketch just below this list).
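The standardisation step mentioned in the last point can be done, for example, with scikit-learn's StandardScaler; this is just one common way to do it, and the feature matrix below is hypothetical.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical predictors on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])

# Rescale each column to zero mean and unit variance before fitting Lasso.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))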

The equation for Lasso regression is given as:

Sum of [(Observed value) – (Predicted value)]² + [lambda * modulus(slope)]

{ modulus(slope) means |slope|; whether the slope is positive or negative, its modulus is always non-negative. }

Looking at the above equation, we can observe that when lambda is zero the overall expression reduces to the sum of squared errors. As lambda increases, more coefficients are set exactly to zero and the useless features are eliminated, which increases the bias.
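A small sketch with scikit-learn's Lasso (where, again, lambda is called alpha) shows this feature elimination; the data are synthetic, and only the second of the three features actually drives the response.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: y depends only on the second of three features.
X = rng.normal(size=(50, 3))
y = 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# As lambda (alpha) increases, more coefficients are driven exactly to zero.
for lam in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=lam).fit(X, y)
    print(lam, model.coef_)

The coefficients of the two useless features are driven to zero as lambda grows, while the useful one merely shrinks.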

As lambda increases the slope decreases, just as we saw in Ridge regression, but in this case there is a kink at zero, and as lambda increases this kink becomes sharper. It is this kink that allows coefficients to be set exactly to zero, so with increasing lambda the bias rises while the variance falls.

img_5

Lasso regression addresses the disadvantages of Ridge regression and of subset selection. It is considered a good regression model, but it has some drawbacks of its own.

Lasso struggles when the number of observations n is smaller than the number of variables, because it can select at most n variables before it saturates. If a group of variables is highly correlated, Lasso tends to pick only one of them and shrink the rest to zero, which can cause issues. Lasso's prediction performance is also dominated by Ridge regression when the predictors are very highly correlated. All of these disadvantages are addressed by ElasticNet Regression.

ElasticNet Regression:

ElasticNet automatically selects variables, performs continuous shrinkage, and can select groups of highly correlated variables together. Mathematically it is defined as:

Sum of [(Observed value) – (Predicted value)]² + [lambda1 * slope²] + [lambda2 * modulus(slope)]

ElasticNet has two parameters instead of just one and these parameters are used for shrinkage and together they are termed as the ElasticNet penalty.

The above equation is often rewritten with a single lambda and a mixing parameter alpha:

Sum of [(Observed value) – (Predicted value)]² + lambda * { [alpha * slope²] + [(1 – alpha) * modulus(slope)] }

where alpha takes a value between 0 and 1.

If alpha is equal to one, ElasticNet behaves like Ridge regression, and if alpha is equal to zero, it behaves like Lasso regression. For values between zero and one, it combines the effects of Ridge and Lasso regression.
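scikit-learn exposes this model as ElasticNet; note that its mixing parameter l1_ratio weights the modulus (Lasso) term, so it is the reverse of the alpha convention used in the equation above. The correlated-feature data below are synthetic and illustrative only.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)

# Synthetic data: two highly correlated useful features plus one irrelevant feature.
x1 = rng.normal(size=80)
x2 = x1 + rng.normal(scale=0.01, size=80)   # almost identical to x1
x3 = rng.normal(size=80)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.1, size=80)

# `alpha` is the overall penalty strength; `l1_ratio` is the weight on the
# modulus term (1.0 = pure Lasso, 0.0 = pure Ridge).
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)

Unlike plain Lasso, which would typically keep only one of the two correlated features, ElasticNet tends to spread the weight across both of them.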
