
In a previous article, How to fit lines Using Least Square Method?, we discussed how the Least Square Method can be used to fit lines. In this article we are going to discuss another optimisation method called Gradient Descent.
Gradient Descent is an effective tool for adjusting a model’s parameters with the aim of minimising a cost function. Here the cost function is the sum of squared residuals. In simple words, the cost function is nothing but the sum of the squared differences between the observed values and the predicted values.
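To make this concrete, here is a minimal sketch of that cost function in Python. The straight-line model, the data points and the parameter values are all made-up assumptions for illustration, not values from the article:

```python
# Sum of squared residuals (SSR) for a straight-line model y = intercept + slope * x.
# The data points and parameter values below are made up purely for illustration.

def predict(x, intercept, slope):
    return intercept + slope * x

def sum_of_squared_residuals(xs, ys, intercept, slope):
    # residual = observed value - predicted value; the cost is the sum of their squares
    return sum((y - predict(x, intercept, slope)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.8]

print(sum_of_squared_residuals(xs, ys, intercept=0.0, slope=1.0))
```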
How does it work?
Before we start, let me introduce you to my friend Raj. Raj is blind and lives in a 2-dimensional world. He was happily living in a valley until one day someone took him up into the mountains, and now he has lost that person. Raj wants to make his way all the way down to the valley, and the only thing he has to navigate with is a stick. So how will he descend towards the valley?
Despite living in a 2-dimensional world, Raj knows gradient descent. He first puts his stick on the right and finds that the ground on the right is slightly higher than the ground where he is standing. Then he puts his stick on the left and finds that the ground on the left is slightly lower than where he is standing. Since Raj wants to go down the hill, he moves one step to the left. He then repeats the same process until he reaches the valley.
Now let’s assume that Raj also has the power to manipulate his size. Using this power, Raj increases his size and decides to go down the hill, thinking that a big step might help him reach the valley quicker.
But after reaching a particular height he started going up, then down again, and he never reached the valley.
At this point he knows that something is going wrong, since he hasn’t reached the valley yet. He realises that the problem is his large steps, which are not getting him anywhere near the valley. He also knows that very small steps may take much more time to get there.
So he comes up with an idea: he starts with a large step and, with every step towards the valley, he decreases the step size. In this way he finally reaches the valley.
Does this example have anything to do with adjusting a model’s parameters?
We have already seen that when fitting lines our ultimate goal is to minimise the cost function. In the Least Square Method we used an analytical approach to find the optimal values of the model’s parameters. In gradient descent we will instead use an iterative, numerical approach.
We know that in linear regression we use a straight line to describe the relationship between two variables. We do this by finding the optimal values of the intercept and slope which minimise the cost function. Assume for now that we already have the optimal value of the slope. Let’s see how we can use gradient descent to find the optimal value of the intercept. To get started we can choose any value for the intercept at random. Here, let’s say the initial intercept is equal to zero.
We have drawn a line with intercept zero and some slope, say s1. Now let’s calculate the sum of squared residuals for this intercept, I1.
Similarly, we can do this for various values of the intercept.
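As a rough sketch of what this sweep looks like in code, here is the same illustrative data as above with the slope held fixed at an assumed value, evaluated at a handful of candidate intercepts:

```python
# Sketch: compute the sum of squared residuals for several candidate intercepts,
# holding the slope fixed. Data and slope are illustrative assumptions.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.8]
slope = 1.0  # assumed to already be at (or near) its optimal value

def ssr(intercept):
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

for intercept in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    print(f"intercept = {intercept:+.1f}   SSR = {ssr(intercept):.3f}")
```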
After finding the sum of squared residuals for various values of the intercept, we get a graph which looks like this:
Now we have to find the global minimum, because that is where the sum of squared residuals is smallest. Before you get lost, let me remind you that the minimum sum of squared residuals gives the line which best fits the data.
Now let’s use Gradient Descent. This is where the story about Raj comes in: it is exactly what Raj was working through in his mind while he was searching for the valley. Raj used a stick to find the slope, but we will use a derivative to find the slope. Don’t get scared when I say derivative. The diagram below shows what the derivative actually is.
We take the derivative at our current point on the curve and use it to decide where to move, i.e. left or right. If the slope is negative we move to the right, and if it is positive we move to the left; in both cases we are moving downhill, opposite to the sign of the derivative.
The amount by which we move down towards the minimum is called the step, like Raj’s superpower that he used when he was moving down to the valley. Here we can see that as we move from one end of the curve to the other, the slope changes drastically: it is steep far from the minimum and nearly flat close to it. This means the step should depend on the slope, and it is given by the slope multiplied by the learning rate. The learning rate is to be decided by us, and it should be chosen carefully: a large learning rate might result in steps that overshoot the minimum, while a small learning rate may take forever to get to the minimum.
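As a sketch of a single update, here is the derivative of the sum of squared residuals with respect to the intercept and the resulting step, using the same illustrative data as before; the slope and learning rate values are assumptions:

```python
# One gradient-descent update of the intercept, with the slope held fixed.
# Data, slope and learning rate are illustrative assumptions.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.8]
slope = 1.0
learning_rate = 0.01

def ssr_derivative(intercept):
    # d/d(intercept) of sum((y - (intercept + slope*x))**2)
    # = -2 * sum(y - (intercept + slope*x))
    return -2 * sum(y - (intercept + slope * x) for x, y in zip(xs, ys))

intercept = 0.0                                    # random initial guess
step = learning_rate * ssr_derivative(intercept)   # step = slope of the curve * learning rate
intercept = intercept - step                       # move opposite to the sign of the derivative
print(intercept)
```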
We always set some limited number of steps for finding the minimum. If this number of steps is completed before the minimum is found, the algorithm stops anyway; this upper limit ensures that the algorithm does not run forever. But if the algorithm finds the minimum before the number of steps is used up (in practice, when the steps become very small), it stops early and gives the optimal line.
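Putting the pieces together, here is a rough, self-contained sketch of gradient descent for the intercept alone, with both stopping conditions: a maximum number of steps and a tolerance on the step size. All numeric values are assumptions for illustration:

```python
# Gradient descent for the intercept only, with the slope held fixed.
# Data, slope, learning rate, tolerance and max_steps are illustrative assumptions.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.8]
slope = 1.0
learning_rate = 0.01
max_steps = 1000     # upper limit so the algorithm cannot run forever
tolerance = 1e-6     # stop early once the steps become tiny

intercept = 0.0      # random initial guess
for _ in range(max_steps):
    derivative = -2 * sum(y - (intercept + slope * x) for x, y in zip(xs, ys))
    step = learning_rate * derivative
    intercept -= step
    if abs(step) < tolerance:   # minimum reached before the step budget runs out
        break

print(f"intercept that minimises the SSR: {intercept:.4f}")
```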
Once we get the minimum value, we take the intercept corresponding to it and put it into the equation to find the line that best describes the relation between the two variables.
In the above example we only varied the intercept to find the minimum sum of squared residuals, but in the real world we also need to find the slope for which we get the minimum sum of squared residuals. To do this we repeat the same procedure, but using both the intercept and the slope: we take the partial derivative of the cost function with respect to the intercept and with respect to the slope, which gives the slope of the curve in each direction at every point.
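Here is a final sketch of that two-parameter version, again under the same illustrative data and assumed settings; the only change is that we take a partial derivative for each parameter and update both at every step:

```python
# Gradient descent over both the intercept and the slope.
# Data, learning rate, tolerance and max_steps are illustrative assumptions.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.2, 3.8]
learning_rate = 0.01
max_steps = 10000
tolerance = 1e-6

intercept, slope = 0.0, 0.0   # random initial guesses
for _ in range(max_steps):
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Partial derivatives of the sum of squared residuals:
    d_intercept = -2 * sum(residuals)
    d_slope = -2 * sum(r * x for r, x in zip(residuals, xs))
    step_intercept = learning_rate * d_intercept
    step_slope = learning_rate * d_slope
    intercept -= step_intercept
    slope -= step_slope
    if max(abs(step_intercept), abs(step_slope)) < tolerance:
        break

print(f"intercept ≈ {intercept:.3f}, slope ≈ {slope:.3f}")
```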
Summary:
We now know that gradient descent can optimise two parameters, but if there are more than two parameters we can still use gradient descent. If we were using least squares to solve for the optimal values, we would simply find where the slope of the curve is equal to zero, whereas gradient descent finds the minimum by taking steps from an initial guess until it reaches the best value.
Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]