
Gradient Descent

Tip

Get familiar with the concept of Regression & Curve Fitting

Gradient Descent is about optimizing a model's parameters, e.g. Linear Regression (optimize the intercept and slope), Logistic Regression (optimize a squiggle), t-SNE (optimize the clusters), etc.

Gradient refers to the derivatives of the loss function with respect to the parameters, while descent refers to stepping down the slope of the loss curve until the derivative is near zero.

Loss Functions

Loss functions measure how wrong a model's predictions are, i.e. the difference between the actual values and the values predicted by the model.

Some common loss functions are: Sum of Squared Residuals, Mean Absolute Error, Huber Loss, Mean Squared Log Error, etc.
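As a minimal sketch (pure-Python helpers with made-up numbers), the first two losses could be computed like this:

```python
# Minimal sketch of two common loss functions (hypothetical helper names).

def sum_of_squared_residuals(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.0]
print(sum_of_squared_residuals(actual, predicted))  # 0.25 + 0 + 1.0 = 1.25
print(mean_absolute_error(actual, predicted))       # (0.5 + 0 + 1.0) / 3 = 0.5
```

Squaring punishes large residuals much more than small ones, which is why SSR and MAE can prefer different fits for the same data.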

Working of Gradient Descent

Gradient Descent finds the optimal value (e.g. the intercept that brings the derivative of the Sum of Squared Residuals curve near zero, with the intercept on the x-axis) by taking big steps when it is far away from the optimum and baby steps when it is near it.

Note

Using the Least Squares method, we solve directly for the point where the slope of the curve (Sum of Squared Residuals vs Intercept) is zero. In contrast, Gradient Descent finds the minimum by taking steps from an initial guess until it reaches the best value. Gradient Descent is very useful when it is not possible to solve for where the derivative = 0.

Learning Rate

The learning rate determines the step size taken for a parameter (such as the intercept), which influences how the optimal solution is found.

Gradient Descent stops when the step size is very close to 0, which happens when the slope is near 0.

Formula:

Step Size = Slope * Learning Rate
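Plugging in made-up numbers, the formula works like this: the parameter moves in the direction opposite to the slope of the loss curve.

```python
# Hypothetical values for the slope of the loss curve and the learning rate.
slope = -5.7
learning_rate = 0.1

step_size = slope * learning_rate          # Step Size = Slope * Learning Rate
old_intercept = 0.0
new_intercept = old_intercept - step_size  # subtracting moves against the slope
print(round(step_size, 2), round(new_intercept, 2))  # -0.57 0.57
```

Because the slope is negative (the loss decreases as the intercept increases), the update increases the intercept.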

Estimating Intercept and Slope using Gradient Descent

First, differentiate the loss function (SSR) with respect to the intercept and the slope. Then, using the learning rate, update both parameters by their respective step sizes. Once the step sizes get close to 0, the parameters are approaching their optimal values.
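Putting the steps above together, a sketch (with hypothetical data) that updates the intercept and the slope simultaneously:

```python
# Gradient descent on both parameters of y = intercept + slope * x.
# Data points are made up for illustration.
xs = [0.5, 2.3, 2.9]
ys = [1.4, 1.9, 3.2]

intercept, slope = 0.0, 1.0  # initial guesses
learning_rate = 0.01

for step in range(10000):
    # Partial derivatives of SSR with respect to each parameter.
    d_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in zip(xs, ys))
    d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in zip(xs, ys))
    step_i = d_intercept * learning_rate
    step_s = d_slope * learning_rate
    if abs(step_i) < 1e-6 and abs(step_s) < 1e-6:  # both step sizes near 0
        break
    intercept -= step_i
    slope -= step_s

print(round(intercept, 2), round(slope, 2))  # matches the least-squares fit
```

For this data the result agrees with the closed-form least-squares solution (intercept ≈ 0.95, slope ≈ 0.64), which is the point the Note above says Least Squares would find directly.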

Bird’s Eye View