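Momentum gradient descent can overshoot the minimum because it takes big jumps driven by the accumulated momentum. It first computes the gradient at the current position and then makes a big jump in the direction of the updated accumulated gradient. Nesterov accelerated gradient (NAG) reverses this order to reduce the overshoot.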

NAG first makes a big jump in the direction of the previously accumulated gradient, that is, the previous update scaled by the momentum coefficient. It then measures the gradient at the point where it ends up and makes a correction accordingly.
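
One common way to write this update is sketched below. The symbols are assumptions and may differ from those used in the momentum gradient descent section: $w_t$ denotes the parameters, $v_t$ the accumulated update, $\gamma$ the momentum coefficient, $\eta$ the learning rate, and $L$ the loss.

```latex
\begin{align}
  w_{\text{look-ahead}} &= w_t - \gamma\, v_{t-1} \\
  v_t &= \gamma\, v_{t-1} + \eta\, \nabla L\!\left(w_{\text{look-ahead}}\right) \\
  w_{t+1} &= w_t - v_t
\end{align}
```

The only difference from plain momentum in this form is that the gradient is evaluated at the look-ahead point rather than at $w_t$.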

Where:

  1. All other equations and parameters are as explained in momentum gradient descent.
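
The jump-then-correct idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `nag_minimize`, the parameter names `lr` and `gamma`, and the quadratic test problem are all hypothetical choices made for this example.

```python
import numpy as np


def nag_minimize(grad_fn, w0, lr=0.1, gamma=0.9, steps=200):
    """Minimize a function with Nesterov accelerated gradient (sketch)."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)  # accumulated update (the "momentum")
    for _ in range(steps):
        # 1. Big jump: move along the previously accumulated update.
        w_lookahead = w - gamma * v
        # 2. Measure: evaluate the gradient where the jump lands.
        g = grad_fn(w_lookahead)
        # 3. Correct: fold that gradient into the update and apply it.
        v = gamma * v + lr * g
        w = w - v
    return w


# Hypothetical test problem: f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2),
# whose minimum is at the origin.
def quadratic_grad(w):
    return np.array([1.0, 10.0]) * w


w_final = nag_minimize(quadratic_grad, w0=[5.0, 5.0], lr=0.05, gamma=0.9, steps=300)
print(w_final)  # both components should be very close to zero
```

Replacing `grad_fn(w_lookahead)` with `grad_fn(w)` turns this sketch into plain momentum gradient descent, which makes the two methods easy to compare on the same problem.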