Momentum Gradient Descent
To mitigate gradient descent's slow, oscillating convergence (and its tendency to get stuck near poor local optima), one can add momentum.
In other words, when the gradient keeps flipping direction from one iteration to the next, momentum dampens those oscillations; when the gradient keeps pointing in roughly the same direction, momentum builds up and strengthens the update.
Pros:
- reduces updates for dimensions whose gradients change directions often
- increases updates for dimensions whose gradients point in the same directions
The momentum update is:

v_t = γ · v_{t−1} + η · ∇_θ J(θ)

θ = θ − v_t

Where:
- v_t = momentum update for the current time step
- γ = momentum term (commonly set around 0.9). Another hyper-param to be tuned.
- all other equations and parameters (η, θ, J) are as explained in gradient descent
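A minimal Python sketch of the update above (the quadratic test function and the names `grad_fn`, `theta0` are illustrative, not from the original notes):

```python
import numpy as np

def momentum_gradient_descent(grad_fn, theta0, lr=0.01, gamma=0.9, n_iters=1000):
    """Classical momentum: v_t = gamma * v_{t-1} + eta * grad; theta = theta - v_t."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)              # momentum update, starts at zero
    for _ in range(n_iters):
        g = grad_fn(theta)                # gradient of J at current theta
        v = gamma * v + lr * g            # accumulate velocity
        theta = theta - v                 # step against the accumulated direction
    return theta

# Example: minimize J(theta) = theta_1^2 + 10 * theta_2^2 (hypothetical test function)
grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
print(momentum_gradient_descent(grad, [5.0, 5.0]))
```

Note how the velocity `v` grows along the shallow theta_1 axis (consistent gradient sign) while oscillations along the steep theta_2 axis partially cancel, which is exactly the pros listed above.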
See also: gradient-descent