Momentum Gradient Descent
To mitigate gradient descent's slow, oscillating convergence (and its tendency to get stuck near poor local optima), one can add momentum.
In other words, when the gradient keeps flipping direction from one iteration to the next, momentum dampens those oscillations; when the gradient keeps pointing in roughly the same direction, momentum builds up and strengthens the update.
Pros:
- reduces updates for dimensions whose gradients change directions often
- increases updates for dimensions whose gradients point in the same directions
The momentum update is:

v_t = γ · v_{t−1} + η · ∇_θ J(θ)

θ = θ − v_t

Where:
- v_t = momentum update for the current time step
- γ = momentum term (commonly set around 0.9). Another hyper-param to be tuned.
- all other equations and parameters (η, θ, J) are as explained in gradient descent
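A minimal Python sketch of the update above (the quadratic test function and the names `grad_fn`, `theta0` are illustrative, not from the original notes):

```python
import numpy as np

def momentum_gradient_descent(grad_fn, theta0, lr=0.01, gamma=0.9, n_iters=1000):
    """Classical momentum: v_t = gamma * v_{t-1} + eta * grad; theta = theta - v_t."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)              # momentum update, starts at zero
    for _ in range(n_iters):
        g = grad_fn(theta)                # gradient of J at current theta
        v = gamma * v + lr * g            # accumulate velocity
        theta = theta - v                 # step against the accumulated direction
    return theta

# Example: minimize J(theta) = theta_1^2 + 10 * theta_2^2 (hypothetical test function)
grad = lambda th: np.array([2.0 * th[0], 20.0 * th[1]])
print(momentum_gradient_descent(grad, [5.0, 5.0]))
```

Note how the velocity `v` grows along the shallow theta_1 axis (consistent gradient sign) while oscillations along the steep theta_2 axis partially cancel, which is exactly the pros listed above.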
See also: gradient-descent