Like AdaDelta and RMSProp, Adam keeps an exponentially decaying average of past squared gradients $v_t$; in addition, it keeps an exponentially decaying average of past gradients $m_t$, using them as estimates of the first and second moments:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

Since $m_t$ and $v_t$ are zero-initialized, they are biased toward zero, especially during the first steps. To correct for this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The final update is then defined as:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

Where:

  1. $m_t$ = first moment (mean) of the gradients
  2. $v_t$ = second moment (uncentered variance) of the gradients
  3. $\beta_1, \beta_2$ = decay rates
  4. $\eta$ = learning rate
  5. $\epsilon$ = smoothing parameter that avoids division by zero
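The update rule above can be sketched as a few lines of plain Python. This is a minimal illustration for a single scalar parameter, not a production implementation; the function name `adam_step` and the toy objective $f(x) = x^2$ are assumptions for the example, while the hyperparameter defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) are the commonly used ones.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    # Update biased first- and second-moment estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-correct the zero-initialized moments (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
```

Note that the bias correction matters most early on: at $t = 1$ with $\beta_2 = 0.999$, the raw $v_1$ is a thousand times smaller than $\hat{v}_1$, so skipping the correction would make the first steps far too large.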