This method adapts the learning rate to each parameter individually: it performs larger updates for infrequently updated parameters and smaller updates for frequently updated ones.

The update rule for each parameter $\theta_i$ at time step $t$ is:

$$\theta_{t+1,\,i} = \theta_{t,\,i} - \frac{\eta}{\sqrt{G_{t,\,ii} + \epsilon}} \cdot g_{t,\,i}$$

Where:

  1. $\epsilon$ = smoothing term that avoids division by zero
  2. $G_t$ = diagonal matrix in which each entry $G_{t,\,ii}$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$
  3. all other equations and parameters are as explained in NAG
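To make the update concrete, here is a minimal NumPy sketch of one step of this rule; the function name `adagrad_step` and the `lr`/`eps` defaults are my own illustrative choices, not from the original text:

```python
import numpy as np

def adagrad_step(params, grads, grad_sq_accum, lr=0.01, eps=1e-8):
    """One Adagrad-style update with a per-parameter learning rate.

    params:         current parameter vector (theta_t)
    grads:          gradient of the loss w.r.t. params (g_t)
    grad_sq_accum:  running sum of squared gradients (the diagonal of G_t)
    """
    grad_sq_accum = grad_sq_accum + grads ** 2           # accumulate squared gradients
    step = lr * grads / np.sqrt(grad_sq_accum + eps)     # eta / sqrt(G_t + eps) * g_t
    return params - step, grad_sq_accum

# Toy usage: a sparse gradient only moves (and only shrinks the rate for)
# the parameter that actually received a gradient.
params = np.array([0.5, -1.2])
accum = np.zeros_like(params)
grads = np.array([0.1, 0.0])
params, accum = adagrad_step(params, grads, accum)
```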

Pros:

  1. Works well for sparse data
  2. Improves the robustness of SGD

Cons:

  1. Because the squared gradients keep accumulating in the denominator, the effective learning rate shrinks monotonically and can eventually become so small that learning effectively stops (see the illustration below)
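A quick illustration of that con, with toy numbers I chose for the example: even with a constant gradient of 1.0, the accumulator grows every step, so the effective step size $\eta / \sqrt{G_t + \epsilon}$ decays toward zero.

```python
import numpy as np

# Effective Adagrad step size under a constant gradient of 1.0.
lr, eps = 0.01, 1e-8
accum = 0.0
for t in range(1, 10001):
    accum += 1.0 ** 2                     # squared gradient keeps accumulating
    if t in (1, 100, 10000):
        print(t, lr / np.sqrt(accum + eps))
# t=1     -> 0.01
# t=100   -> 0.001
# t=10000 -> 0.0001   (the step size only ever shrinks)
```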