Adagrad
This method adapts the learning rate to each parameter: it performs larger updates for infrequently updated parameters and smaller updates for frequently updated ones.
The update rule is:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}$$

Where:
- $G_t$ = diagonal matrix whose $(i, i)$ entry is the sum of the squared gradients w.r.t. $\theta_i$ up to step $t$
- $\epsilon$ = smoothing parameter that avoids division by zero
- all other equations and parameters are as explained in NAG
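A minimal NumPy sketch of this update (the function name `adagrad_update`, the toy objective, and the hyperparameter values are illustrative assumptions, not from the source):

```python
import numpy as np

def adagrad_update(theta, grad, G, lr=0.01, eps=1e-8):
    """One Adagrad step: per-parameter step sizes from accumulated squared gradients."""
    G = G + grad ** 2                               # accumulate squared gradients element-wise
    theta = theta - lr * grad / (np.sqrt(G) + eps)  # per-parameter scaled update
    return theta, G

# Toy usage: minimize f(theta) = theta_1^2 + theta_2^2, whose gradient is 2 * theta
theta = np.array([5.0, -3.0])
G = np.zeros_like(theta)
for _ in range(1000):
    grad = 2 * theta
    theta, G = adagrad_update(theta, grad, G, lr=0.5)
print(theta)  # moves toward the minimum at [0, 0]
```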
Pros:
- Works well for sparse data
- Improves the robustness of SGD
Cons:
- Because the squared gradients accumulated in the denominator only grow, the effective learning rate keeps shrinking and can eventually become vanishingly small, stalling training (see the sketch below)
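A toy illustration of that shrinkage, assuming a constant unit gradient so the accumulator grows linearly with the step count (values are illustrative):

```python
import numpy as np

lr, eps, G = 0.1, 1e-8, 0.0
for t in range(1, 10001):
    g = 1.0                                # constant gradient of magnitude 1
    G += g ** 2                            # accumulator grows linearly with t
    if t in (1, 100, 10000):
        print(t, lr / (np.sqrt(G) + eps))  # effective step size ~ lr / sqrt(t)
# prints roughly 0.1 at t=1, 0.01 at t=100, 0.001 at t=10000
```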
See also: gradient-descent nesterov-accelerated-gradient