This method adapts the learning rate to each parameter individually: it performs larger updates for infrequently updated parameters and smaller updates for frequently updated ones.

The update rule for each parameter $\theta_i$ at time step $t$ is:

$$\theta_{t+1,\,i} = \theta_{t,\,i} - \frac{\eta}{\sqrt{G_{t,\,ii} + \epsilon}} \cdot g_{t,\,i}$$

Where:

  1. $\epsilon$ = smoothing term that avoids division by zero
  2. $G_t$ = diagonal matrix in which each entry $G_{t,\,ii}$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$
  3. all other equations and parameters are as explained in NAG
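To make the update concrete, here is a minimal NumPy sketch of one step of this rule; the function name `adagrad_step` and the `lr`/`eps` defaults are my own illustrative choices, not from the original text:

```python
import numpy as np

def adagrad_step(params, grads, grad_sq_accum, lr=0.01, eps=1e-8):
    """One Adagrad-style update with a per-parameter learning rate.

    params:         current parameter vector (theta_t)
    grads:          gradient of the loss w.r.t. params (g_t)
    grad_sq_accum:  running sum of squared gradients (the diagonal of G_t)
    """
    grad_sq_accum = grad_sq_accum + grads ** 2           # accumulate squared gradients
    step = lr * grads / np.sqrt(grad_sq_accum + eps)     # eta / sqrt(G_t + eps) * g_t
    return params - step, grad_sq_accum

# Toy usage: a sparse gradient only moves (and only shrinks the rate for)
# the parameter that actually received a gradient.
params = np.array([0.5, -1.2])
accum = np.zeros_like(params)
grads = np.array([0.1, 0.0])
params, accum = adagrad_step(params, grads, accum)
```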

Pros:

  1. Works well for sparse data
  2. Improves the robustness of SGD

Cons:

  1. Because the squared gradients keep accumulating in the denominator, the effective learning rate shrinks monotonically and can eventually become so small that learning effectively stops (see the illustration below)
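A quick illustration of that con, with toy numbers I chose for the example: even with a constant gradient of 1.0, the accumulator grows every step, so the effective step size $\eta / \sqrt{G_t + \epsilon}$ decays toward zero.

```python
import numpy as np

# Effective Adagrad step size under a constant gradient of 1.0.
lr, eps = 0.01, 1e-8
accum = 0.0
for t in range(1, 10001):
    accum += 1.0 ** 2                     # squared gradient keeps accumulating
    if t in (1, 100, 10000):
        print(t, lr / np.sqrt(accum + eps))
# t=1     -> 0.01
# t=100   -> 0.001
# t=10000 -> 0.0001   (the step size only ever shrinks)
```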