This method adapts the learning rate to each parameter individually: it makes larger updates for parameters that are updated infrequently and smaller updates for those that are updated frequently.


  1. ε = smoothing parameter (a small constant that avoids division by zero)
  2. All other equations and parameters are as explained in NAG
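As a rough illustration, the per-parameter update can be sketched in plain Python. This assumes the method described here is the standard Adagrad rule, where each gradient is divided by the square root of that parameter's accumulated squared gradients. The helper name `adagrad_step`, its arguments, the default values, and the placement of `eps` inside the square root are assumptions made for this sketch, not taken from the text.

```python
import math

def adagrad_step(params, grads, accum, lr=0.01, eps=1e-8):
    """One Adagrad-style update on plain lists (hypothetical helper).

    params: current parameter values
    grads:  gradient of the loss w.r.t. each parameter
    accum:  running sum of squared gradients, one entry per parameter
    eps:    smoothing parameter that avoids division by zero
    """
    new_params, new_accum = [], []
    for p, g, a in zip(params, grads, accum):
        a = a + g * g                         # accumulate the squared gradient
        p = p - lr * g / math.sqrt(a + eps)   # scale the step per parameter
        new_params.append(p)
        new_accum.append(a)
    return new_params, new_accum

# First parameter receives a gradient, second one does not (sparse case).
new_params, new_accum = adagrad_step([1.0, 1.0], [1.0, 0.0], [0.0, 0.0], lr=0.1)
```

A parameter whose accumulator is still small keeps a comparatively large step size, which is why rarely updated parameters get larger updates.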


  1. Works well for sparse data
  2. Improves the robustness of SGD


  1. Because the squared gradients keep accumulating and never decrease, the effective learning rate shrinks continually and can become so small that learning practically stops
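The shrinking effect can be seen in a small sketch that tracks the effective learning rate of a single parameter over time. It assumes the same Adagrad-style accumulation of squared gradients; the function `effective_lr` and the constant gradient of 1.0 are hypothetical and chosen only to make the trend easy to read.

```python
import math

def effective_lr(grad_history, lr=0.01, eps=1e-8):
    """Return the effective step size after each gradient (hypothetical helper)."""
    rates = []
    accum = 0.0
    for g in grad_history:
        accum += g * g                            # sum of squared gradients only grows
        rates.append(lr / math.sqrt(accum + eps)) # base rate divided by the growing root
    return rates

# 100 steps with a constant gradient of 1.0 (illustrative values only).
rates = effective_lr([1.0] * 100, lr=0.1)
first, last = rates[0], rates[-1]
```

With these illustrative values the effective rate falls by about a factor of ten over 100 steps, and it keeps falling for as long as training continues.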