Adam
Like AdaDelta and RMSProp, Adam keeps an exponentially decaying running average of past squared gradients; in addition, it keeps a running average of the past gradients themselves, using the two as estimates of the first and second moments of the gradient:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
Since $m_t$ and $v_t$ are zero-initialized, they are biased towards zero, especially during the first steps. To correct for this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
The final parameter update is then defined as:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
Where:
- $m_t$ = first moment (mean) of the gradients
- $v_t$ = second moment (uncentered variance) of the gradients
- $\beta_1, \beta_2$ = decay rates of the moment estimates
- $\eta$ = learning rate
- $\epsilon$ = smoothing parameter to avoid division by zero
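To make the update concrete, below is a minimal NumPy sketch of a single Adam step following the definitions above. The function name `adam_step`, the default hyperparameter values, and the toy quadratic loss in the usage loop are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update step; defaults follow commonly used values (an assumption)."""
    # Update biased first and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-correct the moments (t is 1-indexed)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(theta) = theta^2 (gradient 2*theta), a toy example
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # approaches 0
```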
See also: gradient-descent adadelta rmsprop
AKA: ADAptive Moment estimation
References: