Adamax

Same as Adam, but replaces the L2 norm in the running average of past squared gradients with the infinity norm. Adam's second-moment estimate $v_t$ becomes an exponentially weighted infinity norm $u_t$:

$$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$$

The parameter update then divides by $u_t$ directly, which needs no bias correction:

$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t} \hat{m}_t$$

Where: all other equations and parameters are as in Adam.

See also: adam

References: http://ruder.io/optimizing-gradient-descent/index.html#adamax
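Below is a minimal sketch of one Adamax step in NumPy. The helper name adamax_update, the argument names (lr, beta1, beta2), and the small eps added to $u_t$ to avoid division by zero are assumptions for illustration, not part of the original entry; the default hyperparameters follow the values commonly cited for Adamax (Kingma & Ba, 2015).

```python
import numpy as np

def adamax_update(theta, grad, m, u, t,
                  lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adamax step. m is the first-moment estimate, u the
    infinity-norm estimate, t the 1-based step counter."""
    # First moment: exponentially decaying average of past gradients, as in Adam
    m = beta1 * m + (1 - beta1) * grad
    # Infinity norm replaces Adam's running average of squared gradients:
    # an elementwise max instead of an L2-style accumulation
    u = np.maximum(beta2 * u, np.abs(grad))
    # Bias-correct the first moment only; u needs no correction
    m_hat = m / (1 - beta1 ** t)
    # eps guards against division by zero early on (an implementation
    # detail, not part of the update rule as stated above)
    theta = theta - lr * m_hat / (u + eps)
    return theta, m, u

# Usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta
theta = np.array([1.0, -2.0])
m, u = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    theta, m, u = adamax_update(theta, 2 * theta, m, u, t)
```

Because the max operation keeps $u_t$ at the largest recent gradient magnitude rather than averaging it, the effective step size is less sensitive to occasional large gradients than Adam's.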