Adam Optimizer — Step-by-Step Visualization

medium · AI/ML · Optimization · Gradient Descent

Step through Adam — watch first and second moment estimates adapt the learning rate per parameter, combining momentum and RMSProp.

Algorithm Pattern

Adaptive Moment Estimation

Key Idea

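Adam keeps exponential moving averages of the gradient (m, the momentum-style first moment) and of the squared gradient (v, the RMSProp-style second moment). Because both are initialized to zero (m_0 = v_0 = 0), the early averages are biased toward zero; bias correction compensates for this cold start.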
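As a rough illustration of why the correction matters, the sketch below runs the first-moment average on a constant gradient. The helper name `ema_with_correction` and the gradient value 2.0 are assumptions made for this example; they are not part of the visualization.

```python
def ema_with_correction(grads, beta=0.9):
    """Return a list of (raw, corrected) moving averages, one pair per step.

    Hypothetical helper for illustration: the average starts at zero and the
    step counter t starts at 1, matching m_0 = 0 in the text.
    """
    avg = 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        avg = beta * avg + (1 - beta) * g    # raw average, biased toward zero early on
        corrected = avg / (1 - beta ** t)    # bias-corrected estimate
        history.append((avg, corrected))
    return history


if __name__ == "__main__":
    # Assumed toy input: the same gradient, 2.0, on every step.
    for t, (raw, corrected) in enumerate(ema_with_correction([2.0] * 5), start=1):
        print(f"t={t}  raw={raw:.4f}  corrected={corrected:.4f}")
```

With a constant gradient, the raw average starts near 0.2 and climbs slowly toward 2.0, while the corrected value sits at roughly 2.0 from the first step onward.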

Step-by-Step Approach

  1. m_t = β1·m_{t-1} + (1−β1)·g_t (exponential moving average of the gradient g_t).
  2. v_t = β2·v_{t-1} + (1−β2)·g_t² (exponential moving average of the element-wise squared gradient).
  3. Bias-correct: m̂ = m_t/(1−β1^t), v̂ = v_t/(1−β2^t), with t counting from 1.
  4. Update: w = w − lr·m̂ / (√v̂ + ε).
  5. Large recent gradients → large v̂ → smaller effective learning rate lr/(√v̂ + ε), so the step is rescaled per parameter.
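The five steps above can be sketched as a single update function. This is a minimal pure-Python sketch, not a reference implementation: the name `adam_step`, the list-of-floats representation, and the toy quadratic loss in the usage section are all assumptions chosen for illustration.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to parallel lists of floats.

    w, g, m, v: parameters, gradients, first and second moment estimates.
    t: 1-based step counter (t=0 would make the bias correction divide by zero).
    Returns new (w, m, v) lists; the inputs are left unmodified.
    """
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi          # step 1: average of gradient
        vi = beta2 * vi + (1 - beta2) * gi * gi     # step 2: average of squared gradient
        m_hat = mi / (1 - beta1 ** t)               # step 3: bias correction
        v_hat = vi / (1 - beta2 ** t)
        wi = wi - lr * m_hat / (math.sqrt(v_hat) + eps)  # step 4: parameter update
        new_w.append(wi)
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v


if __name__ == "__main__":
    # Assumed toy problem: f(w) = w0^2 + 100 * w1^2, so the gradient is (2*w0, 200*w1).
    w, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, 11):
        g = [2 * w[0], 200 * w[1]]
        w, m, v = adam_step(w, g, m, v, t, lr=0.05)
        print(t, [round(x, 4) for x in w])
```

On the first step, m̂ equals the gradient and √v̂ equals its absolute value, so each parameter moves by about lr regardless of how large its gradient is. That is the per-parameter rescaling described in step 5.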

Common Gotchas

  • β1=0.9 and β2=0.999 are the defaults suggested in the original Adam paper and used by most frameworks; they are often left untuned.
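  • The ε (1e-8) keeps the denominator away from zero when v̂ is tiny, but it is not only a safety guard: once √v̂ is comparable to ε, the step shrinks below lr, so changing ε changes the optimizer's behavior.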
  • Adam can fail to generalize on some tasks — SGD with momentum sometimes beats it.
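To make the role of ε concrete, the sketch below computes the size of Adam's very first update for gradients of different magnitudes, using the update rule from the steps above. The helper name `first_step_size` and the sample gradient values are assumptions for illustration only.

```python
import math


def first_step_size(g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude of the first Adam update (t=1) for a scalar gradient g.

    Hypothetical helper: starts from m = v = 0, so after bias correction
    m_hat is g and v_hat is g*g (up to floating-point rounding).
    """
    m_hat = (1 - beta1) * g / (1 - beta1 ** 1)
    v_hat = (1 - beta2) * g * g / (1 - beta2 ** 1)
    return abs(lr * m_hat / (math.sqrt(v_hat) + eps))


if __name__ == "__main__":
    # Assumed sample gradients, from ordinary to extremely small.
    for g in (1.0, 1e-4, 1e-8, 1e-10, 0.0):
        print(f"g={g:g}  step={first_step_size(g):.3e}")
```

For ordinary gradient magnitudes the step is close to lr. As the gradient approaches the scale of ε, the step falls well below lr, and a zero gradient yields a zero step instead of a division error. With `eps=0` and a zero gradient, the same expression raises `ZeroDivisionError`.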
