Question 1

What is the algorithm pattern for the Adam optimizer?

Accepted Answer

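Adaptive Moment Estimation (Adam): the optimizer keeps an exponential moving average of the gradients (m, the first moment, which acts like momentum) and of the squared gradients (v, the second moment, as in RMSProp). Both averages start at zero, so the early estimates are biased toward zero; bias correction rescales them to compensate for this cold start.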
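To make the cold-start bias concrete, here is a minimal sketch in plain Python. The function name `ema_with_bias_correction` and the constant gradient of 2.0 are illustrative assumptions, not part of the original answer. It feeds a constant gradient into a zero-initialized moving average and compares the raw value with the bias-corrected one.

```python
def ema_with_bias_correction(grads, beta=0.9):
    """Return (raw, corrected) moving averages after processing all grads.

    Hypothetical helper for illustration; assumes grads is non-empty.
    """
    m = 0.0  # zero initialization is the source of the bias
    t = 0
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1 - beta) * g  # exponential moving average
    m_hat = m / (1 - beta ** t)        # divide out the missing weight
    return m, m_hat


# Three identical gradients: the true average is obviously 2.0.
raw, corrected = ema_with_bias_correction([2.0, 2.0, 2.0])
print(raw, corrected)
```

With a constant gradient the raw average after three steps is only about 0.542, while the corrected value recovers the gradient itself (about 2.0). As t grows, 1 − β^t approaches 1 and the correction fades out.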

Question 2

How do you compute an Adam optimizer update step by step?

Accepted Answer

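Start with m_0 = 0 and v_0 = 0, and count steps from t = 1. At each step, with gradient g_t:

1. m_t = β1·m_{t-1} + (1−β1)·g_t (exponential moving average of the gradient).
2. v_t = β2·v_{t-1} + (1−β2)·g_t² (exponential moving average of the squared gradient, elementwise).
3. Bias-correct: m̂_t = m_t/(1−β1^t), v̂_t = v_t/(1−β2^t).
4. Update: w_t = w_{t-1} − lr·m̂_t / (√v̂_t + ε).

Parameters with consistently large gradients get a large v̂ and therefore a smaller effective learning rate, so the step size is scaled automatically per parameter.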
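The update can be sketched in plain Python as follows. This is a minimal illustration on lists of floats, not a framework implementation; the function name `adam_step`, the toy objective f(w) = Σ w_i², and the learning rate of 0.1 are assumptions chosen for the example.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count.

    Hypothetical helper: w, g, m, v are equal-length lists of floats.
    Returns the new (w, m, v).
    """
    # Moving averages of the gradient and of the squared gradient.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction for the zero initialization.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Per-parameter scaled step.
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


# Toy objective f(w) = sum(w_i^2), whose gradient is 2*w.
w0 = [1.0, -100.0]
g0 = [2.0 * wi for wi in w0]
w1, m1, v1 = adam_step(w0, g0, m=[0.0, 0.0], v=[0.0, 0.0], t=1, lr=0.1)
print(w1)
```

On the first step both parameters move by roughly lr (here 0.1) toward zero, even though their gradients differ by a factor of 100. That is the per-parameter scaling at work: the step depends on the ratio m̂/√v̂, not on the raw gradient magnitude.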

Question 3

What are common mistakes when using the Adam optimizer?

Accepted Answer

β1 = 0.9 and β2 = 0.999 are the defaults from the original Adam paper and are rarely changed in practice. ε (default 1e-8) does more than prevent division by zero: it also limits the step size when v̂ is very small, so it acts as a hyperparameter in its own right. Adam can generalize worse than SGD with momentum on some tasks.
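A small sketch of what ε does in the denominator, assuming the first update (t = 1), where bias correction gives m̂ = g and v̂ = g². The helper name `first_step_size` is hypothetical.

```python
import math


def first_step_size(g, lr=1e-3, eps=1e-8):
    """Magnitude of the first Adam step for a scalar gradient g.

    Hypothetical helper: at t = 1, m_hat = g and v_hat = g * g.
    """
    return lr * abs(g) / (math.sqrt(g * g) + eps)


zero_grad = first_step_size(0.0)     # no ZeroDivisionError thanks to eps
normal_grad = first_step_size(1.0)   # very close to lr
tiny_grad = first_step_size(1e-10)   # damped: |g| is far below eps
print(zero_grad, normal_grad, tiny_grad)
```

For an ordinary gradient the step is essentially lr. Once |g| drops well below ε, the ε term dominates the denominator and the step shrinks toward zero, which is why changing ε changes the optimizer's behavior and not only its numerical safety.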
