Question 1

What is the algorithm pattern for Layer Normalization?

Accepted Answer

Per-Sample Feature Normalization: LayerNorm computes the mean and variance over the features of each sample independently, whereas BatchNorm computes them per feature over the samples in a batch. LayerNorm's statistics therefore do not depend on batch size, which is one reason it is the standard choice in Transformers.
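
To make the axis difference concrete, here is a minimal sketch on a made-up 2×3 batch (rows are samples, columns are features). The data and variable names are illustrative assumptions. Only the means are shown, since the variance is taken along the same axis.

```python
from statistics import fmean

# Hypothetical toy batch: 2 samples (rows), 3 features (columns).
batch = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
]

# LayerNorm statistics: one mean per sample, taken over that sample's features.
layer_means = [fmean(row) for row in batch]

# BatchNorm statistics: one mean per feature, taken over the samples in the batch.
batch_means = [fmean(col) for col in zip(*batch)]

print(layer_means)  # [2.0, 20.0]
print(batch_means)  # [5.5, 11.0, 16.5]
```

Removing the second sample changes the BatchNorm means but leaves the first sample's LayerNorm mean untouched.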

Question 2

How do you solve Layer Normalization step by step?

Accepted Answer

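Compute the mean μ and variance σ² across all features of one vector. Normalize: x̂ = (x − μ) / √(σ² + ε), where ε is a small constant for numerical stability. Scale and shift: y = γ⊙x̂ + β, with γ and β learned per feature. The normalized x̂ has zero mean and (up to ε) unit variance across features; y generally does not once γ and β move away from 1 and 0. In the original Transformer, LayerNorm is applied after the attention and FFN sublayers (post-LN); many later models apply it before each sublayer instead (pre-LN).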
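
The steps above can be sketched in plain Python for a single feature vector. The function name `layer_norm`, the default `eps=1e-5`, and the sample input are assumptions chosen for illustration. A real model would use a tensor library and learned `gamma` and `beta`.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature vector x, then apply per-feature scale and shift."""
    n = len(x)
    mu = sum(x) / n                               # mean over features
    var = sum((v - mu) ** 2 for v in x) / n       # population (biased) variance
    inv_std = 1.0 / math.sqrt(var + eps)          # 1 / sqrt(var + eps)
    x_hat = [(v - mu) * inv_std for v in x]       # normalized vector
    return [g * h + b for g, h, b in zip(gamma, x_hat, beta)]

# Illustrative input; gamma = 1 and beta = 0 leave the normalized values unchanged.
x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(mean_y, var_y)  # mean close to 0, variance slightly below 1 because of eps
```

With `gamma` at 1 and `beta` at 0 the output keeps zero mean and near-unit variance; any other values rescale and shift each feature.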

Question 3

What are common mistakes when solving Layer Normalization?

Accepted Answer

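Mixing up the normalization axis: LayerNorm normalizes across features, while BatchNorm normalizes across batch samples. Assuming it needs a large batch: LayerNorm works with batch size 1, which matters for autoregressive generation. Treating RMSNorm (used in LLaMA) as identical to LayerNorm: RMSNorm removes the mean subtraction step for efficiency and rescales by the root mean square only.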
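
For comparison, here is a rough sketch of RMSNorm on one vector, assuming the common formulation with a learned gain and no shift term. The name `rms_norm`, the default `eps=1e-6`, and the input are illustrative assumptions, not taken from any particular codebase.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Rescale x by its root mean square; no mean subtraction, no shift."""
    mean_sq = sum(v * v for v in x) / len(x)      # mean of squared features
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)      # 1 / sqrt(mean(x^2) + eps)
    return [g * v * inv_rms for g, v in zip(gamma, x)]

# Illustrative input with a nonzero mean.
x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x, gamma=[1.0] * 4)

mean_y = sum(y) / len(y)
mean_sq_y = sum(v * v for v in y) / len(y)
print(mean_y, mean_sq_y)  # mean stays nonzero; mean square is close to 1
```

Unlike LayerNorm, the output is not centered: an input with a nonzero mean keeps a nonzero mean after normalization.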

