
Layer Normalization — Step-by-Step Visualization

easy · AI/ML · Normalization · Transformer

Step through layer normalization — watch a single token's feature vector normalized across its dimensions, as used in every Transformer block.

Algorithm Pattern

Per-Sample Feature Normalization

Key Idea

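LayerNorm normalizes across the features of each sample independently, whereas BatchNorm normalizes each feature across the samples in a batch. LayerNorm's statistics therefore do not depend on batch size, which matters for Transformers, where inference often runs on a single sequence.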
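The difference comes down to which axis the statistics are reduced over. The sketch below is only an illustration, assuming NumPy and a toy batch of shape (samples, features); the helper name `normalize` and the data values are made up for the example, and the learned scale and shift are left out.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """Subtract the mean and divide by the standard deviation along `axis`."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Toy batch: 3 samples (rows) with 4 features each (columns). Values are arbitrary.
batch = np.array([[1.0, 2.0, 3.0, 4.0],
                  [10.0, 20.0, 30.0, 40.0],
                  [-1.0, 0.0, 1.0, 2.0]])

# LayerNorm-style: statistics per sample, reduced over the feature axis.
layer_style = normalize(batch, axis=-1)

# BatchNorm-style: statistics per feature, reduced over the batch axis.
batch_style = normalize(batch, axis=0)

# Normalizing the first sample on its own gives the same row as normalizing
# it inside the batch, because no other sample enters its statistics.
single = normalize(batch[:1], axis=-1)
same_as_in_batch = np.allclose(single, layer_style[:1])
```

With the batch-axis version, the same check would fail: removing or adding samples changes every feature's mean and variance.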

Step-by-Step Approach

  1. Compute mean μ and variance σ² across all features of one vector.
  2. Normalize: x̂ = (x − μ) / √(σ² + ε).
  3. Scale and shift: y = γ⊙x̂ + β (learned per feature).
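  4. Before scaling and shifting, x̂ has zero mean and approximately unit variance across features; y generally does not, because γ and β rescale and shift each feature.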
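  5. Applied around the attention and FFN sublayers of each Transformer block: after each sublayer in the original post-LN design, before each sublayer in pre-LN variants.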
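The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `layer_norm`, the default `eps=1e-5`, and the example token are assumptions, and the variance is the biased estimate (divide by the number of features), since the steps do not say which estimator is meant.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the last axis of x, then apply a per-feature scale and shift.

    Returns both y (the final output) and x_hat (the normalized vector).
    """
    # Step 1: mean and (biased) variance across the features of each vector.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # Step 2: normalize; eps keeps the division stable when var is near zero.
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Step 3: learned per-feature scale (gamma) and shift (beta).
    y = gamma * x_hat + beta
    return y, x_hat

# A single token with 4 features (arbitrary values): mean 5, variance 5.
token = np.array([2.0, 4.0, 6.0, 8.0])

# With gamma = 1 and beta = 0 the output equals x_hat,
# approximately [-1.342, -0.447, 0.447, 1.342].
y_identity, x_hat = layer_norm(token, np.ones(4), np.zeros(4))

# With other parameters the output is no longer zero-mean or unit-variance.
y_scaled, _ = layer_norm(token, np.full(4, 2.0), np.full(4, 1.0))
```

Here `y_scaled` has mean 1 and a standard deviation of about 2 across features, which shows why the zero-mean, unit-variance property belongs to x̂ rather than to the final output.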

Common Gotchas

  • LayerNorm normalizes across features; BatchNorm normalizes across batch samples.
  • LayerNorm works with batch size 1 — critical for autoregressive generation.
  • RMSNorm (used in LLaMA) drops the mean subtraction and divides by the root mean square of the features instead, which is cheaper to compute.
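For comparison, here is a rough sketch of RMSNorm as it is commonly described: divide by the root mean square of the features and apply a learned scale, with no mean subtraction and no shift. The name `rms_norm`, the placement of `eps` inside the square root, and the default value are assumptions for illustration; this is not taken from the LLaMA code.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    """Scale x so its features have (approximately) unit root mean square."""
    # Root mean square over the feature axis; note there is no mean subtraction.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

# Same arbitrary 4-feature token as a LayerNorm example might use.
token = np.array([2.0, 4.0, 6.0, 8.0])
out = rms_norm(token, np.ones(4))

# The mean square of the output is about 1, but its mean is not zero,
# because the input was never centered.
mean_square = np.mean(out ** 2)
mean_value = out.mean()
```

Skipping the centering step saves one reduction over the features, at the cost of the output no longer being zero-mean.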
