Residual Connection — Step-by-Step Visualization

Step through a residual (skip) connection — watch the input bypass a sublayer and add directly to its output, enabling very deep networks.

Algorithm Pattern

Skip Connection

Key Idea

A residual connection computes output = F(x) + x. Because the derivative of this sum with respect to x is F'(x) + 1, the gradient has a direct path to early layers that does not pass through F. This mitigates vanishing gradients in deep networks.
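
To make the gradient argument concrete, here is a minimal sketch that compares the input gradient of a deep stack of scalar layers with and without the skip. The layer form F(h) = tanh(w * h), the 50-layer depth, and the weight 0.5 are illustrative assumptions, not taken from the visualization.

```python
import math

def chain_gradient(weights, x, use_skip):
    """Return d(output)/d(input) for a stack of scalar layers.

    Each layer applies F(h) = tanh(w * h). With use_skip the layer
    output is F(h) + h, otherwise it is just F(h).
    """
    h = x
    grad = 1.0
    for w in weights:
        f = math.tanh(w * h)
        df = w * (1.0 - f * f)      # F'(h) by the chain rule
        if use_skip:
            grad *= df + 1.0        # d/dh [F(h) + h] = F'(h) + 1
            h = f + h
        else:
            grad *= df              # d/dh F(h) = F'(h)
            h = f
    return grad

weights = [0.5] * 50                # hypothetical 50-layer stack
plain = chain_gradient(weights, 1.0, use_skip=False)
skip = chain_gradient(weights, 1.0, use_skip=True)
print(f"without skip: {plain:.3e}   with skip: {skip:.3e}")
```

In this toy setting every per-layer factor without the skip is at most 0.5, so the product shrinks geometrically, while every factor with the skip is at least 1.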

Step-by-Step Approach

  1. Save the input: residual = x.
  2. Pass through sublayer F (attention or FFN): compute F(x).
  3. Add the skip: output = F(x) + residual.
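  4. Apply LayerNorm to the sum (post-norm ordering, as in the original Transformer; pre-norm designs instead normalize x before F and leave the sum unnormalized).
  5. Gradient flows both through F and directly via the skip, so it is far less prone to vanishing.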
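
The steps above can be sketched in plain Python on a single vector. This is a simplified illustration: `toy_sublayer` is a hypothetical stand-in for attention or an FFN, and `layer_norm` omits the learned scale and bias that a real LayerNorm carries.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def toy_sublayer(v):
    # Hypothetical sublayer F: any map that preserves the vector length.
    return [math.tanh(0.5 * a) for a in v]

def residual_block(x, sublayer):
    residual = x                                      # step 1: save the input
    fx = sublayer(x)                                  # step 2: compute F(x)
    if len(fx) != len(residual):
        raise ValueError("F(x) and x must have the same shape")
    summed = [f + r for f, r in zip(fx, residual)]    # step 3: add the skip
    return layer_norm(summed)                         # step 4: LayerNorm on the sum

x = [1.0, -2.0, 0.5, 3.0]
y = residual_block(x, toy_sublayer)
print(y)
```

Step 5 has no line of its own here: it is a property of the addition in step 3, which passes the gradient of the sum unchanged to both `fx` and `residual`.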

Common Gotchas

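  • The skip connection requires x and F(x) to have the same shape; when they differ, the skip path needs a projection (for example a linear layer) to match.
  • Residual connections are a key reason networks with hundreds of layers, and even 1000+ layers in ResNet experiments, can be trained; normalization and careful initialization also matter.
  • Pre-norm (LayerNorm before F) vs post-norm (LayerNorm after the addition): the original Transformer used post-norm, while many modern Transformers use pre-norm because it tends to train more stably at depth.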
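
The difference between the two orderings, and the shape requirement, can be illustrated with a short sketch. The function names `post_norm_block`, `pre_norm_block`, `add`, and `toy_sublayer` are hypothetical, and `layer_norm` is a simplified version without learned parameters.

```python
import math

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add(a, b):
    # The skip addition is only defined when both sides have the same shape.
    if len(a) != len(b):
        raise ValueError(f"shape mismatch: {len(a)} vs {len(b)}")
    return [p + q for p, q in zip(a, b)]

def post_norm_block(x, sublayer):
    # Post-norm: LayerNorm(F(x) + x). The skip path passes through the norm.
    return layer_norm(add(sublayer(x), x))

def pre_norm_block(x, sublayer):
    # Pre-norm: F(LayerNorm(x)) + x. The skip path is a pure identity.
    return add(sublayer(layer_norm(x)), x)

def toy_sublayer(v):
    # Hypothetical stand-in for attention or an FFN.
    return [0.1 * a for a in v]

x = [10.0, 12.0, 14.0, 16.0]
post = post_norm_block(x, toy_sublayer)
pre = pre_norm_block(x, toy_sublayer)
print("post-norm:", post)
print("pre-norm: ", pre)
```

In the pre-norm output the original scale of x survives because nothing normalizes the skip path, whereas the post-norm output is re-centered and re-scaled after every block.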