loading

KV Cache — Step-by-Step Visualization

mediumAIMLGenerative AILLMInference

Step through KV caching in autoregressive generation — watch how storing past key-value pairs avoids redundant computation as the sequence grows.

Algorithm Pattern

Memoize Past Key-Value Pairs

Key Idea

During autoregressive generation, each new token must attend to ALL previous tokens. Without a cache, K and V for every past token are recomputed each step. The KV cache stores them once, reducing complexity from O(n²) to O(n) per token.

Step-by-Step Approach

  1. For each new token, compute Q, K, V from the token embedding.
  2. Without cache: recompute K_i, V_i for all i < t at every step.
  3. With cache: store K_t and V_t immediately after computing them.
  4. On the next step: retrieve past K, V from cache; only compute new K, V.
  5. Memory trade-off: cache grows linearly with sequence length.

Common Gotchas

  • KV cache memory = num_layers × num_heads × seq_len × head_dim × 2 × bytes.
  • Paged attention (vLLM) manages KV cache memory like virtual memory pages.
  • KV cache only works for DECODER attention — encoder KV is reused across all decode steps.

Related Problems