Step through KV caching in autoregressive generation — watch how storing past key-value pairs avoids redundant computation as the sequence grows.
Memoize Past Key-Value Pairs
During autoregressive generation, each new token must attend to ALL previous tokens. Without a cache, K and V for every past token are recomputed each step. The KV cache stores them once, reducing complexity from O(n²) to O(n) per token.