Lesson 8 · 11 min
Context windows, KV cache & long context
Why a 1M-token context is impressive, expensive, and slower than you think.
What's actually getting cached
During inference, the model has already computed K and V vectors for every previous token (one pair per layer per attention head). These are stored in the KV cache so that each new token can attend over them without recomputing them.
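A minimal sketch of that idea (NumPy, single layer, single head, hypothetical dimensions): each decode step computes K and V only for the newest token, appends them to the cache, and attends over everything stored so far.

```python
import numpy as np

d_model = 64                      # hypothetical hidden size
W_q = np.random.randn(d_model, d_model) * 0.02
W_k = np.random.randn(d_model, d_model) * 0.02
W_v = np.random.randn(d_model, d_model) * 0.02

k_cache, v_cache = [], []         # grows by one entry per generated token

def decode_step(x):
    """x: hidden state of the newest token, shape (d_model,)."""
    q = x @ W_q
    k = x @ W_k
    v = x @ W_v
    k_cache.append(k)             # store this token's K and V once...
    v_cache.append(v)
    K = np.stack(k_cache)         # ...then reuse all previous entries for free
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V            # attention output for the new token

for _ in range(5):                # five decode steps; the cache now holds 5 tokens
    out = decode_step(np.random.randn(d_model))
```

In a real model this happens in every layer and every head, which is exactly why the cache gets big.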
The cache grows linearly with context length, and so does the attention compute for each new token. Doubling the context roughly doubles both KV-cache memory and per-token attention compute at inference time.
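A back-of-envelope estimate makes the linear growth concrete. The config below (80 layers, 8 KV heads, head dim 128, 16-bit values) is an assumption chosen to resemble a large grouped-query-attention model, not any specific production model.

```python
def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):   # fp16/bf16 = 2 bytes
    # The leading 2 accounts for storing both K and V per token,
    # per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

for ctx in (8_192, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
```

Under these assumptions that works out to roughly 320 KiB per token, so 8K tokens cost about 2.5 GiB while a 1M-token context costs about 320 GiB of KV cache alone.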