Lesson 8 · 11 min

Context windows, KV cache & long context

Why a 1M-token context window is impressive, expensive, and slower than you think.

What's actually getting cached

During autoregressive inference, the model has already computed a key (K) and value (V) vector for every previous token, one pair per layer per attention head. These are stored in the KV cache so that later tokens can attend to them without recomputing.
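As a minimal sketch of the idea (the dimensions `n_layers`, `n_heads`, and `d_head` below are illustrative assumptions, not any particular model's values), you can picture the cache as one growing K tensor and one growing V tensor per layer:

```python
import torch

n_layers, n_heads, d_head = 32, 32, 128  # hypothetical model dimensions

# One (K, V) pair per layer; each grows by one position per generated token.
kv_cache = [
    {"k": torch.empty(0, n_heads, d_head), "v": torch.empty(0, n_heads, d_head)}
    for _ in range(n_layers)
]

def append_token(kv_cache, new_k, new_v):
    """Append this token's K/V (each of shape [n_layers, n_heads, d_head]) to every layer's cache."""
    for layer, k, v in zip(kv_cache, new_k, new_v):
        layer["k"] = torch.cat([layer["k"], k.unsqueeze(0)], dim=0)
        layer["v"] = torch.cat([layer["v"], v.unsqueeze(0)], dim=0)

# After t decoded tokens, layer["k"] has shape [t, n_heads, d_head]:
# attention for the next token reads all t cached positions instead of recomputing them.
```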

The cache grows linearly with context length, and so does the attention compute for each new token. Doubling the context roughly doubles both the cache memory and the per-token compute at inference time.
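A back-of-envelope estimate makes the linear growth concrete. The layer, head, and precision numbers below are illustrative assumptions (a mid-sized model with grouped-query attention and 16-bit cache entries), not a specific model's specs:

```python
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, d_head=128, bytes_per_elem=2):
    """Cache size = 2 (K and V) * layers * KV heads * head dim * bytes per element * tokens."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem * context_len

for ctx in (8_192, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens -> {kv_cache_bytes(ctx) / 2**30:.0f} GiB")
# Output under these assumptions:
#      8192 tokens -> 1 GiB
#    131072 tokens -> 16 GiB
#   1048576 tokens -> 128 GiB
# Doubling the context doubles the cache: growth is linear in the number of tokens.
```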