The invisible cost
When an LLM call fails in production, engineers look at the output. They check the prompt. They tweak the instructions. They rarely look at what they put around those instructions — the 40k tokens of documentation included just in case, the full conversation history going back 30 turns, the retrieved chunks that weren't actually relevant.
Context is the invisible cost. It doesn't throw an error. It degrades quality silently, inflates your bill, and adds latency you attribute to the model rather than to your own engineering decisions.
The mental model that changes how you build: the context window is not infinite storage. It is a fixed-size buffer you allocate on every request, and you pay for every token.
The "lost in the middle" problem
Models reliably attend to content at the start and end of context, but miss content buried in the middle — even when it is directly relevant. If you dump a 40k-token document into the context and put the user question after it, the most important retrieved passages land in the degraded middle zone. The model answers with lower accuracy than if you had retrieved 3 shorter, targeted passages and put them at the top.
This is not a bug. It is a consequence of how transformers learn. A 200k-token context window does not give you 200k tokens of equal attention.
The four context zones
Zone 1 — System prompt (top). Always attended. Instructions, persona, output format. Keep it under 2k tokens.
Zone 2 — Retrieved evidence (after system). The most relevant documents go here, before conversation history. Most teams get this backwards — they append retrieved content at the very end, after history.
Zone 3 — Conversation history (middle). The lowest-attention zone. Recent turns matter; turns from 10 messages ago are largely ignored. Compact old turns with progressive summarization rather than including raw history indefinitely.
Zone 4 — Current user message (bottom). Always attended. The most important dynamic content belongs here.
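Put together, the four zones map directly onto a request. The sketch below assembles them in order with the Anthropic Python SDK; the model id, the passage formatting, and the `retrieved_passages` / `recent_turns` inputs are illustrative assumptions, not prescribed names.

```python
import anthropic

client = anthropic.Anthropic()

def build_request(system_prompt: str,
                  retrieved_passages: list[str],
                  recent_turns: list[dict],
                  user_message: str):
    # Zone 2: retrieved evidence sits right after the system prompt,
    # ahead of conversation history.
    evidence = "\n\n".join(f"<passage>\n{p}\n</passage>" for p in retrieved_passages)

    messages = list(recent_turns)  # Zone 3: already trimmed to recent turns
    messages.append({"role": "user", "content": user_message})  # Zone 4

    return client.messages.create(
        model="claude-sonnet-4-20250514",   # example model id
        max_tokens=2000,                    # output reservation
        system=[
            {"type": "text", "text": system_prompt},                       # Zone 1
            {"type": "text", "text": f"Relevant passages:\n\n{evidence}"}, # Zone 2
        ],
        messages=messages,  # assumes recent_turns begins with a user turn
    )
```

Keeping the retrieved chunks in their own block before the history, rather than appended after it, is the ordering the zones describe; the per-passage tags just make the boundaries explicit.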
A budget that beats "include everything"
A 22k context with highly relevant content consistently outperforms a 100k context with 80k tokens of loosely relevant padding. Budget each slot deliberately:
System prompt: 1,500 tokens (instructions, format, persona)
Cached documents: 8,000 tokens (knowledge base, prompt-cached)
Conversation history: 4,000 tokens (last 8–10 turns only)
Retrieved chunks: 6,000 tokens (3–5 relevant passages for this query)
User message: 500 tokens (this query)
Output reservation: 2,000 tokens (max_tokens)
──────────────────────────────────────
Total: 22,000 tokens (11% of a 200k window)

Relevance determines quality, not volume.
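One way to keep those numbers honest is to make the budget explicit in code. The sketch below is a rough illustration, not a library API: the slot names mirror the budget above, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer count.

```python
BUDGET = {
    "system": 1_500,
    "cached_docs": 8_000,
    "history": 4_000,
    "retrieved": 6_000,
    "user": 500,
}

def estimate_tokens(text: str) -> int:
    # Crude approximation (~4 chars per token); use a real tokenizer in practice.
    return max(1, len(text) // 4)

def trim_history(turns: list[dict], max_tokens: int = BUDGET["history"]) -> list[dict]:
    """Keep only the most recent turns that fit inside the history slot."""
    kept, used = [], 0
    for turn in reversed(turns):
        cost = estimate_tokens(turn["content"])
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))

def select_chunks(chunks: list[str], max_tokens: int = BUDGET["retrieved"]) -> list[str]:
    """Take retrieved chunks in relevance order until the slot is full."""
    kept, used = [], 0
    for chunk in chunks:  # assumed sorted most-relevant first
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```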
Prompt caching: a 90% cost reduction in 30 minutes
If your system prompt plus static reference documents is 20,000 tokens, you pay for all 20,000 tokens on every single request. Prompt caching reduces cached token cost to roughly 10% of normal input cost.
The implementation is a single cache_control field on your stable content block. The cache TTL is 5 minutes, refreshed each time the prefix is reused. On high-traffic endpoints, cache hit rates above 80% are achievable.
The only gotcha: anything before the cache breakpoint must be byte-identical across requests. A timestamp or user ID in the stable prefix causes cache misses on every request. Move all dynamic content after the breakpoint.
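As a minimal sketch with the Anthropic Python SDK (the model id and constant names are placeholders), the breakpoint goes on the last stable block and everything per-request comes after it:

```python
import anthropic

client = anthropic.Anthropic()

STABLE_SYSTEM_PROMPT = "..."   # instructions, persona, format (byte-identical)
STABLE_REFERENCE_DOCS = "..."  # static knowledge base (byte-identical)

def answer(user_query: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",  # example model id
        max_tokens=2000,
        system=[
            {"type": "text", "text": STABLE_SYSTEM_PROMPT},
            {
                "type": "text",
                "text": STABLE_REFERENCE_DOCS,
                # Cache breakpoint: everything up to and including this block
                # must be byte-identical across requests to get cache hits.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Dynamic, per-request content goes after the breakpoint.
        messages=[{"role": "user", "content": user_query}],
    )
```

After the first request, response.usage.cache_read_input_tokens should be non-zero on subsequent calls; if it stays at zero, something in the prefix is changing between requests.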
Conversation history: the context budget leak most teams ignore
History is the only context component that grows without a ceiling. Turn 1 is 200 tokens. Turn 100 overflows the window. Three strategies:
Sliding window (simplest): keep the last N tokens verbatim. Old context is dropped entirely. Acceptable for task-focused sessions where older context doesn't matter.
Progressive summarization (balanced): when history exceeds a threshold, summarize the oldest turns with a cheap model (Haiku costs a fraction of Sonnet) and replace them with the compact summary. Preserves semantic content at lower cost; see the sketch after this list.
Entity extraction (highest fidelity): extract structured facts — user preferences, decisions made, open questions — into a typed memory object. The only approach that scales to multi-day or multi-week sessions.
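A sketch of the middle option, progressive summarization, assuming the Anthropic SDK and the illustrative thresholds below; the cheap-model id, the summary prompt, and the splice point are choices to tune, not a fixed API.

```python
import anthropic

client = anthropic.Anthropic()

HISTORY_BUDGET_TOKENS = 4_000  # mirrors the history slot in the budget above
KEEP_RECENT_TURNS = 8

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude ~4-chars-per-token heuristic

def compact_history(turns: list[dict]) -> tuple[str | None, list[dict]]:
    """If history exceeds its budget, summarize the oldest turns with a cheap
    model and return (summary, recent_turns). The caller splices the summary
    back in, e.g. as an extra system block ahead of the recent turns."""
    total = sum(estimate_tokens(t["content"]) for t in turns)
    if total <= HISTORY_BUDGET_TOKENS or len(turns) <= KEEP_RECENT_TURNS:
        return None, turns

    old, recent = turns[:-KEEP_RECENT_TURNS], turns[-KEEP_RECENT_TURNS:]
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in old)

    summary = client.messages.create(
        model="claude-3-5-haiku-20241022",  # example cheap-model id
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation, keeping decisions made, "
                       "user preferences, and open questions:\n\n" + transcript,
        }],
    ).content[0].text

    return summary, recent
```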
The 10-minute diagnostic
Add these three log lines to any production LLM call and run for one hour:
print(response.usage.input_tokens)
print(response.usage.cache_read_input_tokens)
print(response.usage.cache_creation_input_tokens)

Near-zero cache reads on a busy endpoint mean a caching bug. Growing input_tokens per session means unbounded history. Consistently above 50k tokens on a simple Q&A endpoint means you're including content you don't need. The numbers tell you exactly where the leverage is.
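If you want the hour of logs to read itself, a small aggregation over those three fields flags the same symptoms. The record structure here (a list of dicts with a session_id) is an assumption about how you log, and the thresholds are only the ones named above.

```python
def diagnose(records: list[dict]) -> None:
    """records: one dict per request with input_tokens, cache_read_input_tokens,
    and a session_id, however your logging pipeline stores them (assumed shape)."""
    if not records:
        return
    total_input = sum(r["input_tokens"] for r in records)
    cache_reads = sum(r["cache_read_input_tokens"] or 0 for r in records)

    # Near-zero cache reads on a busy endpoint: caching bug.
    if cache_reads < 0.05 * total_input:
        print("near-zero cache reads: check for dynamic content before the breakpoint")

    # input_tokens growing within a session: unbounded history.
    sessions: dict[str, list[int]] = {}
    for r in records:
        sessions.setdefault(r["session_id"], []).append(r["input_tokens"])
    for sid, sizes in sessions.items():
        if len(sizes) >= 4 and sizes[-1] > 2 * sizes[0]:
            print(f"session {sid}: input tokens keep growing; history is unbounded")

    # Consistently above 50k tokens on a simple Q&A endpoint: unneeded content.
    if total_input / len(records) > 50_000:
        print("average input above 50k tokens: you're including content you don't need")
```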
Going deeper
The [Context Window Engineering course](https://nextgenailearn.com/paths/context-window) covers every technique in this post with runnable code: priority-scored context assembly, all three history management strategies, map-reduce for large documents, prompt caching implementation, and a complete document Q&A capstone that handles 500-page PDFs.
If you're preparing for a technical interview at an AI-first company, context engineering is one of the topics most candidates gloss over. The [AI Engineering Interview Prep course](https://nextgenailearn.com/paths/interview-prep) has a full lesson on context window questions and the answers that signal seniority.