Lesson 4 · 10 min
Prompt caching at every stable prefix
The single largest cost lever modern providers offer: 80-90% discounts on repeated context prefixes. The catch: prefixes have to be byte-identical.
The mechanic
When you send the same prefix tokens across requests, the provider can skip recomputing the K/V cache for those tokens, which is the expensive part of inference, and passes that savings back to you as a discount.
- Anthropic: explicit cache_control: ephemeral breakpoints, 90% discount on cached input tokens, 5-minute TTL (see the sketch after this list).
- OpenAI: automatic for prefixes ≥1024 tokens.
- Google Gemini: an explicit cachedContent resource you create and reference.
- Bedrock: per-model implementation.
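
Here is a minimal sketch of the Anthropic flavor using the anthropic Python SDK. The model ID, file name, and prompt text are placeholders; the assumption is that any cache-capable Claude model behaves the same way.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Long, stable context (docs, policies) goes at the front and is marked with
# cache_control so the provider stores the K/V cache for it.
STATIC_CONTEXT = open("product_docs.md").read()  # hypothetical reference doc

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumption: any model with caching support works
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a support agent for Acme."},
        {
            "type": "text",
            "text": STATIC_CONTEXT,
            # everything up to and including this block becomes the cached prefix
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        # only this part changes between requests, so it sits after the prefix
        {"role": "user", "content": "How do I rotate my API key?"},
    ],
)

# usage reports cache_creation_input_tokens (tokens written to the cache on the
# first call) and cache_read_input_tokens (tokens served from the cache, discounted)
print(response.usage)
```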
The rules are the same across all of them: the prefix has to be byte-identical, and the cache has a TTL (around 5 minutes by default; longer-lived caches cost extra).
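
The byte-identical rule is mostly about how you assemble the prompt: anything volatile (dates, user IDs, retrieved chunks) has to come after the stable prefix, or every request misses the cache. Below is a provider-agnostic sketch in Python; the names and prompt text are made up for illustration.

```python
import datetime

# Stable material: system prompt, policies, tool definitions. This string must be
# assembled the same way every time, down to whitespace, or the prefix is no
# longer byte-identical and the cache misses.
POLICY_TEXT = "Refunds within 30 days. Escalate anything involving billing."
STABLE_SYSTEM_PROMPT = (
    "You are a support agent. Follow the policy below.\n\n" + POLICY_TEXT
)

def build_messages(user_question: str) -> list[dict]:
    # Bad: putting f"Today is {datetime.date.today()}." at the START of the system
    # prompt changes the first tokens every day and invalidates the whole prefix.
    # Good: stable prefix first, volatile details appended at the very end.
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"Today is {datetime.date.today()}.\n\n{user_question}",
        },
    ]

# Every call now shares the same opening tokens, so providers with automatic
# prefix caching (e.g. OpenAI for prompts >= 1024 tokens) can reuse them.
messages = build_messages("How do I rotate my API key?")
```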