Lesson 5 · 10 min
Semantic caching — eliminate redundant LLM calls
Semantic caching returns cached responses for semantically equivalent questions, even when the phrasing differs. At scale it can eliminate 30–60% of LLM calls without degrading answer quality, provided the underlying knowledge is stable.
The problem: identical intent, different text
A customer support chatbot receives these messages on the same day:
- "How do I cancel my subscription?"
- "Cancel subscription, how?"
- "What's the process to cancel my account?"
- "I want to cancel, what do I do?"
Exact-match caching (keying on the literal request string, as standard HTTP caching does) misses all of these, so each one triggers a full LLM call. Semantic caching instead embeds the query, finds the nearest cached entry whose similarity clears a threshold, and returns that entry's stored answer. You pay for an embedding lookup (~0.0001 cents) instead of a full LLM call (~1 cent).
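Here is a minimal sketch of that lookup path, assuming cosine similarity over unit-length embedding vectors and a fixed threshold. The names `embed` and `ask_llm` are hypothetical hooks, not a specific library's API; you would wire them to your embedding model and LLM provider.

```python
import numpy as np

# Hypothetical hooks: wire these to your embedding model and LLM provider.
def embed(text: str) -> np.ndarray:
    """Return a unit-normalized embedding vector for `text`."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Make the full (expensive) LLM call."""
    raise NotImplementedError

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold           # minimum cosine similarity for a hit
        self.vectors: list[np.ndarray] = []  # embeddings of cached queries
        self.answers: list[str] = []         # answers stored alongside them

    def answer(self, query: str) -> str:
        q = embed(query)
        if self.vectors:
            # For unit vectors, cosine similarity is just a dot product.
            sims = np.stack(self.vectors) @ q
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.answers[best]    # cache hit: skip the LLM entirely
        result = ask_llm(query)              # cache miss: pay for the LLM call
        self.vectors.append(q)
        self.answers.append(result)
        return result
```

The threshold is the main tuning knob: set it too low and merely related questions get the wrong cached answer, set it too high and you miss legitimate paraphrases. A linear scan is fine for a small cache; at scale you would typically replace it with a vector index or vector database.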
For frequently asked questions in a support context, semantic cache hit rates above 50% are common.
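The arithmetic behind the savings, using the rough figures above: every query pays for an embedding lookup (~0.0001 cents), and only misses pay for the LLM call (~1 cent). At a 50% hit rate that works out to roughly 0.5001 cents per query instead of 1 cent, so spend on these questions is nearly halved, and the embedding overhead adds back only about 0.01% of the original cost.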