Lesson 5 · 10 min

Semantic caching — eliminate redundant LLM calls

Semantic caching returns cached responses for semantically equivalent questions, even when the phrasing differs. At scale it can eliminate 30–60% of LLM calls, with no degradation in answer quality for stable knowledge.

The problem: identical intent, different text

A customer support chatbot receives these messages on the same day:

  • "How do I cancel my subscription?"
  • "Cancel subscription, how?"
  • "What's the process to cancel my account?"
  • "I want to cancel, what do I do?"

Exact-match caching (keyed on the literal request text, as in standard HTTP caching) misses all four. Each triggers a full LLM call. Semantic caching instead embeds the query, searches the cache for the nearest stored query above a similarity threshold, and returns that entry's response, paying for an embedding lookup (~0.0001 cents) instead of a full LLM call (~1 cent).
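The lookup loop above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `embed` function below is a toy bag-of-words stand-in for a real embedding model (with a real model you would call an embedding API and typically use a much higher threshold, often 0.85+), and a real cache would use a vector index rather than a linear scan.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: lowercase bag-of-words counts.
    # In production, replace this with an embedding API call.
    return Counter(text.lower().replace("?", "").replace(",", "").split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_response)

    def get(self, query: str):
        # Find the most similar cached query; return its response
        # only if similarity clears the threshold.
        q = embed(query)
        best_response, best_sim = None, 0.0
        for emb, response in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_response, best_sim = response, sim
        return best_response if best_sim >= self.threshold else None

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.6)
cache.put("How do I cancel my subscription?", "Go to Settings, then Billing, then Cancel.")

print(cache.get("Cancel subscription, how?"))       # paraphrase: cache hit
print(cache.get("What are your pricing plans?"))    # unrelated: cache miss, None
```

The threshold is the key tuning knob: too low and unrelated questions get wrong cached answers; too high and paraphrases miss the cache and you pay for the LLM call anyway.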

For frequently asked questions in a support context, semantic cache hit rates above 50% are common.