Lesson 9 · 10 min
RAG in production: cost, latency, freshness
A working RAG demo is 10% of the work. The rest is keeping it healthy.
The four production levers
1. Latency
Total latency = query embedding + ANN search + (optional rerank) + LLM generation. Generation usually dominates; the retrieval stages are where caching helps most.
- Cache query embeddings (LRU on the literal query string).
- Use a small, fast embedding model for queries (e.g. BGE-small or voyage-3-lite).
- Keep the system prompt + few-shot prefix byte-stable so provider-side prompt caching applies; cached prefix tokens are typically discounted heavily (often 50–90%, provider-dependent).
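The first bullet can be a one-liner with `functools.lru_cache`, keyed on the literal query string. A minimal sketch; `_embed_uncached` is a hypothetical stand-in for your real embedding client:

```python
from functools import lru_cache

def _embed_uncached(query: str) -> tuple[float, ...]:
    # Hypothetical: replace with a call to your embedding model/API.
    # Returns a tuple because cached values should be immutable.
    return tuple(float(ord(c) % 7) for c in query)  # stand-in vector

@lru_cache(maxsize=10_000)
def embed_query(query: str) -> tuple[float, ...]:
    # lru_cache keys on the exact query string; a repeated query
    # skips the embedding call entirely.
    return _embed_uncached(query)

v1 = embed_query("refund policy")
v2 = embed_query("refund policy")  # served from cache, same object
assert v1 is v2
```

Note this only helps with exact repeats; normalizing the query (lowercase, strip whitespace) before caching raises the hit rate.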
2. Cost
- Embedding cost is one-time per chunk + recurring per query.
- LLM cost dominates at scale. Reduce by: shorter chunks, fewer chunks, smaller LLM for easy queries, cache.
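A back-of-envelope model makes the "fewer/shorter chunks" lever concrete. All prices below are placeholders, not real rates; substitute your provider's pricing:

```python
# PLACEHOLDER prices in $ per 1K tokens -- substitute your provider's rates.
LLM_IN_PER_1K = 0.003
LLM_OUT_PER_1K = 0.015

def monthly_llm_cost(queries: int, chunks_per_query: int, chunk_tokens: int,
                     prompt_overhead: int = 500, answer_tokens: int = 300) -> float:
    # Input tokens = fixed prompt overhead + retrieved context.
    in_tokens = prompt_overhead + chunks_per_query * chunk_tokens
    per_query = (in_tokens / 1000) * LLM_IN_PER_1K \
              + (answer_tokens / 1000) * LLM_OUT_PER_1K
    return queries * per_query

# Halving the retrieved context cuts input cost roughly linearly:
base = monthly_llm_cost(100_000, chunks_per_query=8, chunk_tokens=400)
lean = monthly_llm_cost(100_000, chunks_per_query=4, chunk_tokens=400)
```

With these placeholder rates, dropping from 8 to 4 retrieved chunks cuts the bill by about a third, before any quality evaluation.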
3. Freshness
When the corpus changes:
- Append-only: cheap to update, but stale vectors for deleted or edited documents keep matching. Periodically compact and re-index.
- Re-embed everything: expensive but clean. Schedule weekly or monthly.
- Incremental: hash each chunk's content and re-embed only chunks whose hash changed.
4. Observability
Log every query: the query text, top-k chunk IDs and scores, the final answer, and any user feedback. Without this trail you cannot tell whether a bad answer was a retrieval failure or a generation failure.