The default that isn't
Somewhere around 2024, "AI feature" became synonymous with "RAG pipeline". Every GenAI demo opens with the same 5 boxes: documents → chunks → embeddings → vector DB → LLM.
It's a great pattern. It's also the wrong default in at least four common cases. We see teams burning 4–6 weeks on infrastructure that a 200-word prompt would have solved.
Case 1: Your corpus fits in the context
If you have 30k tokens of stable docs and a model with a 128k-token window, paste the whole corpus into the prompt. Use prompt caching so the repeated prefix costs close to nothing on subsequent calls.
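A minimal sketch of the stuffed-context pattern with prompt caching, using the Anthropic Python SDK (the corpus path and model ID are placeholders; OpenAI and Google offer comparable prefix caching):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder: load the whole stable corpus once at startup (~30k tokens).
CORPUS = open("docs/handbook.md").read()

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "Answer using only the documentation below.\n\n" + CORPUS,
                # Mark the large, stable prefix as cacheable: after the first
                # call, reads of the warm cache are billed at a steep discount.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(ask("How do I rotate an API key?"))
```

No chunking, no index, no retrieval step to fail: the only moving part is the prompt.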
Cost analysis at 1000 calls/day:
- RAG pipeline: embedding API + vector DB + retrieval latency + maintenance + a 30–50% accuracy floor from chunk-boundary issues. Engineering time: ~4 weeks. Ops cost: $50–200/month.
- Stuffed context with caching: one cache write whenever the provider's prefix cache expires, ~90% cheaper reads on every call after that, and no retrieval failures. Engineering time: ~2 hours. Ops cost: $5–30/month.
The crossover happens around 50k–100k tokens. Below that, RAG is engineering theater.
Case 2: The answer is in the model's training data
Asking your RAG bot "what is a transformer?" when the LLM already knows the answer. The retrieval adds nothing. Worse: the retrieved chunks pollute the answer with formatting from your corpus, and you've added 5 boxes to the architecture for zero gain.
If a baseline call to the model with no retrieval gets it right 90% of the time, RAG is the wrong move.
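Before reaching for retrieval, it's worth measuring that baseline directly. A rough sketch, assuming a small hand-written eval set and crude keyword grading (both are hypothetical stand-ins for a real eval):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical eval set: representative user questions plus a keyword a correct
# answer should contain. A real eval would grade more carefully (exact match,
# LLM-as-judge, human review), but this is enough for a go/no-go signal.
EVAL_SET = [
    {"question": "What is a transformer?", "must_contain": "attention"},
    {"question": "What does the temperature parameter control?", "must_contain": "random"},
]

hits = 0
for case in EVAL_SET:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": case["question"]}],
    ).choices[0].message.content
    hits += case["must_contain"].lower() in reply.lower()

print(f"Baseline accuracy with no retrieval: {hits / len(EVAL_SET):.0%}")
```

If the baseline is already near your quality bar, retrieval has no headroom to buy you.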
Case 3: You need style or format, not knowledge
You want the model to always output JSON, always in your brand voice, always in a specific structure. RAG doesn't help — your corpus doesn't change the model's style. What helps:
- Native structured-output mode (OpenAI, Anthropic, and Google all support it). Eliminates 80% of JSON parsing errors. See the sketch after this list.
- Strong few-shot examples in the prompt.
- Fine-tuning for stubborn style/format gaps. Yes, fine-tuning is harder. It's also the right kind of hard for this case; RAG is the wrong kind of hard.
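A minimal sketch of the structured-output route, using the OpenAI Python SDK with a Pydantic schema (the `SupportTicket` model is a made-up example; Anthropic and Google expose equivalent mechanisms):

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Hypothetical schema: whatever shape your downstream code expects.
class SupportTicket(BaseModel):
    title: str
    severity: str
    summary: str

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract a support ticket from the user's message."},
        {"role": "user", "content": "Checkout has been down since 9am and customers are angry."},
    ],
    # The SDK turns the Pydantic model into a strict JSON schema and constrains
    # the output to it, so downstream parsing never chokes on stray prose.
    response_format=SupportTicket,
)

ticket = completion.choices[0].message.parsed  # a SupportTicket instance
print(ticket.severity, ticket.title)
```

Note there is no corpus anywhere in this picture; the problem was format, not knowledge.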
Case 4: The corpus changes faster than you can re-index
A daily-changing news feed, a constantly-edited internal doc set, a chat history. Re-embedding every chunk on every change is expensive and runs into rate limits.
Better patterns:
- Inline retrieval at query time — search → fetch raw → stuff into prompt. No persistent vector index. Slower per query, but no staleness.
- Hybrid keyword search with the LLM doing the synthesis. Postgres full-text + LLM beats RAG for highly volatile data.
- Tool use — let the model call `search_docs(query)` directly via function calling (see the sketch after this list).
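A sketch of the tool-use pattern with OpenAI function calling; `search_docs` here is a hypothetical stand-in for whatever keyword search you already run (Postgres full-text, Elasticsearch, an internal API):

```python
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model ID

def search_docs(query: str) -> str:
    # Placeholder: run your existing search at query time and return raw text.
    # No embeddings, no vector index, nothing to go stale.
    return f"[raw search results for {query!r}]"

tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the internal docs and return matching passages.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What changed in this week's release?"}]
first = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
msg = first.choices[0].message

# If the model chose to search, execute the call and hand the result back.
if msg.tool_calls:
    call = msg.tool_calls[0]
    result = search_docs(**json.loads(call.function.arguments))
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
    final = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
    msg = final.choices[0].message

print(msg.content)
```

The retrieval happens at answer time against the live source, so freshness is whatever your search backend's freshness is.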
The decision tree
We turned this into a real decision tree in [Fine-tuning lesson 1](https://nextgenailearn.com/app/lesson/ft-01):
1. Better prompt with examples and constraints — try this first.
2. Native structured output if you need format guarantees.
3. Stuff the corpus into the context if it fits and is stable.
4. Tool use / inline search for volatile data.
5. RAG for stable, large, knowledge-gap cases (genuinely the right tool a third of the time).
6. Fine-tuning for style/format/capability gaps, after RAG.
Most engineers default to step 5. The seniors I know default to step 1, escalate one rung at a time, and ship 6 weeks faster than the team that started with a vector DB.
When RAG is genuinely right
To be fair, RAG is the right tool when:
- Your corpus is large (>100k tokens) and stable (changes weekly, not hourly).
- The answer is specific to your data, not something the model could have learned before its training cutoff.
- You need citations back to source documents.
- You'll run >10k queries/day against the same corpus (amortizes infra cost).
When all four are true: build RAG, build it well, follow [our path](https://nextgenailearn.com/paths/rag-vector-dbs). When any are false: try simpler first.
The skill of being a senior AI engineer is choosing the least AI-shaped solution that solves the problem. RAG is often more AI-shaped than the problem demands.