Lesson 3 · 10 min

Token budgets per request

Every customer-facing LLM call gets a written budget: input tokens, output tokens, and maximum latency. The budget is reviewed in PR like any other resource concern.

A real budget for a summarization call

A typical customer-facing summarization call:

  • System prompt: 800 tokens (cached, $0.0008/1k input)
  • Retrieved context: max 4,000 tokens
  • User input: max 500 tokens
  • Output: max 600 tokens
  • Total budget: ~5,900 tokens, ~$0.012 per call at frontier-tier prices
  • p99 latency budget: 4 seconds
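The budget above can be written down as a small, reviewable piece of code. This is a minimal sketch; the class and field names are assumptions for illustration, not any particular library's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenBudget:
    """A written, PR-reviewable token budget for one call type."""
    system_prompt_tokens: int = 800   # cached system prompt
    max_context_tokens: int = 4_000   # retrieved context cap
    max_user_tokens: int = 500        # user input cap
    max_output_tokens: int = 600      # output cap
    p99_latency_s: float = 4.0        # p99 latency budget

    @property
    def max_total_tokens(self) -> int:
        return (self.system_prompt_tokens
                + self.max_context_tokens
                + self.max_user_tokens
                + self.max_output_tokens)

SUMMARIZE_BUDGET = TokenBudget()
print(SUMMARIZE_BUDGET.max_total_tokens)  # 5900
```

Keeping the budget in code, next to the call site, means a change to any cap shows up in diff review rather than in the next invoice.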

When the budget is violated in production (overlong retrieved context, abusive user input), the request is either truncated with a logged warning or routed to a different model.

Without budgets you don't have a feature; you have an unbounded cost surface.