Lesson 8 · 11 min
Cost-aware multimodal routing
Vision tokens are 2-5× the cost of text tokens. The routing patterns that keep multimodal features viable at production volume.
The economics
A single high-resolution image at typical 2026 prices costs the same as a 3,000-token text prompt. Multiply by call volume and the bill stops being a rounding error.
The four levers that keep multimodal economics defensible:
- Resize before sending. Most images can be downsized to 768px long-edge with no quality loss for most tasks.
- Cache stable visual context. If your prompt includes a logo, a layout reference, or a stable diagram, prefix-cache it.
- Tier your models. Cheap vision-capable model for triage, frontier for the hard 20%.
- OCR locally when text is the goal. If you only need the text from an image, run Tesseract or a small OCR model and send the text to a text-only LLM.