Lesson 7 · 10 min
Video understanding — what works in 2026
Long-form video understanding is the hardest mainstream multimodal workload. This lesson covers the pragmatic patterns (frame sampling, two-stage retrieval) that turn it into shippable features.
The cost wall
A naive approach, sending every frame of a 60-minute video to a vision model, means over 100,000 frames at 30 fps and millions of vision tokens per video. That is cost-prohibitive for almost any product.
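The arithmetic behind the wall is worth doing once. The sketch below assumes roughly 85 vision tokens per frame; real per-frame token counts vary by model and resolution, so treat the constant as an illustrative placeholder.

```python
# Back-of-envelope token cost: naive (every frame) vs. sampled (1 frame / 5 s)
# for a 60-minute video at 30 fps.
FPS = 30
MINUTES = 60
TOKENS_PER_FRAME = 85  # assumed; varies by model and frame resolution

frames = FPS * MINUTES * 60            # 108,000 frames
naive_tokens = frames * TOKENS_PER_FRAME

sampled_frames = MINUTES * 60 // 5     # 720 frames at 1 frame every 5 s
sampled_tokens = sampled_frames * TOKENS_PER_FRAME

print(frames, naive_tokens)            # 108000 9180000
print(sampled_frames, sampled_tokens)  # 720 61200
```

Even before captioning and retrieval, sampling alone cuts the token bill by 150x; the retrieval stage then cuts it again, since only a handful of the 720 sampled frames reach the frontier model per query.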
The pragmatic stack:
- Sample frames — typically 1 frame per second or 1 frame every 5-10 seconds depending on motion.
- Caption each sampled frame with a small vision model (or a frontier model in batch).
- Index the captions with timestamps in a vector store.
- Retrieve relevant frames for a user query, then send only those (with their captions and a small temporal window) to a frontier model for the answer.
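The retrieve-then-answer step can be sketched with a toy in-memory index. Everything here is illustrative: `FrameRecord`, `score`, and `retrieve` are hypothetical names, the captions are made up, and lexical word overlap stands in for the embedding similarity a real vector store would use.

```python
from dataclasses import dataclass

@dataclass
class FrameRecord:
    t: float      # timestamp in seconds
    caption: str  # caption produced upstream by a small vision model

def score(query: str, caption: str) -> int:
    # Toy lexical overlap; a real system would compare embeddings.
    return len(set(query.lower().split()) & set(caption.lower().split()))

def retrieve(index: list[FrameRecord], query: str, k: int = 1,
             window: float = 5.0) -> list[FrameRecord]:
    """Top-k frames by caption score, plus neighbors within a temporal window."""
    top = sorted(index, key=lambda r: score(query, r.caption), reverse=True)[:k]
    hit_times = {r.t for r in top}
    return [r for r in index
            if any(abs(r.t - t) <= window for t in hit_times)]

index = [
    FrameRecord(0.0,  "title slide with company logo"),
    FrameRecord(5.0,  "presenter points at a revenue chart"),
    FrameRecord(10.0, "bar chart of quarterly revenue by region"),
    FrameRecord(15.0, "audience seated in a conference hall"),
]
hits = retrieve(index, "revenue chart")
print([r.t for r in hits])  # [0.0, 5.0, 10.0] — best hit at t=5 plus ±5 s
```

Only these few frames (with their captions) are sent to the frontier model, which is what keeps per-query cost flat regardless of video length. The temporal window matters: a single frame often lacks the context to answer "what happened", so neighboring frames ride along.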
This is retrieval-augmented generation (RAG) applied to video, and it works.