Lesson 7 · 10 min

Video understanding — what works in 2026

Long-form video understanding is the hardest mainstream multimodal workload. This lesson covers the pragmatic patterns (frame sampling, two-stage retrieval) that turn it into shippable features.

The cost wall

A naive approach — send every frame of a 60-minute video to a vision model — burns hundreds of thousands of vision tokens per video, which is cost-prohibitive for almost any product.
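A quick back-of-envelope calculation makes the problem concrete. The tokens-per-frame figure below is an assumption for illustration; actual counts vary by model and frame resolution.

```python
# Naive approach: send every sampled frame of a 60-minute video.
fps_sampled = 1               # even 1 frame/sec adds up fast
duration_s = 60 * 60          # 60-minute video
tokens_per_frame = 300        # assumed figure for a low-res frame

total_tokens = fps_sampled * duration_s * tokens_per_frame
print(total_tokens)  # → 1080000 vision tokens for a single video
```

Over a million tokens for one video, before the model has answered a single question — hence the need for the stack below.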

The pragmatic stack:

  1. Sample frames — typically 1 frame per second or 1 frame every 5-10 seconds depending on motion.
  2. Caption each sampled frame with a small vision model (or a frontier model in batch).
  3. Index the captions with timestamps in a vector store.
  4. Retrieve relevant frames for a user query, then send only those (with their captions and a small temporal window) to a frontier model for the answer.

This is RAG for video — and it works.
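The pipeline above can be sketched end to end. This is a minimal toy, not a production implementation: the captions are hard-coded stand-ins for a vision model's output, and retrieval is naive word overlap standing in for a vector store. All names (`FrameCaption`, `retrieve`, `temporal_window`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FrameCaption:
    timestamp_s: float  # when in the video the frame was sampled
    caption: str        # text a small vision model would produce (stubbed here)

# Step 2-3 stand-in: captions already generated and "indexed".
INDEX = [
    FrameCaption(12.0, "a chef chops onions on a wooden board"),
    FrameCaption(95.0, "a pot of soup simmers on the stove"),
    FrameCaption(310.0, "the chef plates the finished dish"),
]

def retrieve(query: str, index: list[FrameCaption], k: int = 2) -> list[FrameCaption]:
    """Toy lexical retrieval standing in for vector search:
    score each caption by query-word overlap, return the top k."""
    words = set(query.lower().split())
    ranked = sorted(
        index,
        key=lambda fc: len(words & set(fc.caption.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def temporal_window(hit: FrameCaption, index: list[FrameCaption],
                    window_s: float = 30.0) -> list[FrameCaption]:
    """Gather captions within +/- window_s of a hit, so the answering
    model sees a small slice of temporal context, not lone frames."""
    return [fc for fc in index
            if abs(fc.timestamp_s - hit.timestamp_s) <= window_s]

hits = retrieve("when does the soup simmer", INDEX)
context = temporal_window(hits[0], INDEX)
# Step 4 in production: send `context`'s frames + captions to a
# frontier vision model for the final answer. Here we just inspect it.
print([fc.timestamp_s for fc in context])  # → [95.0]
```

The design choice worth noting: only the few retrieved frames (plus their temporal neighbors) ever reach the expensive frontier model, which is what collapses the cost from the full-video figure to a handful of frames per query.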