Lesson 2 · 11 min
Vision encoders — how images become tokens
ViT, CLIP, SigLIP, and the patch-tokenization trick. Why your image gets resized to 1568×1568 before the model ever sees it.
The patch trick
You can't feed a 12-megapixel image directly to a transformer — the attention cost is quadratic in token count. The standard trick (from the original ViT paper, 2020) is patch tokenization, sketched in code just after this list:
- Resize the image to a fixed size — typically 224×224, 384×384, or 1568×1568 for high-resolution vision.
- Cut it into non-overlapping patches (typically 14×14 or 16×16 pixels).
- Flatten each patch and project it through a learned linear layer to produce a token-shaped embedding.
- Add positional embeddings (so the model knows where each patch sits in the grid).
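In code, those four steps take only a few lines. Below is a minimal, hypothetical PyTorch sketch (the class name `PatchEmbed` and the 224/16/768 sizes are illustrative, roughly ViT-Base, not taken from any particular model); it uses the standard implementation shortcut of expressing "flatten each patch, then apply a shared linear projection" as a single strided convolution.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal ViT-style patch tokenizer (illustrative sizes, roughly ViT-Base)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is exactly
        # "cut into non-overlapping patches, flatten each, project linearly".
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learned positional embedding per patch position.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, 196, 768): one token per patch
        return x + self.pos_embed            # add position information

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 196, 768])
```

Real encoders typically also prepend a learned [CLS] token and interpolate the positional grid when the input resolution changes, but the core tokenization step is no more than this.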
A 1568×1568 image at 14×14 patches yields 112×112 = 12,544 visual tokens, already a meaningful slice of the LLM's context window (roughly 10% of a 128K-token context, for example).
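As a sanity check on that number, a tiny helper (hypothetical, assuming the image is resized to an exact multiple of the patch size) reproduces the count:

```python
def visual_token_count(height: int, width: int, patch: int = 14) -> int:
    # One token per non-overlapping patch in the resized image grid.
    return (height // patch) * (width // patch)

print(visual_token_count(224, 224))    # 256
print(visual_token_count(1568, 1568))  # 12544, the figure quoted above
```

The count grows with the square of the resize edge, which is why high-resolution modes cost so much more than the default 224- or 384-pixel paths.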