Lesson 2 · 11 min
Vision encoders — how images become tokens
ViT, CLIP, SigLIP, and the patch-tokenization trick. Why your image gets resized to 1568×1568 before the model ever sees it.
The patch trick
You can't feed a 12-megapixel image directly to a transformer — the attention cost is quadratic in token count. The standard trick (from the original ViT paper, 2020) is patch tokenization, sketched in code just after this list:
- Resize the image to a fixed size — typically 224×224, 384×384, or 1568×1568 for high-resolution vision.
- Cut it into non-overlapping patches (typically 14×14 or 16×16 pixels).
- Flatten each patch and project it through a learned linear layer to produce a token-shaped embedding.
- Add positional embeddings (so the model knows where each patch sits in the grid).
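In code, those four steps take only a few lines. Below is a minimal, hypothetical PyTorch sketch (the class name `PatchEmbed` and the 224/16/768 sizes are illustrative, roughly ViT-Base, not taken from any particular model); it uses the standard implementation shortcut of expressing "flatten each patch, then apply a shared linear projection" as a single strided convolution.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal ViT-style patch tokenizer (illustrative sizes, roughly ViT-Base)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is exactly
        # "cut into non-overlapping patches, flatten each, project linearly".
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learned positional embedding per patch position.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, 196, 768): one token per patch
        return x + self.pos_embed            # add position information

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 196, 768])
```

Real encoders typically also prepend a learned [CLS] token and interpolate the positional grid when the input resolution changes, but the core tokenization step is no more than this.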
A 1568×1568 image at 14×14 patches yields 112×112 = 12,544 visual tokens, already a meaningful slice of the LLM's context window (roughly 10% of a 128K-token context, for example).
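As a sanity check on that number, a tiny helper (hypothetical, assuming the image is resized to an exact multiple of the patch size) reproduces the count:

```python
def visual_token_count(height: int, width: int, patch: int = 14) -> int:
    # One token per non-overlapping patch in the resized image grid.
    return (height // patch) * (width // patch)

print(visual_token_count(224, 224))    # 256
print(visual_token_count(1568, 1568))  # 12544, the figure quoted above
```

The count grows with the square of the resize edge, which is why high-resolution modes cost so much more than the default 224- or 384-pixel paths.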