Lesson 1 · 9 min
What multimodal AI actually is
A modality is just a kind of input. Modern frontier models are multimodal because they accept tokens that came from images, audio, and video — alongside text — in the same context window.
The reframe
'Multimodal' sounds exotic. It isn't. The trick is that modern frontier models accept tokens that came from a modality encoder — a vision transformer for images, a Whisper-style encoder for audio — alongside the regular text tokens.
From the LLM backbone's perspective, an image is a sequence of vector embeddings just like text. The encoder did the conversion upstream.