Lesson 1 · 9 min

What multimodal AI actually is

A modality is just a kind of input. Modern frontier models are multimodal because they accept tokens that came from images, audio, and video — alongside text — in the same context window.

The reframe

'Multimodal' sounds exotic. It isn't. The trick is that modern frontier models accept tokens that came from a modality encoder — a vision transformer for images, a Whisper-style encoder for audio — alongside the regular text tokens.

From the LLM backbone's perspective, an image is a sequence of vector embeddings just like text. The encoder did the conversion upstream.