Lesson 1 · 9 min

Tokens — what models actually see

Models do not read characters or words. They read tokens. This one reframe explains a lot of weird behavior.

Not characters. Not words. Tokens.

When you send "unbelievable performance" to an LLM, it doesn't see those 24 characters. It sees a sequence of integer token IDs, each of which looks up a row in the model's embedding table.

Tokens are usually:

  • Whole common words like performance (note the leading space — that's part of the token).
  • Subwords for rarer or compound words: un + believ + able.
  • Single bytes for emoji, rare characters, or anything outside the vocabulary.
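All three behaviors above can be sketched with a toy greedy longest-match tokenizer. This is an illustration only: real tokenizers (BPE, unigram) learn vocabularies of tens of thousands of entries, and `TOY_VOCAB` here is a made-up four-entry vocabulary. The exact split un + believ + able depends on the model's tokenizer and is assumed from the lesson text.

```python
# Toy vocabulary: IDs >= 256 so they never collide with raw-byte IDs (0-255).
TOY_VOCAB = {
    " performance": 1001,               # whole common word, leading space included
    "un": 300, "believ": 301, "able": 302,  # subword pieces
}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match over TOY_VOCAB; anything unmatched falls back to raw UTF-8 bytes."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[text[i:j]])
                i = j
                break
        else:
            # Byte fallback: emoji and rare characters become one ID per byte.
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids

print(tokenize("unbelievable performance"))  # → [300, 301, 302, 1001]
print(tokenize("🙂"))                         # four byte-level IDs, one per UTF-8 byte
```

Note how " performance" matches as a single token, space and all, while "unbelievable" breaks into three subword IDs — the model never sees the word as one unit.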