Lesson 6 · 10 min
Sampling — how the next token gets picked
The model outputs a distribution. Picking from it is a separate (and tunable) step.
The output isn't a token — it's a distribution
After all the attention and FFN layers, the final layer projects the hidden state to vocabulary-size logits — one score per token. Softmax over those logits turns them into a probability distribution over every possible next token.
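A minimal sketch of that last step, using a made-up 4-token vocabulary and arbitrary logits (the max-subtraction is a standard trick for numerical stability):

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating so exp() can't overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

print(probs)       # four probabilities, highest logit → highest probability
print(sum(probs))  # sums to 1.0 (up to floating-point error)
```

Note that softmax preserves the ordering of the logits: the token with the biggest logit always gets the biggest probability.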
Which one do we actually pick? That's sampling, and it's where you have knobs.
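To make the knobs concrete, here's a sketch of two common choices: greedy decoding (always take the argmax) versus temperature sampling, where the logits are divided by a temperature before softmax — below 1 sharpens the distribution, above 1 flattens it. The vocabulary and logits are toy values, not from a real model:

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Temperature rescales logits before softmax: <1 sharpens, >1 flattens.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    # Draw one token index according to the distribution (inverse CDF).
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1  # guard against floating-point rounding

logits = [2.0, 1.0, 0.1, -1.0]  # toy logits for a 4-token vocabulary
rng = random.Random(0)

# Greedy = argmax; equivalent to the temperature → 0 limit.
greedy = max(range(len(logits)), key=lambda i: logits[i])

# Temperature sampling: sometimes picks a lower-probability token.
sampled = sample(softmax(logits, temperature=0.8), rng)
```

Greedy is deterministic and tends toward repetitive text; raising the temperature trades determinism for diversity, which is why it's usually the first knob people reach for.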