Lesson 3 · 12 min

Attention — the trick that made LLMs work

For every token at every layer, the model looks back at every earlier token and decides what to focus on. That's attention.

The intuition

Earlier sequence models (RNNs, LSTMs) processed text one token at a time, smearing the past into a single hidden state. They were bad at long-range dependencies — by the time the model got to the end of a long sentence, it had forgotten the start.

Attention is the trick that fixed this: at every step, the model can look at every previous token and decide how much each one matters.
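To make that concrete, here's a minimal sketch of scaled dot-product attention with a causal mask, in plain NumPy. Real models add learned query/key/value projections and multiple heads, and run this inside every layer; this just shows the core lookup.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """q, k, v: (seq_len, d) arrays. Returns (seq_len, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # how strongly each token matches each other token
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                        # a token may only look at itself and earlier tokens
    weights = softmax(scores, axis=-1)            # each row sums to 1: that token's focus distribution
    return weights @ v                            # blend the value vectors by those weights

# Toy example: 4 tokens, 8-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = causal_attention(x, x, x)                   # self-attention: q, k, v all come from x
print(out.shape)                                  # (4, 8)
```

Each output row is a weighted mix of the value vectors of the tokens before it, with the weights computed fresh from the current context. That per-step, whole-history lookup is exactly what the single hidden state of an RNN couldn't give you.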