Lesson 4 · 10 min
Inside a transformer block
Attention is one piece. The transformer block stacks it with norms, residuals, and a feed-forward layer.
The four ingredients
A transformer is N blocks with identical structure (but separate weights) stacked on top of each other. Each block has four moving parts, sketched in code after this list:
- Layer normalization — keeps activations from drifting in scale.
- Multi-head attention — what we just covered.
- Residual connections — each sub-layer's input is added back to its output, around both attention and the feed-forward layer. This is what lets very deep networks train.
- Feed-forward (MLP) — a small two-layer neural network applied independently per token. This is where most of the model's parameters actually live.
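Here is a minimal sketch of one block in PyTorch, assuming a pre-norm arrangement (normalize before each sub-layer) and made-up sizes like d_model=512 and 8 heads; real models vary in both the ordering and the dimensions.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Layer norms: keep activations at a stable scale before each sub-layer.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Multi-head attention: lets tokens exchange information.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed-forward (MLP): a two-layer network applied to each token independently.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Residual connections: each sub-layer's output is *added* to its input.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x

# Stack N structurally identical blocks (here N=6, an arbitrary choice).
blocks = nn.Sequential(*[TransformerBlock() for _ in range(6)])
tokens = torch.randn(1, 16, 512)   # (batch, sequence length, d_model)
print(blocks(tokens).shape)        # torch.Size([1, 16, 512])
```

Note that the input and output shapes match, which is exactly what makes the blocks stackable: each one reads a sequence of d_model-sized vectors and writes a refined sequence of the same shape.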