Lesson 4 · 10 min

Inside a transformer block

Attention is one piece. The transformer block stacks it with norms, residuals, and a feed-forward layer.

The four ingredients

A transformer is N structurally identical blocks stacked on top of each other (each block has the same shape but its own weights). Each block has four moving parts:

  1. Layer normalization — keeps activations from drifting in scale.
  2. Multi-head attention — what we just covered.
  3. Residual connections — each sub-layer's input is added back to its output, after attention and again after the feed-forward layer. This is what lets very deep networks train.
  4. Feed-forward (MLP) — a small two-layer neural network applied independently per token. This is where most of the model's parameters actually live.
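The four ingredients above can be wired together in a few lines. This is a minimal NumPy sketch, not any particular library's implementation: it uses a single attention head instead of multi-head, a pre-norm arrangement (normalize, transform, then add the residual), and illustrative parameter names like `Wq` and `W1` that are not from the lesson.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Ingredient 1: normalize each token's features to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x, Wq, Wk, Wv):
    # Ingredient 2, simplified to one head: multi-head attention runs
    # several of these in parallel on slices of the feature dimension.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # softmax over tokens
    return weights @ v

def mlp(x, W1, W2):
    # Ingredient 4: a two-layer feed-forward net applied to each token
    # independently (the same weights for every row of x).
    return np.maximum(0, x @ W1) @ W2               # ReLU in between

def block(x, params):
    # Ingredient 3 is the `x +` on each line: the residual connection.
    x = x + attention(layer_norm(x), params["Wq"], params["Wk"], params["Wv"])
    x = x + mlp(layer_norm(x), params["W1"], params["W2"])
    return x
```

Note that the block maps a `(tokens, features)` array to another array of the same shape, which is exactly what lets N of them be stacked: the output of one block is a valid input to the next.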