Lesson 4 · 10 min
Inside a transformer block
Attention is one piece. The transformer block stacks it with norms, residuals, and a feed-forward layer.
The four ingredients
A transformer is N blocks with identical structure (but separate weights) stacked on top of each other. Each block has four moving parts, sketched in code after this list:
- Layer normalization — keeps activations from drifting in scale.
- Multi-head attention — what we just covered.
- Residual connections — each sub-layer's input is added back to its output, around both attention and the feed-forward layer. This is what lets very deep networks train.
- Feed-forward (MLP) — a small two-layer neural network applied independently per token. This is where most of the model's parameters actually live.
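Here is a minimal sketch of one block in PyTorch, assuming a pre-norm arrangement (normalize before each sub-layer) and made-up sizes like d_model=512 and 8 heads; real models vary in both the ordering and the dimensions.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Layer norms: keep activations at a stable scale before each sub-layer.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Multi-head attention: lets tokens exchange information.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Feed-forward (MLP): a two-layer network applied to each token independently.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Residual connections: each sub-layer's output is *added* to its input.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x

# Stack N structurally identical blocks (here N=6, an arbitrary choice).
blocks = nn.Sequential(*[TransformerBlock() for _ in range(6)])
tokens = torch.randn(1, 16, 512)   # (batch, sequence length, d_model)
print(blocks(tokens).shape)        # torch.Size([1, 16, 512])
```

Note that the input and output shapes match, which is exactly what makes the blocks stackable: each one reads a sequence of d_model-sized vectors and writes a refined sequence of the same shape.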