Lesson 7 · 9 min
Loss functions — picking the right one
Different problems need different losses. Here are the three you'll meet 90% of the time.
The big three
1. Cross-entropy — for classification (and language modeling)
L = -Σ y_true · log(y_pred)
Used when the model outputs a probability distribution over discrete classes (e.g. "is this token A, B, or C?"). All language modeling uses cross-entropy.
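A minimal NumPy sketch of the formula above (NumPy and the example numbers are illustrative, not from the lesson). With a one-hot target, cross-entropy reduces to the negative log of the probability the model assigned to the correct class:

```python
import numpy as np

# One-hot true distribution over 3 classes, and the model's predicted probabilities.
y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.7, 0.2])

# Cross-entropy: L = -Σ y_true · log(y_pred).
# Only the correct class contributes, so L = -log(0.7) ≈ 0.357.
loss = -np.sum(y_true * np.log(y_pred))
```

Note that the loss only goes to zero when the model puts probability 1.0 on the correct class; a confident wrong prediction is punished heavily because -log(p) blows up as p → 0.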
2. Mean Squared Error (MSE) — for regression
L = mean((y_true - y_pred)²)
Used when predicting a continuous number (price, temperature, count). The gradient is smooth and well-behaved.
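The same idea in NumPy (an illustrative sketch; the sample values are made up): square each residual so errors in both directions count, then average.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # actual values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # model's predictions

# MSE: mean of the squared residuals.
# (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
loss = np.mean((y_true - y_pred) ** 2)
```

Squaring also means large errors dominate the loss, which is why MSE is sensitive to outliers.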
3. Binary Cross-entropy — for binary classification
L = -[y · log(p) + (1-y) · log(1-p)]
Used when the model outputs a single probability (spam or not, fraud or not).
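A NumPy sketch of binary cross-entropy averaged over a batch (illustrative labels and probabilities; the epsilon clamp is a common practical addition, not part of the formula above):

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])  # true labels
p = np.array([0.9, 0.2, 0.8])  # predicted probabilities of the positive class

# Clamp p away from exactly 0 or 1 so log() never sees 0.
eps = 1e-12
p = np.clip(p, eps, 1 - eps)

# BCE: the y term fires for positive examples, the (1-y) term for negatives.
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

For a single example this is exactly cross-entropy over two classes, so in frameworks the two losses are often interchangeable for binary problems.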