Lesson 7 · 9 min

Loss functions — picking the right one

Different problems need different losses. Here are the three you'll meet 90% of the time.

The big three

1. Cross-entropy — for classification (and language modeling)

L = -Σ y_true · log(y_pred)

Used when the model outputs a probability distribution over discrete classes (e.g. "is this token A, B, or C?"). Standard language-model training uses cross-entropy: the model predicts a distribution over the vocabulary for each next token.
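A minimal NumPy sketch of the formula above, assuming one-hot targets and probabilities that already sum to 1 (e.g. from a softmax). The function name and example values are illustrative, not from any particular library:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum(y_true * log(y_pred)) for one example."""
    y_pred = np.clip(y_pred, eps, 1.0)  # guard against log(0)
    return -np.sum(y_true * np.log(y_pred))

# 3-class example: the true class is index 1
y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.7, 0.2])
loss = cross_entropy(y_true, y_pred)  # -log(0.7) ≈ 0.357
```

With a one-hot target, only the log-probability of the correct class survives the sum, so the loss is just `-log(p_correct)`: confident right answers cost near zero, confident wrong answers cost a lot.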

2. Mean Squared Error (MSE) — for regression

L = mean((y_true - y_pred)²)

Used when predicting a continuous number (price, temperature, count). Gradient is smooth and well-behaved.
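The same idea in NumPy, with made-up numbers for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    """L = mean((y_true - y_pred)^2) over a batch of predictions."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 3.0])
loss = mse(y_true, y_pred)  # (0.25 + 0 + 1) / 3 ≈ 0.417
```

Note that squaring penalizes large errors disproportionately: one prediction that is off by 2 costs as much as four predictions that are each off by 1.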

3. Binary Cross-entropy — for binary classification

L = -[y · log(p) + (1-y) · log(1-p)]

Used when the model outputs a single probability (spam or not, fraud or not).
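A sketch of the binary case, assuming the model's output `p` is already a probability (e.g. from a sigmoid). Names and values are illustrative:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """L = -[y*log(p) + (1-y)*log(1-p)] for a single label y in {0, 1}."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

loss = binary_cross_entropy(1.0, 0.9)  # -log(0.9) ≈ 0.105
```

This is just cross-entropy with two classes collapsed into one number: when `y = 1` only the `log(p)` term fires, and when `y = 0` only the `log(1-p)` term does.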