Lesson 5 · 9 min

Hyperparameters that actually matter

Most hyperparameters don't matter much. A few do — a lot.

The short list

For LoRA SFT, the hyperparameters that move outcomes most:

  1. Learning rate — biggest single lever. LoRA-SFT typically uses 1e-4 to 5e-4 (10-50× higher than full fine-tuning). Default: 2e-4.
  2. Number of epochs — 1-3 is standard. More than 3 on a small dataset overfits hard.
  3. Effective batch size (per_device × grad_accum) — 16-64 is typical. Larger batches give stabler gradient estimates but fewer optimizer steps per epoch, so training can take longer to converge in wall-clock terms.
  4. LoRA rank `r` — 8-16 covers most cases. Higher only helps for very large datasets.
  5. Warmup ratio — 0.03-0.1. Skip it and the loss can spike at the start of training.
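The arithmetic connecting items 2, 3, and 5 can be sketched in a few lines of plain Python (the `schedule` helper below is illustrative, not part of any training library):

```python
import math

def schedule(dataset_size, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Derive the effective batch size, total optimizer steps, and
    warmup steps from the short-list hyperparameters above."""
    effective_batch = per_device_batch * grad_accum          # item 3
    steps_per_epoch = math.ceil(dataset_size / effective_batch)
    total_steps = steps_per_epoch * epochs                   # item 2
    warmup_steps = int(total_steps * warmup_ratio)           # item 5
    return effective_batch, total_steps, warmup_steps

# e.g. 10k examples, batch 4 with grad_accum 8, 2 epochs, 3% warmup
print(schedule(10_000, 4, 8, 2, 0.03))  # → (32, 626, 18)
```

Frameworks like Hugging Face Transformers accept `warmup_ratio` directly and compute the warmup steps for you; running the numbers yourself is still useful for sanity-checking that a small dataset actually gets more than a handful of optimizer steps.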