Lesson 5 · 9 min
Hyperparameters that actually matter
Most hyperparameters don't matter much. A few do — a lot.
The short list
For LoRA SFT, the hyperparameters that move outcomes most (all wired together in the config sketch after this list):
- Learning rate — the biggest single lever. LoRA SFT typically uses 1e-4 to 5e-4, roughly 10-50× higher than full fine-tuning. Default: 2e-4.
- Number of epochs — 1-3 is standard. More than 3 on a small dataset overfits hard.
- Effective batch size (per_device × grad_accum) — 16-64 is typical. Bigger means stabler gradients but fewer optimizer steps per epoch, so convergence slows at a fixed epoch budget.
- LoRA rank `r` — 8-16 covers most cases. Higher only helps for very large datasets.
- Warmup ratio — 0.03-0.1. Skip it and the loss can spike at the start.
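To make these concrete, here is a minimal sketch wiring every item above into one training run. It assumes the Hugging Face TRL + PEFT stack (one common choice, not the only one); the model name, dataset name, and `output_dir` are placeholders, and the `lora_alpha` value follows the common alpha = 2×r convention rather than anything from the list.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("your/sft-dataset", split="train")  # placeholder dataset

peft_config = LoraConfig(
    r=16,                            # rank: 8-16 covers most cases
    lora_alpha=32,                   # common convention: alpha = 2 * r
    lora_dropout=0.05,
    target_modules="all-linear",     # adapt every linear layer
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="lora-sft-out",          # placeholder
    learning_rate=2e-4,                 # biggest lever; 1e-4 to 5e-4 range
    num_train_epochs=2,                 # 1-3; more overfits small datasets
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,      # effective batch: 4 * 8 = 32 per device
    warmup_ratio=0.03,                  # 0.03-0.1 avoids the early loss spike
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",    # placeholder base model
    train_dataset=dataset,
    args=args,
    peft_config=peft_config,
)
trainer.train()
```

Note how the effective batch size falls out of two knobs: per_device_train_batch_size × gradient_accumulation_steps (4 × 8 = 32 here), multiplied again by the number of GPUs if you train on more than one.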