Lesson 5 · 9 min

Hyperparameters that actually matter

Most hyperparameters don't matter much. A few do — a lot.

The short list

For LoRA SFT, the hyperparameters that move outcomes most:

  1. Learning rate — biggest single lever. LoRA-SFT typically uses 1e-4 to 5e-4 (10-50× higher than full fine-tuning). Default: 2e-4.
  2. Number of epochs — 1-3 is standard. More than 3 on a small dataset overfits hard.
  3. Effective batch size (per_device × grad_accum) — 16-64 is typical. Larger batches give stabler gradient estimates but fewer optimizer steps per epoch, so training can take longer to converge in wall-clock terms.
  4. LoRA rank `r` — 8-16 covers most cases. Higher only helps for very large datasets.
  5. Warmup ratio — 0.03-0.1. Skip it and the loss can spike at the start of training.
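The arithmetic connecting items 2, 3, and 5 can be sketched in a few lines of plain Python (the `schedule` helper below is illustrative, not part of any training library):

```python
import math

def schedule(dataset_size, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Derive the effective batch size, total optimizer steps, and
    warmup steps from the short-list hyperparameters above."""
    effective_batch = per_device_batch * grad_accum          # item 3
    steps_per_epoch = math.ceil(dataset_size / effective_batch)
    total_steps = steps_per_epoch * epochs                   # item 2
    warmup_steps = int(total_steps * warmup_ratio)           # item 5
    return effective_batch, total_steps, warmup_steps

# e.g. 10k examples, batch 4 with grad_accum 8, 2 epochs, 3% warmup
print(schedule(10_000, 4, 8, 2, 0.03))  # → (32, 626, 18)
```

Frameworks like Hugging Face Transformers accept `warmup_ratio` directly and compute the warmup steps for you; running the numbers yourself is still useful for sanity-checking that a small dataset actually gets more than a handful of optimizer steps.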