Lesson 3 · 12 min

Building a training dataset

Bad data destroys good models. Good data is half the work — and most of where you should spend time.

The 90/10 rule

90% of fine-tune outcomes come from the dataset, not the hyperparameters. Beginners spend their time tuning learning rate; experts spend their time curating data.

A fine-tune dataset for supervised fine-tuning (SFT) is a list of (input, expected_output) pairs. Common formats:

{"messages": [
  {"role": "system", "content": "You are a tax assistant."},
  {"role": "user", "content": "Can I deduct my home office?"},
  {"role": "assistant", "content": "Yes, if it is exclusively used for business..."}
]}