Lesson 4 · 10 min
Augmenting rare classes and edge cases
Real data is skewed — common cases dominate and rare failures are underrepresented. Targeted synthetic augmentation fills the gaps that matter most for production reliability.
The long tail problem
In most real-world classification or extraction tasks, the data distribution is long-tailed and often roughly follows a power law. As a rule of thumb:
- About 80% of examples cover about 20% of categories
- The rare cases (the remaining 20% of examples spread across 80% of categories) are often the ones that fail in production
A model trained on this distribution learns the common cases well and fails on rare but important ones. Synthetic augmentation lets you deliberately balance the distribution.
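Before balancing anything, it helps to check how skewed your own labels are. The following is a minimal sketch, not a prescribed method: the function name `head_coverage` and the sample intent labels are hypothetical, and it simply reports what share of examples falls in the most frequent 20% of categories.

```python
from collections import Counter


def head_coverage(labels, head_fraction=0.2):
    """Return the share of examples in the most frequent `head_fraction` of categories."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    # Category sizes, largest first.
    ranked = sorted(counts.values(), reverse=True)
    # Number of categories treated as the "head" (at least one).
    n_head = max(1, round(head_fraction * len(ranked)))
    return sum(ranked[:n_head]) / sum(ranked)


# Hypothetical dataset: 2 common intents and 8 rare ones.
labels = (
    ["billing"] * 500
    + ["login"] * 300
    + [f"rare_{i}" for i in range(8) for _ in range(25)]
)
print(f"Top 20% of categories cover {head_coverage(labels):.0%} of examples")
```

A value near 0.8 matches the rule of thumb above. A much higher value suggests an even heavier tail and a stronger case for targeted augmentation.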
Common augmentation targets:
- Rare categories — intent classes with < 50 real examples
- Adversarial inputs — inputs designed to trigger failure modes
- Edge cases — boundary conditions, ambiguous cases, multi-label examples
- Domain shifts — slightly different register, terminology, or format than the training distribution
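For the rare-category target, one simple way to decide how much to generate is to top each class up to a minimum count. The sketch below assumes the 50-example threshold mentioned above. The function name `augmentation_plan` and the intent names are hypothetical, and topping up to a fixed floor is only one possible policy. It plans quantities only; it does not generate the synthetic examples.

```python
from collections import Counter


def augmentation_plan(labels, known_classes, min_examples=50):
    """Map each under-represented class to the number of synthetic examples it needs.

    `known_classes` lists every class the model must handle, so classes with
    zero real examples are still included in the plan.
    """
    counts = Counter(labels)
    plan = {}
    for cls in known_classes:
        deficit = min_examples - counts.get(cls, 0)
        if deficit > 0:
            plan[cls] = deficit
    return plan


# Hypothetical intent dataset; "legal_threat" has no real examples yet.
labels = ["billing"] * 400 + ["cancel"] * 60 + ["refund_dispute"] * 12
known_classes = ["billing", "cancel", "refund_dispute", "legal_threat"]

plan = augmentation_plan(labels, known_classes)
for cls, n in plan.items():
    print(f"{cls}: generate {n} synthetic examples")
```

Passing the full class list separately matters because a class that never appears in the real data would otherwise be invisible to a count over the labels alone.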