Lesson 4 · 9 min
Output filtering and harm classification
Don't just trust the model. Three layers sit between the model and the user: the provider's harm filter, your own domain-specific filter, and a final post-processing pass.
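As a sketch, the three layers can be composed into a single gate. Everything here is illustrative: the provider check is a stand-in for a real moderation call, and the rules in each layer are hypothetical.

```python
from typing import Callable, Optional

# Each layer returns a rejection reason, or None to pass the output through.
Filter = Callable[[str], Optional[str]]

def provider_harm_filter(text: str) -> Optional[str]:
    # Stand-in for a real moderation-endpoint call (hypothetical logic).
    return "provider: harm" if "how to build a bomb" in text.lower() else None

def domain_filter(text: str) -> Optional[str]:
    # Hypothetical domain rule: a fitness app shouldn't give medical advice.
    return "domain: medical advice" if "diagnosis" in text.lower() else None

def post_process(text: str) -> Optional[str]:
    # Final pass: catch policy violations, e.g. a competitor mention.
    return "policy: competitor" if "AcmeRival" in text else None  # AcmeRival is made up

def gate(output: str, layers: list[Filter]) -> tuple[bool, Optional[str]]:
    """Run model output through each layer; block on the first rejection."""
    for layer in layers:
        reason = layer(output)
        if reason is not None:
            return False, reason
    return True, None
```

A call like `gate(model_output, [provider_harm_filter, domain_filter, post_process])` returns `(True, None)` for clean output and `(False, reason)` for anything a layer rejects, so the caller can log the reason and substitute a fallback response.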
What providers give you
Anthropic, OpenAI, Google, and AWS (Bedrock) all ship harm-classification systems, either built into the models themselves via refusal training or exposed as separate moderation endpoints. Coverage typically includes:
- Violence and weapons
- Sexual content involving minors (universally blocked)
- Self-harm content
- Hate speech and harassment
- Illegal acts
- Privacy violations (PII, doxxing)
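Moderation endpoints typically return a per-category breakdown alongside an overall flag. The shape below is loosely modeled on OpenAI's moderation endpoint; field names vary by provider, so treat the structure as an assumption:

```python
# Example response shape, loosely modeled on OpenAI's moderation endpoint.
# Field and category names vary by provider; treat this as an assumption.
sample_response = {
    "results": [{
        "flagged": True,
        "categories": {
            "violence": True,
            "sexual/minors": False,
            "self-harm": False,
            "hate": False,
            "harassment": False,
        },
    }]
}

def blocked_categories(response: dict) -> list[str]:
    """Return the category names the provider flagged for this output."""
    result = response["results"][0]
    if not result["flagged"]:
        return []
    return [name for name, hit in result["categories"].items() if hit]
```

Gating on the category breakdown rather than the top-level flag lets you apply different policies per category, e.g. block some outright and route others to human review.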
What they don't filter:
- Domain-specific harms (e.g. medical advice from a fitness app, or financial advice from tax software)
- Brand-damaging output (your model recommending a competitor)
- Output that's technically legal but violates your policy
- Subtle bias in tone or framing
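Those gaps are what your own filtering layer has to cover. A minimal sketch of a domain filter for the examples above; the patterns and the competitor name are illustrative, not a complete rule set:

```python
import re

# Illustrative rules only: a production filter would use a trained
# classifier or a much larger pattern set tuned to your domain and policy.
DOMAIN_RULES = [
    ("medical-advice", re.compile(r"\b(diagnos\w+|prescri\w+|dosage)\b", re.I)),
    ("financial-advice", re.compile(r"\b(buy|sell)\s+(stocks?|shares)\b", re.I)),
    ("competitor-mention", re.compile(r"\bAcmeRival\b")),  # hypothetical competitor
]

def domain_violations(output: str) -> list[str]:
    """Return the name of every domain rule the model output trips."""
    return [name for name, pattern in DOMAIN_RULES if pattern.search(output)]
```

Regex rules like these are cheap and auditable, but they miss paraphrases; for subtle cases (bias in tone or framing), a second classification pass with a small model is the more common approach.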