Lesson 4 · 9 min

Output filtering and harm classification

Don't just trust the model. Between the model and the user sit three layers: the provider's harm filter, your domain-specific filter, and a final post-processing pass.

What providers give you

Anthropic, OpenAI, Google, and AWS Bedrock all ship harm-classification systems, either built into their models via refusal training or exposed as separate moderation endpoints. Coverage typically includes:

  • Violence and weapons
  • Sexual content involving minors (universally blocked)
  • Self-harm content
  • Hate speech and harassment
  • Illegal acts
  • Privacy violations (PII, doxxing)
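As a sketch of how the provider layer might be consumed in your own code, assume a moderation response shaped as a map of category names to scores (the category names, thresholds, and function name below are illustrative assumptions, not any provider's actual API):

```python
# Sketch: gating a model output on provider moderation scores.
# Category names and thresholds are illustrative assumptions; real
# moderation endpoints define their own taxonomy and score semantics.

HARD_BLOCK = {"sexual/minors"}   # universally blocked, no threshold
THRESHOLDS = {                   # hypothetical per-category cut-offs
    "violence": 0.8,
    "self-harm": 0.5,
    "hate": 0.7,
    "harassment": 0.7,
}

def provider_verdict(category_scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (allowed, flagged_categories) for one model output."""
    flagged = []
    for category, score in category_scores.items():
        if category in HARD_BLOCK and score > 0:
            flagged.append(category)
        elif score >= THRESHOLDS.get(category, 1.0):
            flagged.append(category)
    return (not flagged, flagged)
```

The hard-block set bypasses thresholds entirely, matching the "universally blocked" behavior described above; everything else is a tunable trade-off between false positives and missed harms.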

What they don't filter:

  • Domain-specific harms (e.g., a fitness app giving medical advice, tax software giving financial advice)
  • Brand-damaging output (your model recommending a competitor)
  • Output that's technically legal but violates your policy
  • Subtle bias in tone or framing
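Closing these gaps doesn't have to start with a trained classifier. A minimal domain filter can be pattern-based; the rules below (patterns, category names, the competitor name) are hypothetical examples for a fitness app, not a recommended rule set:

```python
import re

# Hypothetical domain rules for a fitness app: regex -> policy category.
# Pattern-based filtering is a crude first layer; patterns here are examples.
DOMAIN_RULES = {
    r"\b(diagnos\w+|prescri\w+|dosage)\b": "medical-advice",
    r"\b(deduct\w+|tax return)\b": "financial-advice",
    r"\bAcmeFit\b": "competitor-mention",  # hypothetical competitor name
}

def domain_flags(text: str) -> list[str]:
    """Return the policy categories a model output trips, if any."""
    return sorted(
        {category for pattern, category in DOMAIN_RULES.items()
         if re.search(pattern, text, flags=re.IGNORECASE)}
    )
```

Outputs that trip a category can be blocked, rewritten, or routed to review; the point is that this layer encodes *your* policy, which no provider filter will ever know about.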