Lesson 4 · 9 min

Output filtering and harm classification

Don't just trust the model. Between the model and the user sit three layers: the provider's harm filter, your domain-specific filter, and a final post-processing pass.

What providers give you

Anthropic, OpenAI, Google, and AWS Bedrock all ship harm-classification systems, either built into their models via refusal training or exposed as separate moderation endpoints. Coverage typically includes:

  • Violence and weapons
  • Sexual content involving minors (universally blocked)
  • Self-harm content
  • Hate speech and harassment
  • Illegal acts
  • Privacy violations (PII, doxxing)
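As a sketch of how the provider layer might be consumed in your own code, assume a moderation response shaped as a map of category names to scores (the category names, thresholds, and function name below are illustrative assumptions, not any provider's actual API):

```python
# Sketch: gating a model output on provider moderation scores.
# Category names and thresholds are illustrative assumptions; real
# moderation endpoints define their own taxonomy and score semantics.

HARD_BLOCK = {"sexual/minors"}   # universally blocked, no threshold
THRESHOLDS = {                   # hypothetical per-category cut-offs
    "violence": 0.8,
    "self-harm": 0.5,
    "hate": 0.7,
    "harassment": 0.7,
}

def provider_verdict(category_scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (allowed, flagged_categories) for one model output."""
    flagged = []
    for category, score in category_scores.items():
        if category in HARD_BLOCK and score > 0:
            flagged.append(category)
        elif score >= THRESHOLDS.get(category, 1.0):
            flagged.append(category)
    return (not flagged, flagged)
```

The hard-block set bypasses thresholds entirely, matching the "universally blocked" behavior described above; everything else is a tunable trade-off between false positives and missed harms.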

What they don't filter:

  • Domain-specific harms (e.g., a fitness app giving medical advice, tax software giving financial advice)
  • Brand-damaging output (your model recommending a competitor)
  • Output that's technically legal but violates your policy
  • Subtle bias in tone or framing
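Closing these gaps doesn't have to start with a trained classifier. A minimal domain filter can be pattern-based; the rules below (patterns, category names, the competitor name) are hypothetical examples for a fitness app, not a recommended rule set:

```python
import re

# Hypothetical domain rules for a fitness app: regex -> policy category.
# Pattern-based filtering is a crude first layer; patterns here are examples.
DOMAIN_RULES = {
    r"\b(diagnos\w+|prescri\w+|dosage)\b": "medical-advice",
    r"\b(deduct\w+|tax return)\b": "financial-advice",
    r"\bAcmeFit\b": "competitor-mention",  # hypothetical competitor name
}

def domain_flags(text: str) -> list[str]:
    """Return the policy categories a model output trips, if any."""
    return sorted(
        {category for pattern, category in DOMAIN_RULES.items()
         if re.search(pattern, text, flags=re.IGNORECASE)}
    )
```

Outputs that trip a category can be blocked, rewritten, or routed to review; the point is that this layer encodes *your* policy, which no provider filter will ever know about.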