Skip to main content

Lesson 3 · 10 min

Building a red-team eval set

You can't defend against attacks you don't test. The minimum-viable red-team eval set: 30 cases mixing direct + indirect injection, jailbreaks, PII extraction, and refusal bypasses.

What goes in the set

30-50 cases is enough to start. Mix:

  • 10 direct-injection prompts drawn from public attack libraries (PromptInject, Garak, Anthropic's red-team taxonomy).
  • 10 indirect-injection cases — synthetic webpages or documents with hidden instructions, fed through your real retrieval pipeline.
  • 5 jailbreak prompts (DAN-style, role-play bypasses, hypothetical framings).
  • 5 PII extraction attempts (asking for system prompt, asking the model to repeat training data, asking it to dump tool definitions).
  • 5 refusal bypass attempts (variations the model should refuse but might comply with after creative reframing).