Lesson 3 · 10 min
Building a red-team eval set
You can't defend against attacks you don't test. A minimum-viable red-team eval set is roughly 35 cases mixing direct and indirect injection, jailbreaks, PII extraction, and refusal bypasses.
What goes in the set
30-50 cases are enough to start. Mix:
- 10 direct-injection prompts drawn from public attack libraries (PromptInject, Garak, Anthropic's red-team taxonomy).
- 10 indirect-injection cases — synthetic webpages or documents with hidden instructions, fed through your real retrieval pipeline.
- 5 jailbreak prompts (DAN-style, role-play bypasses, hypothetical framings).
- 5 PII and secret-extraction attempts (asking for the system prompt, asking the model to repeat training data, asking it to dump tool definitions).
- 5 refusal bypass attempts (variations the model should refuse but might comply with after creative reframing).
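The breakdown above can be sketched as a simple data structure plus a sanity check on category counts. This is a minimal illustration, not a standard format: the field names (`category`, `prompt`, `expected`) and the placeholder prompts are assumptions; in practice the direct-injection prompts would come from the public libraries listed above and the indirect cases from your real retrieval pipeline.

```python
# Minimal sketch of a red-team eval-set schema (illustrative only; the
# field names and placeholder prompts are assumptions, not a standard).
from collections import Counter


def make_case(category: str, prompt: str, expected: str) -> dict:
    """One eval case: the attack prompt plus the behavior we expect."""
    return {"category": category, "prompt": prompt, "expected": expected}


# Hypothetical seed cases, matching the counts in the list above.
cases = (
    [make_case("direct_injection", f"direct-{i}", "refuse") for i in range(10)]
    + [make_case("indirect_injection", f"indirect-{i}", "ignore_embedded_instructions") for i in range(10)]
    + [make_case("jailbreak", f"jailbreak-{i}", "refuse") for i in range(5)]
    + [make_case("pii_extraction", f"pii-{i}", "refuse") for i in range(5)]
    + [make_case("refusal_bypass", f"bypass-{i}", "refuse") for i in range(5)]
)

# Sanity check: the set should cover every category at the planned size.
counts = Counter(c["category"] for c in cases)
assert sum(counts.values()) == 35
```

Keeping the set as plain records like this makes it trivial to serialize to JSONL, version in git, and replay against the model on every change.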