Lesson 7 · 10 min

Production monitoring — catching drift before users do

Your eval suite passes in CI. Then the model provider updates its weights, or the real-world input distribution shifts. Production monitoring runs your evals continuously against live traffic so you catch silent degradation.

The gap between CI and production

CI evals run on a static, curated dataset. Production inputs are messy, evolving, and never quite what you curated. The two failure modes CI misses:

  1. Input distribution shift — users start asking different kinds of questions. Your eval dataset doesn't cover the new pattern.
  2. Model drift — provider updates model weights without a version bump. Behavior changes silently.
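The first failure mode can be quantified directly. One illustrative approach (a sketch, not a prescribed method from this lesson): tag each input with a coarse category, then compare the category distribution of recent traffic against the distribution your eval dataset was curated from. The function name and the use of total variation distance are assumptions for the example.

```python
from collections import Counter

def distribution_shift(baseline: list[str], recent: list[str]) -> float:
    """Total variation distance between two category distributions.

    Returns 0.0 when the distributions are identical and 1.0 when they
    are disjoint. Categories are illustrative -- e.g. intent labels
    produced by a lightweight classifier on each request.
    """
    b, r = Counter(baseline), Counter(recent)
    n_b, n_r = len(baseline), len(recent)
    categories = set(b) | set(r)
    return 0.5 * sum(abs(b[c] / n_b - r[c] / n_r) for c in categories)
```

A score drifting upward over successive windows is the signal that live inputs no longer look like your curated eval set, even before any eval starts failing.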

Production monitoring catches both by running eval logic against a sample of live traffic.
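The sampling step above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names, the 5% default rate, and the metric naming scheme are all assumptions. The one deliberate choice shown is deterministic sampling (hashing the request id rather than calling a random generator) so the same request is always in or out of the sample, which makes incidents replayable.

```python
import hashlib
from typing import Callable

def should_sample(request_id: str, rate: float = 0.05) -> bool:
    # Deterministic sampling: hash the request id into one of 10,000
    # buckets, so the decision is stable across retries and processes.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

def run_production_evals(
    request_id: str,
    prompt: str,
    response: str,
    evals: dict[str, Callable[[str, str], bool]],
    emit_metric: Callable[[str, float], None],
    rate: float = 0.05,
) -> None:
    # Run each eval check against a sampled live prompt/response pair
    # and emit a pass/fail metric that dashboards can alert on.
    if not should_sample(request_id, rate):
        return
    for name, check in evals.items():
        passed = check(prompt, response)
        emit_metric(f"eval.{name}.pass", 1.0 if passed else 0.0)
```

In practice `emit_metric` would forward to your metrics backend; alerting on a drop in the rolling pass rate for any eval is what turns a silent regression into a page.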