Lesson 8 · 14 min

Capstone — production eval system end-to-end

Build a complete eval pipeline for a customer support classifier: golden dataset, deterministic + LLM-as-judge evals, CI integration, and production monitoring. Walk through the decisions that make it robust.

The scenario

You're responsible for a customer support ticket classifier that routes tickets to billing, auth, product, or other. The model behind it changed last week, accuracy complaints came in, and you have no evals. You have 72 hours to build a production-grade eval system.

This capstone walks the full build: dataset → evals → CI → monitoring.
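
To make the four stages concrete before building each one, here is a minimal sketch of the first three on a toy scale: a tiny golden dataset, a deterministic accuracy eval, and a pass/fail gate that a CI job could call. Everything named here is an assumption for illustration: the `GoldenExample` type, the sample tickets, the keyword-based `classify` stand-in (the real system would call the model), and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass

# The four routes named in the scenario.
LABELS = ("billing", "auth", "product", "other")


@dataclass(frozen=True)
class GoldenExample:
    ticket: str
    expected: str  # should be one of LABELS


# Hypothetical golden examples, for illustration only.
GOLDEN = [
    GoldenExample("I was charged twice this month", "billing"),
    GoldenExample("I can't log in after resetting my password", "auth"),
    GoldenExample("The export button does nothing", "product"),
    GoldenExample("Do you have an office in Berlin?", "other"),
]


def classify(ticket: str) -> str:
    """Stand-in for the real model call (assumed to return one label string)."""
    text = ticket.lower()
    if "charged" in text or "invoice" in text:
        return "billing"
    if "log in" in text or "password" in text:
        return "auth"
    if "button" in text or "export" in text:
        return "product"
    return "other"


def accuracy(examples, predict) -> float:
    """Deterministic eval: fraction of examples with an exact label match."""
    correct = sum(predict(ex.ticket) == ex.expected for ex in examples)
    return correct / len(examples)


def ci_gate(score: float, threshold: float = 0.9) -> bool:
    """Hypothetical CI gate: pass only if accuracy meets the threshold."""
    return score >= threshold


score = accuracy(GOLDEN, classify)
print(f"accuracy={score:.2f} gate_passed={ci_gate(score)}")
```

Keeping `accuracy` independent of the classifier (it takes any `predict` callable) means the same eval can run against the old and new model versions, which is the comparison the scenario calls for.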