The state of public benchmarks
- SWE-bench: best overall signal for code agents. Scores have climbed from ~1% (2023) to 70%+ (2026). Partial saturation on the full set; SWE-bench Verified is the human-validated subset most results are reported against.
- GAIA: research/reasoning agent eval. Level 1 is mostly saturated; Level 3 still differentiates.
- AgentBench, ToolBench: largely gamed; weak signal relative to production performance.
- WebArena, OSWorld: simulated environments. Useful for browser/desktop agents; brittle when real websites change.
What works in practice
Private evals are now the consensus. The pattern, with sketches after the list:
- 20–50 real tasks from your domain with known correct outcomes.
- 3+ runs per task (agents are stochastic).
- Task-level pass rate as the primary metric; an aggregate average hides the long tail.
- Trace review on a sample — LLM-as-judge on the reasoning trace plus a weekly human review.
- Cost per success as a co-equal metric.
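A minimal harness along these lines, in Python. The Task.run hook, the cost_usd field, and the convention that a task counts as passed only if every run passes are assumptions; wire them up to however your agent is invoked and metered.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical glue: `run` invokes your agent on one task and reports
# whether the known-correct outcome was reached and what the run cost.
@dataclass
class Task:
    name: str
    run: Callable[[], dict]   # returns {"passed": bool, "cost_usd": float}

RUNS_PER_TASK = 3             # agents are stochastic; repeat each task

def evaluate(tasks: list[Task]) -> None:
    per_task = []
    for task in tasks:
        outcomes = [task.run() for _ in range(RUNS_PER_TASK)]
        passes = sum(1 for o in outcomes if o["passed"])
        cost = sum(o["cost_usd"] for o in outcomes)
        per_task.append((task.name, passes, cost))

    # Task-level pass rate: one convention is to count a task as passed
    # only if every run passed, so an aggregate average across runs
    # can't hide the flaky long tail.
    task_pass_rate = sum(1 for _, p, _ in per_task if p == RUNS_PER_TASK) / len(per_task)

    # Cost per success: total spend over all runs divided by passing runs.
    total_cost = sum(c for _, _, c in per_task)
    total_passes = sum(p for _, p, _ in per_task)
    cost_per_success = total_cost / total_passes if total_passes else float("inf")

    print(f"task-level pass rate: {task_pass_rate:.0%}")
    print(f"cost per success: ${cost_per_success:.2f}")
    for name, p, c in sorted(per_task, key=lambda t: t[1]):
        print(f"  {name}: {p}/{RUNS_PER_TASK} passed, ${c:.2f}")
```

Sorting the per-task report by pass count puts the flaky tasks at the top, which is usually where the trace review should start.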
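And a sketch of the trace-review step. The judge callable is a placeholder for whatever model you use as the judge, and the prompt and JSON schema are illustrative, not a recommendation.

```python
import json
import random
from typing import Callable

JUDGE_PROMPT = """You are reviewing an agent's reasoning trace.
Task: {task}
Trace: {trace}
Did the agent reach the outcome through sound reasoning, or did it pass by
luck or shortcuts? Reply with JSON: {{"sound": true/false, "notes": "..."}}"""

def review_traces(
    traces: list[dict],           # each: {"task": str, "trace": str}
    judge: Callable[[str], str],  # placeholder: wraps your judge model call
    sample_size: int = 10,        # review a sample, not every trace
) -> list[dict]:
    sampled = random.sample(traces, min(sample_size, len(traces)))
    reviews = []
    for t in sampled:
        # Assumes the judge returns valid JSON matching the prompt's schema.
        raw = judge(JUDGE_PROMPT.format(task=t["task"], trace=t["trace"]))
        verdict = json.loads(raw)
        reviews.append({"task": t["task"], **verdict})
    # Traces the judge flags as unsound go to the weekly human review.
    return [r for r in reviews if not r.get("sound", False)]
```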
If you're shipping an agent and not running this kind of eval weekly, the agent will silently degrade.