Agentic benchmarks are still mostly broken — here's what isn't

GAIA, AgentBench, and SWE-bench are widely cited and widely gamed. A practical take on what to use for your own agent eval.

The state of public benchmarks

  • SWE-bench: best overall signal for code agents. Scores have climbed from ~2% (2023) to 70%+ (2026). Some saturation at the top; SWE-bench Verified is the human-validated subset most results now report.
  • GAIA: research/reasoning agent eval. Mostly saturated at Level 1; Level 3 still differentiates.
  • AgentBench, ToolBench: largely gamed; weak signal vs production.
  • WebArena, OSWorld: simulated environments. Useful for browser/desktop agents; brittle when real websites change.

What works in practice

Private evals are now the consensus. The pattern (a minimal harness sketch follows the list):

  1. 20–50 real tasks from your domain with known correct outcomes.
  2. 3+ runs per task (agents are stochastic).
  3. Task-level pass rate as the primary metric; a single aggregate score hides the long tail of hard tasks.
  4. Trace review on a sample: LLM-as-judge on the reasoning trace plus a weekly human review (sketched at the end of this post).
  5. Cost per success as a co-equal metric.
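
Here is a minimal sketch of steps 1–3 and 5. It assumes you supply a `run_agent(prompt)` function returning `(output, cost_usd)` and a per-task `check` for the known correct outcome; both names are placeholders for your own harness, not part of any library.

```python
# Minimal private-eval harness sketch (steps 1-3 and 5). `run_agent` and each
# task's `check` are placeholders for your own agent call and outcome checker.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the output matches the known correct outcome

def evaluate(tasks: list[Task], run_agent: Callable[[str], tuple[str, float]],
             runs_per_task: int = 3) -> tuple[list[dict], float]:
    results = []
    for task in tasks:
        passes, total_cost = 0, 0.0
        for _ in range(runs_per_task):             # agents are stochastic: repeat each task
            output, cost_usd = run_agent(task.prompt)
            total_cost += cost_usd
            passes += int(task.check(output))
        results.append({
            "task": task.name,
            "pass_rate": passes / runs_per_task,   # task-level metric, not one blended score
            "cost_per_success": total_cost / passes if passes else float("inf"),
        })
    overall = sum(r["pass_rate"] for r in results) / len(results)
    return results, overall                        # review per-task rows, not just `overall`
```

Run it weekly against the same 20–50 tasks and diff the per-task rows; a drop on a handful of tasks is exactly the long-tail regression an aggregate score would hide.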

If you're shipping an agent and not running this kind of eval weekly, the agent will silently degrade.
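
For step 4, one way the LLM-as-judge trace review could look. `call_judge_model`, the rubric text, and the 20% sampling rate are illustrative assumptions, not part of the pattern above; swap in whatever client and criteria you actually use, and route FAIL verdicts to the weekly human review.

```python
# LLM-as-judge trace review sketch. `call_judge_model` is a placeholder for your
# LLM client; the rubric and sampling rate are illustrative assumptions.
import random
from typing import Callable

JUDGE_RUBRIC = (
    "You are reviewing an agent's reasoning trace.\n"
    "Answer PASS or FAIL, then give one sentence of justification:\n"
    "- Did the agent use its tools appropriately?\n"
    "- Did it verify the result before finishing?\n"
)

def review_traces(traces: list[str], call_judge_model: Callable[[str], str],
                  sample_rate: float = 0.2) -> list[dict]:
    if not traces:
        return []
    sampled = random.sample(traces, max(1, int(len(traces) * sample_rate)))
    reviews = []
    for trace in sampled:
        verdict = call_judge_model(JUDGE_RUBRIC + "\nTrace:\n" + trace)
        reviews.append({
            "trace": trace,
            "verdict": verdict,
            "needs_human_review": verdict.strip().upper().startswith("FAIL"),
        })
    return reviews  # flagged items feed the weekly human review
```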
