LLM Reference

The agents leaderboard · for developers

Best for agents

4 editor picks · 7 eligible models · Tool-use loops that don't crash on turn 12.

EDITOR'S CHOICE · Researched 1d ago

Claude Sonnet 4.6

Anthropic · 1M context · Tier: Excellent

The most reliable tool-use loop in production: it recovers from errors on its own.

Best τ-bench score among generally available models (87.5); it stays on-task across long tool loops and self-corrects without prompting.

The numbers
$/1M out: $15.00 ($3.00 / 1M input)
Context: 1M (max window)
Pros
  • Top GA τ-bench score
  • Reliable multi-step recovery
  • 1M context
Cons
  • $15 / 1M out
  • Not the cheapest per step
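
The "not the cheapest per step" trade-off is easier to judge with a little arithmetic. The following is a rough sketch only: it assumes both listed prices are per 1M tokens, and the token counts per step (2,000 in, 500 out) are illustrative assumptions, not measurements. `cost_per_step` is a hypothetical helper, not part of any API.

```python
def cost_per_step(input_tokens: int, output_tokens: int,
                  usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Cost in USD of one agent step, with prices quoted per 1M tokens."""
    return (input_tokens * usd_per_m_in + output_tokens * usd_per_m_out) / 1_000_000


# Prices from the card above; token counts are assumed for illustration.
step_usd = cost_per_step(2_000, 500, usd_per_m_in=3.00, usd_per_m_out=15.00)

# Simplification: every step is assumed to use the same token counts.
# In a real tool loop the input usually grows as the history accumulates.
loop_usd = 12 * step_usd

print(f"per step: ${step_usd:.4f}")      # per step: $0.0135
print(f"12-step loop: ${loop_usd:.4f}")  # 12-step loop: $0.1620
```

Swapping in another model's prices gives a like-for-like comparison under the same assumed workload.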

Also worth picking

The runners-up

ranked by editorial pick order
Editorial tiers: Excellent · Strong · Solid
# · Model · Tier · $/1M out · Editor's note
#2
Zhipu AI · 200K
$2.08
τ-bench 82.1 with open weights at $2.08 per 1M output tokens: the best price-per-step agent we'd run always-on.
#3
OpenAI · 1M
$15.00
Picked over GPT-5.5 here: it has a measured τ-bench score (78.3) and costs half as much per 1M output tokens ($15 vs. $30), which is decisive for always-on agents.
#4
Google DeepMind · 1M
$5.00
Best for chatty agents with many read-only tools and large contexts.

Eligibility

7 models are eligible for this board

A model is eligible if it is tagged with useCases: [agents]. Pinned picks must come from this pool.
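
The eligibility rule can be sketched as a filter over model records, plus a check that every pin belongs to the filtered pool. This is a minimal sketch, not the site's implementation. The record shape, the model ids, and the function names `eligible_pool` and `validate_pins` are hypothetical. Only the `useCases` field and the `agents` tag come from the rule above.

```python
from typing import Iterable, List, Set

# Hypothetical records for illustration; real entries would carry more fields.
MODELS = [
    {"id": "model-a", "useCases": ["agents", "coding"]},
    {"id": "model-b", "useCases": ["chat"]},
    {"id": "model-c", "useCases": ["agents"]},
]


def eligible_pool(models: Iterable[dict], use_case: str = "agents") -> Set[str]:
    """Return the ids of all models tagged with the given use case."""
    return {m["id"] for m in models if use_case in m.get("useCases", [])}


def validate_pins(pins: List[str], pool: Set[str]) -> List[str]:
    """Return the pins that fall outside the eligible pool (empty means valid)."""
    return [p for p in pins if p not in pool]


pool = eligible_pool(MODELS)
invalid = validate_pins(["model-a", "model-b"], pool)

print(sorted(pool))  # ['model-a', 'model-c']
print(invalid)       # ['model-b']
```

Returning the offending pins, rather than a bare boolean, makes it easier to report which pick needs fixing.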
