The agents leaderboard · for developers

Best for agents

6 editor picks · 19 eligible models · Tool-use loops that don't crash on turn 12.

Editorial pick plus benchmark and API pricing context.

EDITOR'S CHOICE · Researched 6d ago

Claude Sonnet 5

Anthropic · 1m context

Excellent

The most agentic Sonnet yet — strong computer use and multi-step recovery at Sonnet pricing.

Anthropic's new default agent: 85.2% SWE-bench Verified, 81.2% OSWorld-Verified, 86.6% BrowseComp multi-agent, and 1M context at the same $3/$15 list price (input/output, per 1M tokens) as Sonnet 4.6, with $2/$10 intro pricing through 2026-08-31.
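To make the pricing concrete, here is a minimal sketch of a per-run cost estimate using the list and intro prices quoted above. The token counts are a hypothetical workload, the function name `run_cost` is ours, and we assume "through 2026-08-31" means the intro price applies up to and including that date.

```python
from datetime import date

# USD per 1M tokens, as quoted in the blurb above.
LIST_PRICE = {"input": 3.00, "output": 15.00}
INTRO_PRICE = {"input": 2.00, "output": 10.00}
INTRO_END = date(2026, 8, 31)  # assumption: "through" is inclusive


def run_cost(input_tokens: int, output_tokens: int, on: date) -> float:
    """Estimate the USD cost of one agent run on a given date."""
    price = INTRO_PRICE if on <= INTRO_END else LIST_PRICE
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 400k input tokens and 50k output tokens per run.
intro = run_cost(400_000, 50_000, date(2026, 8, 1))
listed = run_cost(400_000, 50_000, date(2026, 9, 1))
print(f"intro: ${intro:.2f}, list: ${listed:.2f}")
```

Long tool-use loops resend context on every turn, so input tokens usually dominate the count even though output tokens carry the higher unit price.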

The numbers

$/1M out: $10.00 ($2.00 input)

Context: 1M max window

Pros

+ 81.2% OSWorld-Verified
+ 86.6% BrowseComp multi-agent
+ 1M context at Sonnet list price

Cons

− No τ-bench or MultiChallenge row yet
− $15 / 1M out at list

Also worth picking

The runners-up

Ranked by editorial pick order · Editorial tiers: Excellent, Strong, Solid

# · Model · Tier · $/1M out · Editor's note

Claude Sonnet 4.6

Anthropic · 1m

$15.00 / 1M out

Best generally available τ-bench score (87.5) among predecessors; a solid fallback until Sonnet 5 publishes comparable agent-bench rows.

Claude Fable 5

Anthropic · 1m

$50.00 / 1M out

OSWorld-Verified 85.0% and 80.3% SWE-bench Pro make it a top-tier computer-use and coding-agent candidate, but no Fable-specific τ-bench score is published yet.

GLM-5

Zhipu AI · 200k

$2.08 / 1M out

τ-bench 82.1, open weights at $2.08 out — the best price-per-step agent we'd run always-on.

GPT-5.4

OpenAI · 1.05m

$15.00 / 1M out

Picked over GPT-5.5 here: it carries the measured τ-bench score (78.3) and runs at half the output price ($15 vs $30) — decisive for always-on agents.

Gemini 3 Pro

Google DeepMind · 1m

$5.00 / 1M out

Best for chatty agents with many read-only tools and large contexts.

Eligibility

19 models are eligible for this board

A model is eligible when it is tagged with useCases: [agents]. Pins must come from this pool.
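The eligibility rule can be sketched as a tag filter plus a pin check. This is an illustration only: the `Model` record, the `eligible` and `validate_pins` helpers, and the catalog entries (including the untagged `example-chat-model`) are hypothetical, and the only detail taken from this page is the `useCases: [agents]` tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    use_cases: tuple[str, ...] = ()  # stands in for the page's useCases tag


def eligible(models: list[Model], board: str) -> list[Model]:
    """Return the models tagged for the given board, in catalog order."""
    return [m for m in models if board in m.use_cases]


def validate_pins(pins: list[str], pool: list[Model]) -> list[str]:
    """Reject any pinned model name that is not in the eligible pool."""
    pool_names = {m.name for m in pool}
    outside = [p for p in pins if p not in pool_names]
    if outside:
        raise ValueError(f"pins outside the eligible pool: {outside}")
    return list(pins)


# Hypothetical catalog; the last entry is not tagged for agents.
CATALOG = [
    Model("Claude Sonnet 5", ("agents", "coding")),
    Model("GLM-5", ("agents",)),
    Model("example-chat-model", ("chat",)),
]

pool = eligible(CATALOG, "agents")
print(validate_pins(["Claude Sonnet 5", "GLM-5"], pool))
```

Validating pins against the filtered pool, rather than the full catalog, keeps an editor from pinning a model that was never tagged for this board.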

All picks