The tool use leaderboard · for developers

Best for tool use

5 editor picks · 9 eligible models · Reliable function calling and structured outputs.

Editorial pick plus benchmark and API pricing context.

EDITOR'S CHOICE · Researched 186d ago

Gemini 3 Pro

Google DeepMind · 1m context

Excellent

Structured output is table stakes — pick on schema reliability and price.

Best current BFCL (72.5) with rock-solid JSON-schema adherence and a 1M window at $5 out.

The numbers

$/1M out: $5.00 ($1.25 input)

Context: 1m max window

Pros

+ Top current BFCL
+ Native structured outputs
+ Cheap for the tier

Cons

− Trails Claude on tricky tool selection
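
To put the listed prices in context, here is a minimal sketch of a per-workload cost estimate. It uses only the $1.25 input and $5.00 output rates per 1M tokens shown above. The `Pricing` class, the `estimate_cost` helper, and the token counts are hypothetical, chosen for illustration rather than taken from any provider's billing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD rates per 1M tokens, as listed on this board."""
    input_per_million: float
    output_per_million: float


def estimate_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a given token volume."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Rates listed for the editor's choice: $1.25 input, $5.00 output.
gemini_3_pro = Pricing(input_per_million=1.25, output_per_million=5.00)

# Hypothetical workload: 2M input tokens and 400k output tokens.
cost = estimate_cost(gemini_3_pro, input_tokens=2_000_000, output_tokens=400_000)
print(f"${cost:.2f}")  # prints $4.50
```

Tool-use workloads tend to be input-heavy because tool schemas and tool results are sent back as input, so the input rate can matter as much as the headline output price.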

Also worth picking

The runners-up

Ranked by editorial pick order · Editorial tiers: Excellent, Strong, Solid

# · Model · Tier · $/1M out · Editor's note

Claude Sonnet 5

Anthropic · 1m

$10.00 / 1M out

Watchlist: strong function calling and structured outputs, with 86.6% BrowseComp, 81.2% OSWorld, and 57.4% HLE-with-tools. No BFCL score is listed, so no BFCL leadership is claimed.

Claude Sonnet 4.6

Anthropic · 1m

$15.00 / 1M out

Best at picking the right tool when ten look plausible; pairs schema discipline with τ-bench leadership.

GPT-5.4

OpenAI · 1.05m

$15.00 / 1M out

Matches GPT-5.5's structured-output reliability at half the output price ($15 vs $30) — the GPT to run at high call volume.

Qwen3.5-397B-A17B

Alibaba · 262k

$2.34 / 1M out

BFCL 72.9 with open weights — strong function calling you can self-host.

Eligibility

9 models are eligible for this board

A model is eligible when it is tagged with useCases: [tool-use]. Pinned picks must come from this pool.
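
The eligibility rule can be sketched as a tag filter plus a pin check. This is an illustration under stated assumptions, not the site's actual code: the `Model` class, the `eligible_pool` and `invalid_pins` helpers, and the `example-chat-model` entry are all hypothetical. Only the `tool-use` tag and the rule that pins must come from the eligible pool are taken from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    # Mirrors the useCases tag list described above (field name is an assumption).
    use_cases: list[str] = field(default_factory=list)


def eligible_pool(models: list[Model], board_tag: str) -> list[Model]:
    """Return the models tagged for this board."""
    return [m for m in models if board_tag in m.use_cases]


def invalid_pins(pins: list[str], models: list[Model], board_tag: str) -> list[str]:
    """Return pinned names that are NOT in the eligible pool."""
    pool = {m.name for m in eligible_pool(models, board_tag)}
    return [name for name in pins if name not in pool]


# Illustrative catalog; the last entry is a made-up, ineligible model.
catalog = [
    Model("Gemini 3 Pro", ["tool-use"]),
    Model("Qwen3.5-397B-A17B", ["tool-use"]),
    Model("example-chat-model", ["chat"]),
]

print(invalid_pins(["Gemini 3 Pro", "example-chat-model"], catalog, "tool-use"))
# prints ['example-chat-model']
```

Returning the offending names, rather than a bare true or false, makes it easier to report which pin broke the rule.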

All picks