LLM Reference

The tool use leaderboard · for developers

Best for tool use

4 editor picks · 6 eligible models · Reliable function calling and structured outputs.

EDITOR'S CHOICE · Researched 141d ago

Gemini 3 Pro

Google DeepMind · 1M context
Excellent

Structured output is table stakes — pick on schema reliability and price.

Best current BFCL score (72.5), with rock-solid JSON-schema adherence and a 1M-token window at $5.00 per 1M output tokens.

The numbers
$/1M out: $5.00 ($1.25 input)
Context: 1M (max window)
Pros
  • Top current BFCL
  • Native structured outputs
  • Cheap for the tier
Cons
  • Trails Claude on tricky tool selection
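
To make the listed rates concrete, here is a minimal sketch of a per-request cost estimate at $1.25 per 1M input tokens and $5.00 per 1M output tokens. The function name and the token counts are hypothetical, chosen only for illustration, and the sketch assumes flat per-token pricing with no caching or tiered discounts.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_per_m: float = 1.25,   # USD per 1M input tokens, as listed on this card
    output_per_m: float = 5.00,  # USD per 1M output tokens, as listed on this card
) -> float:
    """Estimate the USD cost of one request from per-1M-token rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# Hypothetical tool-calling request: 8,000 prompt tokens, 500 completion tokens.
cost = request_cost_usd(8_000, 500)
print(f"${cost:.4f}")  # prints "$0.0125"
```

Passing different rates (for example an output rate of 15.00) gives a like-for-like comparison with the runners-up.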

Also worth picking

The runners-up

ranked by editorial pick order
Editorial tiers: Excellent · Strong · Solid

# · Model · Tier · $/1M out · Editor's note

#2 · Anthropic · 1M context · $15.00 per 1M out
Best at picking the right tool when ten look plausible; pairs schema discipline with τ-bench leadership.

#3 · OpenAI · 1M context · $15.00 per 1M out
Matches GPT-5.5's structured-output reliability at half the output price ($15 vs $30); the GPT to run at high call volume.

#4 · Alibaba · 256K context · $2.34 per 1M out
BFCL 72.9 with open weights: strong function calling you can self-host.

Eligibility

6 models are eligible for this board

A model is eligible when it is tagged with useCases: [tool-use]. Editor picks (pins) must come from this pool.
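
The eligibility rule can be sketched as a simple filter plus a check on pins. This is an illustration only: the record shape (an `id` plus a `useCases` list), the helper names, and the model ids below are all assumptions, not the site's actual schema or data.

```python
from typing import TypedDict


class Model(TypedDict):
    id: str              # hypothetical identifier field
    useCases: list[str]  # tags such as "tool-use"


def eligible_pool(models: list[Model], use_case: str = "tool-use") -> list[Model]:
    """Return the models tagged with the given use case."""
    return [m for m in models if use_case in m["useCases"]]


def invalid_pins(pins: list[str], models: list[Model], use_case: str = "tool-use") -> list[str]:
    """Return the pinned ids that are NOT in the eligible pool."""
    pool_ids = {m["id"] for m in eligible_pool(models, use_case)}
    return [p for p in pins if p not in pool_ids]


# Hypothetical catalogue entries, for illustration only.
catalogue: list[Model] = [
    {"id": "model-a", "useCases": ["tool-use", "coding"]},
    {"id": "model-b", "useCases": ["writing"]},
    {"id": "model-c", "useCases": ["tool-use"]},
]

print([m["id"] for m in eligible_pool(catalogue)])   # ['model-a', 'model-c']
print(invalid_pins(["model-a", "model-b"], catalogue))  # ['model-b']
```

Returning the offending pins, rather than a bare boolean, makes it easy to report which pick falls outside the pool.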

All picks