LLM Reference

The coding leaderboard · for developers

Best for coding

4 editor picks · 5 eligible models · The model we'd hand a stranger reviewing a PR.

EDITOR'S CHOICE · Researched 8d ago

Claude Opus 4.7

Anthropic · 1M context
Excellent

Top SWE-bench in production — patches like a senior engineer.

Leads both SWE-bench Verified (87.6) and SWE-bench Pro (64.3) and tops Chatbot Arena; the surest hand on a real PR.

The numbers
Output: $25.00 per 1M tokens
Input: $5.00 per 1M tokens
Context: 1M tokens (max window)
Pros
  • #1 on SWE-bench Verified & Pro (GA)
  • 1M context for whole-repo edits
  • Lowest re-prompt rate we see
Cons
  • $25 / 1M out — premium tier
  • Overkill for boilerplate
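
To make the listed prices concrete, here is a rough sketch of per-request cost at the rates in the card above ($5.00 per 1M input, $25.00 per 1M output). It assumes billing is linear in token count, with no caching or batch discounts; the function name `request_cost` and the token counts are illustrative, not taken from any provider's API.

```python
from decimal import Decimal

# Prices from the card above, in USD per 1M tokens.
INPUT_PRICE_PER_M = Decimal("5.00")
OUTPUT_PRICE_PER_M = Decimal("25.00")
TOKENS_PER_M = Decimal(1_000_000)


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (assumes linear per-token billing)."""
    total = (
        Decimal(input_tokens) * INPUT_PRICE_PER_M
        + Decimal(output_tokens) * OUTPUT_PRICE_PER_M
    )
    return total / TOKENS_PER_M


# Hypothetical PR review: a 200k-token repo slice in, a 4k-token review out.
cost = request_cost(200_000, 4_000)
print(f"${cost:.2f}")
```

Under these assumptions the example works out to $1.00 of input plus $0.10 of output, so on a context-heavy review the input side dominates the bill even though the output rate is five times higher.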

Also worth picking

The runners-up

ranked by editorial pick order
Editorial tiers: Excellent · Strong · Solid

#2 · OpenAI · 1M context · $30.00 per 1M out
OpenAI's current flagship: SWE-bench Pro 58.6 and HumanEval 94.2. The best non-Claude reviewer we run on a real PR.

#3 · DeepSeek · 1M context · $0.87 per 1M out
Tops LiveCodeBench (93.5) and scores 80.6 on SWE-bench at $0.87 out, with open weights: near-frontier coding for a fraction of the price.

#4 · Anthropic · 1M context · $15.00 per 1M out
The everyday coding workhorse: most of Opus 4.7's instincts at $15 out.

Eligibility

5 models are eligible for this board

A model is eligible when it is tagged with useCases: [coding]. Pinned picks must come from this pool.
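
The eligibility rule above can be sketched as a filter plus a validation step. This is a minimal sketch, not the site's actual implementation: only the `useCases` tag comes from the text, while the record shape, the `id` field, the helper names `eligible_models` and `validate_pins`, and the placeholder model IDs are all assumptions for illustration.

```python
from typing import TypedDict


class Model(TypedDict):
    id: str               # hypothetical identifier field
    useCases: list[str]   # tag list named in the eligibility rule


def eligible_models(models: list[Model], use_case: str = "coding") -> list[Model]:
    """Return the models tagged with the given use case."""
    return [m for m in models if use_case in m["useCases"]]


def validate_pins(pins: list[str], models: list[Model], use_case: str = "coding") -> None:
    """Raise ValueError if any pinned id falls outside the eligible pool."""
    pool = {m["id"] for m in eligible_models(models, use_case)}
    outside = [p for p in pins if p not in pool]
    if outside:
        raise ValueError(f"pins not in the {use_case} pool: {outside}")


# Placeholder catalog; the IDs are made up and do not refer to real models.
CATALOG: list[Model] = [
    {"id": "model-a", "useCases": ["coding", "chat"]},
    {"id": "model-b", "useCases": ["chat"]},
    {"id": "model-c", "useCases": ["coding"]},
]

validate_pins(["model-a", "model-c"], CATALOG)  # passes: both are tagged coding
```

Validating pins against the pool at build time, rather than filtering them silently, means a mistyped or untagged pin fails loudly instead of disappearing from the board.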
