The 18 LLM leaderboards we'd actually use.
Editor-curated picks for every job — coding, writing, image, voice. One Editor's Choice per board, every pick tagged with the use case it qualified for. Refreshed weekly.
Developers
Coding
Leads both SWE-bench Verified (87.6) and SWE-bench Pro (64.3) and tops Chatbot Arena; the surest hand on a real PR.
Agents
Best generally-available τ-bench (87.5); stays on-task across long tool loops and self-corrects without prompting.
Tool use
Best current BFCL (72.5) with rock-solid JSON-schema adherence and a 1M window at $5 / 1M out.
Open weights
Best open-weights model we've tested: #1 LiveCodeBench (93.5), 80.6 SWE-bench, 1M context, $0.87 / 1M out.
Long context
The long-context standard — 1M tokens with cheap input that keeps the bill survivable.
Cheap
$0.22 / 1M out with LiveCodeBench 91.6 and a 1M window — nothing else is this capable this cheap.
Knowledge workers
Writing
Tops Chatbot Arena (1503) and writes paragraphs you'd ship; understands tone notes and edits like a copy chief.
Research
GPQA Diamond 94.2 (top among generally available models) with the cleanest footnoted synthesis across many sources.
Summarization
1M context plus MMLU-Pro 88.6 at $3 / 1M out; handles a 500-page transcript faithfully without breaking the budget.
Docs Q&A
Highest groundedness and graceful refusal — says "not in the docs" instead of guessing.
Translation
Best coverage of lower-resource languages with strong idiom handling and a huge context for document-level consistency.
Data & SQL
Best text-to-SQL accuracy in production — HumanEval 94.2 and top-tier reasoning pick the right join and respect dialect quirks.
Creatives
Image
The current photoreal leader — brand-consistent, with the best text rendering and hands in the open ecosystem.
Video
Best shot-to-shot continuity with native synchronized audio.
Voice (TTS)
Most expressive voices with the lowest artifact rate — the studio standard for production VO.
Transcription
Lowest WER on noisy real-world audio with the broadest language coverage; cheap to self-host.
Music
Most production-ready output; stems are usable and the vocals are believable.
Image editing
Best inpaint preservation — brand colors and untouched regions stay put across edits.
How picks work
A model is eligible for a board only if it's tagged with that use case. Editors pin a handful per board, with exactly one designated Editor's Choice.
Editorial tiers
Each pick is bucketed into one of three qualitative tiers — Excellent · Strong · Solid. No decimals, no composite score, just editorial judgment.
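The rules above (tag-based eligibility, exactly one Editor's Choice per board, three tiers) can be sketched as a small validation check. This is only an illustration under assumptions: the `Pick` class, the `validate_board` function, and the model names are hypothetical and do not describe how the site actually stores or checks its picks.

```python
from dataclasses import dataclass

# The three editorial tiers named in the text.
TIERS = ("Excellent", "Strong", "Solid")


@dataclass
class Pick:
    model: str                    # hypothetical model identifier
    tier: str                     # expected to be one of TIERS
    editors_choice: bool = False  # True for the board's single Editor's Choice


def validate_board(use_case, picks, model_tags):
    """Return a list of rule violations for one board (empty if it is valid).

    model_tags maps a model identifier to the set of use cases it is tagged with.
    """
    problems = []
    for pick in picks:
        # Eligibility: the model must be tagged with this board's use case.
        if use_case not in model_tags.get(pick.model, set()):
            problems.append(f"{pick.model} is not tagged '{use_case}'")
        # Tiers are qualitative labels only, no numeric score.
        if pick.tier not in TIERS:
            problems.append(f"{pick.model} has unknown tier '{pick.tier}'")
    # Exactly one Editor's Choice per board.
    choices = sum(pick.editors_choice for pick in picks)
    if choices != 1:
        problems.append(f"expected exactly 1 Editor's Choice, found {choices}")
    return problems


if __name__ == "__main__":
    tags = {"model-a": {"coding"}, "model-b": {"coding", "writing"}}
    board = [Pick("model-a", "Excellent", editors_choice=True),
             Pick("model-b", "Strong")]
    print(validate_board("coding", board, tags))
```

Returning a list of problems rather than raising on the first one is a design choice for this sketch: it lets an editor see every violation on a board at once.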
vs. /best composites
Picks are opinionated. /best is the objective benchmark composite for the same capability.