The 18 LLM leaderboards we'd actually use.

Coding

Claude Fable 5

Anthropic's new flagship: 80.3% SWE-bench Pro, 96% SWE-bench Verified on Vals.ai, and 85.0% OSWorld-Verified make it the best production coding pick for non-trivial engineering tasks.

Claude Opus 4.8

6 picks · 16 eligible

Agents

Anthropic's new default agent: 85.2% SWE-bench Verified, 81.2% OSWorld-Verified, 86.6% BrowseComp multi-agent, and 1M context at the same regular $3/$15 price as Sonnet 4.6 ($2/$10 introductory price through 2026-08-31).

6 picks · 19 eligible

Claude Fable 5

Tool use

Best current BFCL (72.5) with rock-solid JSON-schema adherence and a 1M window at $5 out.

5 picks · 9 eligible

Open weights

DeepSeek V4 Pro

4 picks · 15 eligible

Best open-weights model we've tested: #1 LiveCodeBench (93.5), 80.6 SWE-bench, 1M context, $0.87 out.

Kimi K2.6

GLM-5.1

Long context

The long-context standard — 1M tokens with cheap input that keeps the bill survivable.

5 picks · 8 eligible

Claude Opus 4.7

Cheap

DeepSeek V4 Flash

$0.22 / 1M out with LiveCodeBench 91.6 and a 1M window — nothing else is this capable this cheap.

Qwen3.5-Flash

Gemini 3 Flash

For

Knowledge workers

6 boards

Writing

Claude Opus 4.7

Tops Chatbot Arena (1503) and writes paragraphs you'd ship; understands tone notes and edits like a copy chief.

GPT-5.5

4 picks · 7 eligible

Research

Claude Fable 5

5 picks · 9 eligible

GDPval-AA ELO 1932 and Anthropic-reported finance, trading, and analytics wins make it the strongest general knowledge-work pick; do not use Mythos-only HLE rows as Fable evidence.

Claude Opus 4.7

GPT-5.5

Summarization

Gemini 3 Flash

1M context plus MMLU-Pro 88.6 at $3 out — handles a 500-page transcript faithfully without breaking the budget.

Claude Haiku 4.5

DeepSeek V4 Flash

Docs Q&A

Highest groundedness and graceful refusal — says "not in the docs" instead of guessing.

4 picks · 7 eligible

Claude Haiku 4.5

Translation

4 picks · 5 eligible

Best coverage of lower-resource languages with strong idiom handling and a huge context for document-level consistency.

GPT-5.4

Qwen3.5-397B-A17B

Data & SQL

GPT-5.5

Best text-to-SQL accuracy in production — HumanEval 94.2 and top-tier reasoning pick the right join and respect dialect quirks.

4 picks · 7 eligible

DeepSeek V4 Pro

For

Creatives

6 boards

Image

FLUX.2 Dev

The current photoreal leader — brand-consistent, with the best text rendering and hands in the open ecosystem.

DALL-E 3

Midjourney v6+

Video

Veo 3.1

4 picks · 28 eligible

Best overall video quality in the catalog: 30-second clips, native audio, and up to 4K through Vertex AI.

Runway Gen-4.5

Wan 2.7

Voice (TTS)

ElevenLabs

4 picks · 21 eligible

Most expressive voices with the lowest artifact rate — the studio standard for production VO.

OpenAI TTS

Aura-2 EN

Transcription

Whisper large-v3-turbo

4 picks · 22 eligible

Lowest WER on noisy real-world audio with the broadest language coverage; cheap to self-host.

Deepgram Nova-3

AssemblyAI

Music

Suno AI

4 picks · 5 eligible

Most production-ready output; stems are usable and the vocals are believable.

Udio

Stability Audio Ultra

Image editing

FLUX.2 Dev

Best inpaint preservation — brand colors and untouched regions stay put across edits.

Recraft V3

SD v1.5 Inpainting