Benchmark Leaderboard
Top models by Google-Proof Q&A score
Sort order is only meaningful within this benchmark. A score gap here does not correspond to the same gap on another benchmark, so validate any shortlist by checking peer rows and each model's context and price fit.
| # | Model | Score | Version | Source |
|---|---|---|---|---|
| 1 | Claude Mythos Preview | 94.6 | diamond | https://epoch.ai/benchmarks/gpqa-diamond; https://intuitionlabs.ai/articles/gpqa-diamond-ai-benchmark |
| 2 | Gemini 3.1 Pro Preview | 94.3 | diamond | https://artificialanalysis.ai/leaderboards/models |
| 3 | Claude Opus 4.7 | 94.2 | diamond | https://www.anthropic.com/news/claude-opus-4-7 |
| 4 | GPT-5.5 | 93.6 | diamond | https://openai.com/index/introducing-gpt-5-5/ |
| 5 | GPT-5.5 Pro | 93.6 | diamond | https://openai.com/index/introducing-gpt-5-5/ |
| 6 | Claude Opus 4.8 | 93.6 | diamond | https://llm-stats.com/blog/research/claude-opus-4-8-launch |
| 7 | Qwen3.7-Max | 92.4 | diamond | https://www.datacamp.com/blog/qwen3-7-max |
| 8 | GPT-5.4 | 92 | diamond | https://pricepertoken.com/leaderboards/benchmark/gpqa |
| 9 | Gemini 3 Pro | 91.9 | — | https://deepmind.google/technologies/gemini/pro/ |
| 10 | Qwen3.6-Max | 91.8 | — | https://qwenlm.github.io/blog/qwen3.6/ |
| 11 | Claude Opus 4.6 | 91.3 | diamond | https://www.anthropic.com/claude/opus |
| 12 | Kimi K2.6 | 90.5 | — | https://moonshotai.github.io/Kimi-K2/ |
| 13 | Gemini 3 Flash | 90.4 | — | https://deepmind.google/technologies/gemini/flash/ |
| 14 | DeepSeek V4 Pro | 90.1 | diamond | https://www.datacamp.com/blog/deepseek-v4 |
| 15 | Grok 4.3 | 90.1 | — | https://x.ai/blog/grok-4-3 |
| 16 | Claude Sonnet 4.6 | 89.9 | diamond | https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf |
| 17 | Muse Spark | 89.5 | diamond | https://datacamp.com/blog/muse-spark-review; https://labellerr.com/blog/muse-spark-benchmarks/ |
| 18 | Qwen3.5-397B-A17B | 89.3 | diamond | Artificial Analysis |
| 19 | Trinity-Large-Thinking | 89.2 | diamond | https://docs.arcee.ai/language-models/trinity-large-thinking |
| 20 | ByteDance Doubao Seed 2.0 Pro | 88.9 | diamond | https://seed.bytedance.com/seed2 |
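The # column gives consecutive positions to models with identical scores (for example, the three rows at 93.6). When shortlisting, it can help to treat those rows as tied. The following is a minimal sketch, assuming a hypothetical `competition_rank` helper and a hand-copied subset of the rows above; it is not how the leaderboard itself computes positions.

```python
# (model, score) pairs copied by hand from the table above; only a subset is shown.
rows = [
    ("Claude Mythos Preview", 94.6),
    ("Gemini 3.1 Pro Preview", 94.3),
    ("Claude Opus 4.7", 94.2),
    ("GPT-5.5", 93.6),
    ("GPT-5.5 Pro", 93.6),
    ("Claude Opus 4.8", 93.6),
    ("Qwen3.7-Max", 92.4),
]


def competition_rank(rows):
    """Return (rank, model, score) tuples in which tied scores share a rank.

    Hypothetical helper for illustration: after a tie, the next rank skips
    ahead to the row's position (so ranks can run 4, 4, 4, 7).
    """
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ranked = []
    for position, (model, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = position
        ranked.append((rank, model, score))
    return ranked


for rank, model, score in competition_rank(rows):
    print(f"{rank:>2}  {model:<24} {score}")
```

With tied ranks made explicit, the order among the rows at 93.6 carries no information, so the choice between them has to rest on other criteria such as provider coverage or price.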
Interpretation
Trust when:
- You use the top rows to build a shortlist, then validate provider coverage and feature fit before selecting a model.
- A model ranks consistently across benchmarks; that is a stronger signal than one-off score dominance.
Don't trust when:
- You compare raw score gaps across different benchmark scales as if they were equivalent.
- Rows or cells are missing from the heatmap; the gaps can hide benchmark blind spots.
- You need trend data; this is a snapshot, so expect scores to drift with each refresh.
Cross-benchmark heatmap
Compare top models across selected benchmark families to spot strengths and weak spots.
| Model | MMLU | GPQA | HumanEval | HellaSwag |
|---|---|---|---|---|
| Claude Mythos Preview | — | 94.6 | — | — |
| Gemini 3.1 Pro Preview | — | 94.3 | — | — |
| Claude Opus 4.7 | — | 94.2 | — | — |
| GPT-5.5 | 92.4 | 93.6 | 94.2 | — |
| GPT-5.5 Pro | — | 93.6 | — | — |
| Claude Opus 4.8 | — | 93.6 | — | — |
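Most cells in the heatmap are empty, so a cross-benchmark comparison is only possible for models with more than one reported score. The following is a rough sketch of a coverage check, assuming the rows are copied by hand into a dictionary; the helper names and the `min_benchmarks=2` threshold are illustrative assumptions, not part of the leaderboard.

```python
# Heatmap rows copied from the table above; None marks a cell with no reported score.
BENCHMARKS = ["MMLU", "GPQA", "HumanEval", "HellaSwag"]
heatmap = {
    "Claude Mythos Preview": [None, 94.6, None, None],
    "Gemini 3.1 Pro Preview": [None, 94.3, None, None],
    "Claude Opus 4.7": [None, 94.2, None, None],
    "GPT-5.5": [92.4, 93.6, 94.2, None],
    "GPT-5.5 Pro": [None, 93.6, None, None],
    "Claude Opus 4.8": [None, 93.6, None, None],
}


def coverage(scores):
    """Count how many benchmarks have a reported score for one model."""
    return sum(s is not None for s in scores)


def comparable_models(heatmap, min_benchmarks=2):
    """Models with enough reported scores to compare across benchmarks.

    The threshold of 2 is an assumption chosen for illustration.
    """
    return [m for m, s in heatmap.items() if coverage(s) >= min_benchmarks]


def benchmark_coverage(heatmap):
    """Number of models with a reported score, per benchmark column."""
    return {
        name: sum(row[i] is not None for row in heatmap.values())
        for i, name in enumerate(BENCHMARKS)
    }


print(comparable_models(heatmap))
print(benchmark_coverage(heatmap))
```

A column with zero or one reported score cannot support any comparison between models, which is the blind spot the interpretation notes warn about.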