LLM Reference

Benchmark Leaderboard

Top models by Google-Proof Q&A score

Rankings are valid only within this benchmark. A score gap on this benchmark does not correspond to the same gap on another benchmark, so validate any shortlist by checking peer rows and each model's context and price fit.

| # | Model | Score | Version | Source |
|---|-------|-------|---------|--------|
| 1 | Claude Mythos Preview | 94.6 | diamond | https://epoch.ai/benchmarks/gpqa-diamond; https://intuitionlabs.ai/articles/gpqa-diamond-ai-benchmark |
| 2 | Gemini 3.1 Pro Preview | 94.3 | diamond | https://artificialanalysis.ai/leaderboards/models |
| 3 | Claude Opus 4.7 | 94.2 | diamond | https://www.anthropic.com/news/claude-opus-4-7 |
| 4 | GPT-5.5 | 93.6 | diamond | https://openai.com/index/introducing-gpt-5-5/ |
| 5 | GPT-5.5 Pro | 93.6 | diamond | https://openai.com/index/introducing-gpt-5-5/ |
| 6 | Claude Opus 4.8 | 93.6 | GPQA Diamond | https://llm-stats.com/blog/research/claude-opus-4-8-launch |
| 7 | Qwen3.7-Max | 92.4 | diamond | https://www.datacamp.com/blog/qwen3-7-max |
| 8 | GPT-5.4 | 92 | diamond | https://pricepertoken.com/leaderboards/benchmark/gpqa |
| 9 | Gemini 3 Pro | 91.9 | | https://deepmind.google/technologies/gemini/pro/ |
| 10 | Qwen3.6-Max | 91.8 | | https://qwenlm.github.io/blog/qwen3.6/ |
| 11 | Claude Opus 4.6 | 91.3 | diamond | https://www.anthropic.com/claude/opus |
| 12 | Kimi K2.6 | 90.5 | | https://moonshotai.github.io/Kimi-K2/ |
| 13 | Gemini 3 Flash | 90.4 | | https://deepmind.google/technologies/gemini/flash/ |
| 14 | DeepSeek V4 Pro | 90.1 | diamond | https://www.datacamp.com/blog/deepseek-v4 |
| 15 | Grok 4.3 | 90.1 | | https://x.ai/blog/grok-4-3 |
| 16 | Claude Sonnet 4.6 | 89.9 | diamond | https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf |
| 17 | Muse Spark | 89.5 | diamond | https://datacamp.com/blog/muse-spark-review; https://labellerr.com/blog/muse-spark-benchmarks/ |
| 18 | Qwen3.5-397B-A17B | 89.3 | diamond | Artificial Analysis |
| 19 | Trinity-Large-Thinking | 89.2 | diamond | https://docs.arcee.ai/language-models/trinity-large-thinking |
| 20 | ByteDance Doubao Seed 2.0 Pro | 88.9 | diamond | https://seed.bytedance.com/seed2 |
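
One way to carry out the "checking peer rows" step is to list every model whose score falls within a small margin of a candidate. The sketch below copies the top eight rows from the leaderboard. The helper name `peer_rows` and the 0.5-point margin are illustrative assumptions, not something this page defines.

```python
# Top eight rows copied from the leaderboard: (model, GPQA Diamond score).
ROWS = [
    ("Claude Mythos Preview", 94.6),
    ("Gemini 3.1 Pro Preview", 94.3),
    ("Claude Opus 4.7", 94.2),
    ("GPT-5.5", 93.6),
    ("GPT-5.5 Pro", 93.6),
    ("Claude Opus 4.8", 93.6),
    ("Qwen3.7-Max", 92.4),
    ("GPT-5.4", 92.0),
]


def peer_rows(rows, model, margin=0.5):
    """Return the rows scoring within `margin` points of `model`, excluding `model`.

    `margin` is an arbitrary illustrative threshold (an assumption),
    not a statistically derived confidence interval.
    """
    scores = dict(rows)
    target = scores[model]
    return [
        (name, score)
        for name, score in rows
        if name != model and abs(score - target) <= margin
    ]


peers = peer_rows(ROWS, "GPT-5.5")
for name, score in peers:
    print(f"{name}: {score}")
```

Models returned as peers are close enough on this benchmark that context, price, and feature fit are likely to matter more than the score difference.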

Interpretation

Trust when:

  • You use the top rows only for shortlisting, then validate provider coverage and feature fit before selecting a model.
  • A model ranks consistently well across several benchmarks; this is a stronger signal than a one-off top score.

Don't trust when:

  • You are comparing raw score gaps across benchmarks with different scales; the gaps are not equivalent.
  • Rows are missing from the heatmap; the gaps can hide benchmark blind spots.
  • You need trend data; this is a snapshot, so expect drift between refresh cycles.
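
The points about cross-benchmark consistency and missing rows can be sketched by comparing within-benchmark ranks instead of raw scores, while counting the benchmarks each model is missing from. All model names, benchmark names, and scores below are hypothetical placeholders, not leaderboard data, and rank comparison is only one possible approach.

```python
# Hypothetical scores for illustration only; NOT taken from the leaderboard.
SCORES = {
    "bench_x": {"model_a": 94.0, "model_b": 93.0, "model_c": 90.0},
    "bench_y": {"model_a": 61.0, "model_b": 72.0, "model_c": 55.0},
    "bench_z": {"model_a": 88.0, "model_b": 89.5},  # model_c has no row here
}


def ranks_within(benchmark_scores):
    """Map each model to its 1-based rank inside one benchmark (higher is better).

    Ties are not handled specially: tied models get distinct, adjacent ranks.
    """
    ordered = sorted(benchmark_scores, key=benchmark_scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def consistency_report(scores):
    """Per model: ranks on each benchmark it appears in, mean rank, and missing count."""
    per_benchmark = {name: ranks_within(vals) for name, vals in scores.items()}
    models = sorted({m for vals in scores.values() for m in vals})
    report = {}
    for model in models:
        ranks = [r[model] for r in per_benchmark.values() if model in r]
        report[model] = {
            "ranks": ranks,
            "mean_rank": sum(ranks) / len(ranks),
            "missing": len(scores) - len(ranks),
        }
    return report


report = consistency_report(SCORES)
for model, row in report.items():
    print(model, row)
```

A nonzero `missing` count means the mean rank rests on fewer benchmarks, so it should be read with more caution.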

Cross-benchmark heatmap

Compare top models across selected benchmark families to spot strengths and weak spots. On narrow screens, scroll horizontally to inspect every benchmark.

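| Model | MMLU | GPQA | HumanEval | HellaSwag |
|-------|------|------|-----------|-----------|
| Claude Mythos Preview | | 94.6 | | |
| Gemini 3.1 Pro Preview | | 94.3 | | |
| Claude Opus 4.7 | | 94.2 | | |
| GPT-5.5 | 92.4 | 93.6 | 94.2* | |
| GPT-5.5 Pro | | 93.6 | | |
| Claude Opus 4.8 | | 93.6 | | |

* Column placement uncertain: this value follows the GPQA cell in the source and may belong under HumanEval or HellaSwag.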