Benchmark Leaderboard

Top models by Google-Proof Q&A score

Rankings are valid only within this benchmark. A score gap here does not correspond to the same gap on another benchmark, so validate any shortlist by checking neighboring rows and each model's context and price fit.


#	Model	Score	Benchmark variant	Source
1	Claude Mythos Preview	94.6	diamond	https://epoch.ai/benchmarks/gpqa-diamond; https://intuitionlabs.ai/articles/gpqa-diamond-ai-benchmark
2	GPT-5.6 Sol	94.6	GPQA Diamond; max reasoning	https://openai.com/index/gpt-5-6/
3	Gemini 3.1 Pro Preview	94.3	diamond	https://artificialanalysis.ai/leaderboards/models
4	Claude Opus 4.7	94.2	diamond	https://www.anthropic.com/news/claude-opus-4-7
5	GPT-5.5	93.6	diamond	https://openai.com/index/introducing-gpt-5-5/
6	GPT-5.5 Pro	93.6	GPQA Diamond	https://codingfleet.com/blog/claude-opus-4-8-vs-gpt-5-5-comparison/
7	Claude Opus 4.8	93.6	GPQA Diamond	https://www.anthropic.com/news/claude-opus-4-8
8	Kimi K3	93.5	GPQA-Diamond; max reasoning effort	https://www.kimi.com/blog/kimi-k3
9	MiniMax M3	92.9	MiniMax M3 GPQA 92 (accuracy%)	https://venturebeat.com/technology/minimax-m3-debuts-eclipsing-gpt-5-5-and-gemini-3-1-pro-on-key-benchmark-performance-for-just-5-10-of-the-cost
10	Qwen3.7-Max	92.4	diamond	https://www.datacamp.com/blog/qwen3-7-max
11	Gemini 3.5 Flash	92.2	GPQA Diamond (accuracy)	https://www.nxcode.io/resources/news/gemini-3-5-flash-complete-guide-benchmarks-pricing-api-2026
12	GPT-5.4	92	diamond	https://pricepertoken.com/leaderboards/benchmark/gpqa
13	Gemini 3 Pro	91.9	—	https://deepmind.google/technologies/gemini/pro/
14	Qwen3.6-Max	91.8	—	https://qwenlm.github.io/blog/qwen3.6/
15	Claude Opus 4.6	91.3	diamond	https://www.anthropic.com/claude/opus
16	GLM-5.2	91.2	diamond	https://huggingface.co/zai-org/GLM-5.2
17	Kimi K2.6	90.5	GPQA Diamond (accuracy)	https://huggingface.co/moonshotai/Kimi-K2.6
18	Gemini 3 Flash	90.4	—	https://deepmind.google/technologies/gemini/flash/
19	Qwen3.6-Plus	90.4	llm-stats shows 0 (accuracy%)	https://llm-stats.com/benchmarks/gpqa
20	DeepSeek V4 Pro	90.1	diamond	https://api-docs.deepseek.com/news/news260424
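
Several rows in the table above share a score (94.6 at positions 1 and 2, 93.6 at positions 5 to 7), so the order inside a tie carries no information. The following is a minimal Python sketch, assuming the rows are held as (model, score) tuples, that assigns shared "competition" ranks (1, 1, 3, ...) instead of strict ordinals. The function name `competition_rank` is hypothetical and not part of any leaderboard tooling; the data are the first eight rows of the table.

```python
def competition_rank(rows):
    """Return (rank, model, score) tuples where equal scores share a rank."""
    # sorted() is stable, so tied rows keep their original relative order.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ranked = []
    prev_score = None
    prev_rank = 0
    for position, (model, score) in enumerate(ordered, start=1):
        # A tie reuses the previous rank; otherwise the rank is the position.
        rank = prev_rank if score == prev_score else position
        ranked.append((rank, model, score))
        prev_score, prev_rank = score, rank
    return ranked


# First eight rows of the leaderboard, copied from the table above.
rows = [
    ("Claude Mythos Preview", 94.6),
    ("GPT-5.6 Sol", 94.6),
    ("Gemini 3.1 Pro Preview", 94.3),
    ("Claude Opus 4.7", 94.2),
    ("GPT-5.5", 93.6),
    ("GPT-5.5 Pro", 93.6),
    ("Claude Opus 4.8", 93.6),
    ("Kimi K3", 93.5),
]

for rank, model, score in competition_rank(rows):
    print(rank, model, score)
```

Read this way, positions 5 to 7 are a three-way tie at rank 5, and the next distinct score takes rank 8.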

Interpretation

Trust when:

You use the top rows for shortlisting, then validate provider coverage and feature fit before selecting a model.
A model ranks well on several benchmarks: cross-benchmark consistency is a stronger signal than a lead on a single benchmark.

Don't trust when:

You are comparing raw score gaps across benchmarks with different scales; the gaps are not equivalent.
Entries are missing from the heatmap; the gaps can hide benchmark blind spots.
You need a trend: no trend data is shown, so this is a snapshot, and scores can drift between refresh cycles.
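
The blind-spot caveat can be made concrete by counting how many benchmarks each model actually has a score for. The following is a minimal Python sketch, assuming the heatmap is held as a dict mapping each model to its per-benchmark scores, with `None` standing in for a cell shown as a dash. The helper `coverage` and the threshold of two benchmarks are illustrative assumptions; the data are the first six rows of the cross-benchmark heatmap on this page.

```python
BENCHMARKS = ["MMLU", "GPQA", "HumanEval", "HellaSwag"]

# Scores in BENCHMARKS order; None marks a cell with no reported score.
heatmap = {
    "Claude Mythos Preview": [None, 94.6, None, None],
    "GPT-5.6 Sol": [None, 94.6, None, None],
    "Gemini 3.1 Pro Preview": [98.0, 94.3, 94.0, None],
    "Claude Opus 4.7": [None, 94.2, None, None],
    "GPT-5.5": [92.4, 93.6, 94.2, None],
    "GPT-5.5 Pro": [None, 93.6, None, None],
}


def coverage(scores):
    """Count the benchmarks for which a score is present."""
    return sum(score is not None for score in scores)


# Models with at least two scores can be checked for cross-benchmark consistency.
comparable = [model for model, scores in heatmap.items() if coverage(scores) >= 2]

for model, scores in heatmap.items():
    print(f"{model}: {coverage(scores)}/{len(BENCHMARKS)} benchmarks")
print("Comparable across benchmarks:", comparable)
```

In these rows only two models have more than one score, so a cross-benchmark consistency check is not possible for the others.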

Cross-benchmark heatmap

Compare top models across selected benchmark families to spot strengths and weak spots.

Model	MMLU	GPQA	HumanEval	HellaSwag
Claude Mythos Preview	—	94.6	—	—
GPT-5.6 Sol	—	94.6	—	—
Gemini 3.1 Pro Preview	98.0	94.3	94.0	—
Claude Opus 4.7	—	94.2	—	—
GPT-5.5	92.4	93.6	94.2	—
GPT-5.5 Pro	—	93.6	—	—