Benchmark Leaderboard

Top models by HumanEval score

Rankings are only valid within this benchmark. A score gap on this benchmark does not correspond to the same gap on another benchmark, so validate any shortlist by checking peer rows and whether each model's context window and price fit your use case.


#	Model	Score	Version	Source
1	o3	96.7	2025-04	https://openai.com/index/introducing-o3-and-o4-mini/
2	Grok-3	94.5	—	https://x.ai/blog/grok-3
3	GPT-5.5	94.2	—	https://openai.com/index/introducing-gpt-5-5/
4	Gemini 2.5 Pro	93.1	2025-03	https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
5	Claude 3.7 Sonnet	93	2025-02	https://www.anthropic.com/news/claude-3-7-sonnet
6	GPT-4.1	92.9	2025-04	https://openai.com/index/gpt-4-1/
7	Qwen2.5-Coder-32B-Instruct	92.7	2024-11	https://qwenlm.github.io/blog/qwen2.5-coder-family/
8	Qwen3-235B-A22B	92.7	2025-04	https://qwenlm.github.io/blog/qwen3/
9	Claude 3.5 Sonnet	92	pass@1	https://crfm.stanford.edu/helm/classic/latest/
10	Kimi K2.6	92	—	https://moonshotai.github.io/Kimi-K2/
11	Mistral Large 3 675B Instruct	92	—	https://mistral.ai/news/mistral-large-3/
12	Gemini 3.5 Flash	92	—	https://o-mega.ai/articles/gemini-3-5-flash-benchmarks-cost-and-guide
13	GPT-4o (05-13)	90.2	pass@1	https://crfm.stanford.edu/helm/classic/latest/
14	Gemini 2.5 Flash	90.1	2025-05	https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
15	DeepSeek R1	89.9	2025-01	https://arxiv.org/abs/2501.12948
16	Granite 4.1 30B	89.63	—	https://huggingface.co/ibm-granite/granite-4.1-30b-instruct
17	Llama 3.1 405B	89	pass@1	https://ai.meta.com/blog/meta-llama-3-1/
18	Grok-2	88.4	pass@1	https://x.ai/blog/grok-2
19	Qwen2.5-32B-Instruct	88.4	pass@1	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
20	Qwen2.5-72B-Instruct	86.6	pass@1	https://qwenlm.github.io/blog/qwen2.5-llm/
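
Several rows in the table share a score (rows 7 and 8 at 92.7, rows 9 through 12 at 92), so position inside such a group carries no information. The following is a minimal sketch of tie-aware ("competition") ranking applied to rows 6 through 13 of the table. The function name `competition_ranks` and the choice of competition ranking are illustrative assumptions, not how this leaderboard computes its `#` column.

```python
from itertools import groupby

# Scores copied from leaderboard rows 6 through 13.
rows = [
    ("GPT-4.1", 92.9),
    ("Qwen2.5-Coder-32B-Instruct", 92.7),
    ("Qwen3-235B-A22B", 92.7),
    ("Claude 3.5 Sonnet", 92.0),
    ("Kimi K2.6", 92.0),
    ("Mistral Large 3 675B Instruct", 92.0),
    ("Gemini 3.5 Flash", 92.0),
    ("GPT-4o (05-13)", 90.2),
]


def competition_ranks(scored, start=1):
    """Hypothetical helper: tied scores share a rank, and the next
    distinct score skips ahead by the size of the tied group."""
    ranked = []
    position = start
    # sorted() is stable, so tied rows keep their original relative order.
    ordered = sorted(scored, key=lambda row: row[1], reverse=True)
    for score, group in groupby(ordered, key=lambda row: row[1]):
        members = list(group)
        for name, _ in members:
            ranked.append((position, name, score))
        position += len(members)
    return ranked


ranked = competition_ranks(rows, start=6)
for rank, name, score in ranked:
    print(f"{rank}\t{name}\t{score}")
```

Under this scheme the four models at 92 all receive rank 9, which makes it explicit that the table's ordering among them is arbitrary.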

Interpretation

Trust when:

You use the top rows only for shortlisting, then validate provider coverage and feature fit before selecting a model.
A model ranks well on several benchmarks; cross-benchmark consistency is a stronger signal than a single top score.

Don't trust when:

You are comparing raw score gaps across benchmarks with different scales; those gaps are not equivalent.
Rows are missing from the heatmap; the gaps can hide benchmark blind spots.
No trend data is shown; this is a snapshot, so expect scores and ranks to drift with each refresh.
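
One way to compare models across benchmarks without treating raw gaps as equivalent is to compare ranks within each benchmark. The sketch below does this for the four models that have both a GPQA and a HumanEval score in the heatmap. The helper `ranks_within` is hypothetical, ranks are relative to these four models only, and it assumes no tied scores.

```python
# GPQA and HumanEval scores for the models that have both in the heatmap.
scores = {
    "o3": {"GPQA": 87.7, "HumanEval": 96.7},
    "Grok-3": {"GPQA": 84.6, "HumanEval": 94.5},
    "GPT-5.5": {"GPQA": 93.6, "HumanEval": 94.2},
    "Gemini 2.5 Pro": {"GPQA": 86.4, "HumanEval": 93.1},
}


def ranks_within(benchmark, table):
    """Hypothetical helper: rank models on one benchmark, 1 = highest score.
    Assumes no ties among the scores passed in."""
    ordered = sorted(table, key=lambda model: table[model][benchmark], reverse=True)
    return {model: index + 1 for index, model in enumerate(ordered)}


humaneval_rank = ranks_within("HumanEval", scores)
gpqa_rank = ranks_within("GPQA", scores)

print("Model\tHumanEval rank\tGPQA rank")
for model in scores:
    print(f"{model}\t{humaneval_rank[model]}\t{gpqa_rank[model]}")
```

Among these four models, the HumanEval order and the GPQA order differ, which is the kind of inconsistency a single-benchmark sort cannot show.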

Cross-benchmark heatmap

Compare top models across selected benchmark families to spot strengths and weak spots.

Model	MMLU	GPQA	HumanEval	HellaSwag
o3	—	87.7	96.7	—
Grok-3	—	84.6	94.5	—
GPT-5.5	92.4	93.6	94.2	—
Gemini 2.5 Pro	—	86.4	93.1	—
Claude 3.7 Sonnet	—	—	93.0	—
GPT-4.1	—	—	92.9	—
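
Because many cells in the heatmap are empty, it can help to count coverage before reading anything into it. The sketch below copies the six rows above, with `None` standing in for each dash, and reports how many models have a score per benchmark and which models appear on only one benchmark. The variable names are illustrative assumptions.

```python
BENCHMARKS = ["MMLU", "GPQA", "HumanEval", "HellaSwag"]

# Heatmap rows copied from the table; None marks a missing cell.
heatmap = {
    "o3": [None, 87.7, 96.7, None],
    "Grok-3": [None, 84.6, 94.5, None],
    "GPT-5.5": [92.4, 93.6, 94.2, None],
    "Gemini 2.5 Pro": [None, 86.4, 93.1, None],
    "Claude 3.7 Sonnet": [None, None, 93.0, None],
    "GPT-4.1": [None, None, 92.9, None],
}

# Number of models with a reported score on each benchmark.
coverage = {
    name: sum(1 for row in heatmap.values() if row[col] is not None)
    for col, name in enumerate(BENCHMARKS)
}

# Models whose only reported score is on a single benchmark.
single_benchmark = [
    model
    for model, row in heatmap.items()
    if sum(value is not None for value in row) == 1
]

print(coverage)
print(single_benchmark)
```

In this snapshot, no model listed has a HellaSwag score and only one has an MMLU score, so cross-benchmark comparisons here rest almost entirely on GPQA and HumanEval.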