Benchmark Leaderboard

Top models by Massive Multitask Language Understanding (MMLU) score

Rankings are valid only within this benchmark. A score gap on this benchmark does not correspond to the same gap on another benchmark, so validate any shortlist by checking neighboring rows and each model's context-window and price fit.


#	Model	Score	Setting	Source
1	GPT-5.5	92.4	—	https://enter.converge.ai/page/en-US/news/gpt-5-5-benchmarks-swe-bench-hallucination-drop
2	DeepSeek V4 Pro	90.1	5-shot	https://api-docs.deepseek.com/news/news260424
3	Xiaomi MiMo-V2.5-Pro	89.4	—	https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro
4	Claude 3 Opus	88.7	5-shot	https://crfm.stanford.edu/helm/classic/latest/
5	Claude 3.5 Sonnet	88.7	5-shot	https://crfm.stanford.edu/helm/classic/latest/
6	GPT-4o (05-13)	88.7	5-shot	https://crfm.stanford.edu/helm/classic/latest/
7	Llama 3.1 405B	88.6	5-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
8	Llama 3.1 405B Instruct	88.6	5-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
9	DeepSeek V3	88.5	5-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
10	Qwen2.5-72B-Instruct	88.2	5-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
11	Grok-2	87.5	5-shot	https://x.ai/blog/grok-2
12	GPT-4 Turbo	86.5	5-shot	https://openai.com/index/gpt-4-research/
13	Qwen2.5-32B-Instruct	86.1	5-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
14	Qwen2.5-72B	86.1	5-shot	https://qwenlm.github.io/blog/qwen2.5-llm/
15	Llama 3.1 70B Instruct	86.0	5-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
16	Phi-4 14B	84.8	—	https://huggingface.co/microsoft/phi-4
17	Mixtral 8x22B Instruct v0.1	84.5	5-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
18	Mixtral 8x22B v0.1	84.5	5-shot	research
19	Falcon 180B	84.2	5-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
20	Qwen2-72B	84.2	5-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
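
Several rows in the table share a score (for example, ranks 4 to 6 all sit at 88.7), so the sequential rank numbers overstate the ordering among them. A minimal sketch of shared ("competition") ranking follows, using a handful of rows copied from the table. The function name `competition_rank` is illustrative and is not something this leaderboard provides.

```python
from itertools import groupby

# A few (model, MMLU score) pairs copied from the table above.
rows = [
    ("GPT-5.5", 92.4),
    ("DeepSeek V4 Pro", 90.1),
    ("Xiaomi MiMo-V2.5-Pro", 89.4),
    ("Claude 3 Opus", 88.7),
    ("Claude 3.5 Sonnet", 88.7),
    ("GPT-4o (05-13)", 88.7),
    ("Llama 3.1 405B", 88.6),
]


def competition_rank(rows):
    """Return {model: rank}; models with equal scores share the same rank."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ranks = {}
    position = 1
    for _, group in groupby(ordered, key=lambda r: r[1]):
        tied = list(group)
        for model, _ in tied:
            ranks[model] = position
        # The next distinct score skips past every tied model.
        position += len(tied)
    return ranks


ranks = competition_rank(rows)
for model, rank in ranks.items():
    print(rank, model)
```

Under this scheme the three models at 88.7 all receive rank 4, and the next distinct score receives rank 7.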

Interpretation

Trust when:

Use top rows for shortlisting, then validate provider coverage and feature fit before selecting a model.
Cross-benchmark consistency is a stronger signal than one-off score dominance.

Don't trust when:

Don't treat raw score gaps on different benchmarks as equivalent; the scales differ.
Empty cells in the heatmap can hide benchmark blind spots.
No trend data is shown, so this is a snapshot; expect scores and ranks to drift with each refresh.
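
To illustrate the first caution, the sketch below takes three models from the cross-benchmark heatmap on this page and compares the same pair on two benchmarks, first by raw gap and then by rank within each benchmark. The helper names `raw_gap` and `rank_within` are hypothetical, and rank comparison is only one possible way to put benchmarks on a common footing.

```python
# Scores copied from the cross-benchmark heatmap on this page.
scores = {
    "MMLU": {
        "GPT-5.5": 92.4,
        "DeepSeek V4 Pro": 90.1,
        "Claude 3.5 Sonnet": 88.7,
    },
    "HumanEval": {
        "GPT-5.5": 94.2,
        "DeepSeek V4 Pro": 76.8,
        "Claude 3.5 Sonnet": 92.0,
    },
}


def raw_gap(benchmark, a, b):
    """Score of model a minus score of model b, rounded to one decimal."""
    return round(scores[benchmark][a] - scores[benchmark][b], 1)


def rank_within(benchmark):
    """Rank models inside one benchmark only (1 = highest score)."""
    ordered = sorted(scores[benchmark].items(), key=lambda kv: kv[1], reverse=True)
    return {model: i + 1 for i, (model, _) in enumerate(ordered)}


mmlu_gap = raw_gap("MMLU", "GPT-5.5", "DeepSeek V4 Pro")
humaneval_gap = raw_gap("HumanEval", "GPT-5.5", "DeepSeek V4 Pro")
print(mmlu_gap, humaneval_gap)

print(rank_within("MMLU"))
print(rank_within("HumanEval"))
```

The same pair of models is separated by very different raw gaps on the two benchmarks, and the within-benchmark ordering of the three models also changes, which is why a gap on one benchmark should not be read as the same gap on another.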

Cross-benchmark heatmap

Compare top models across selected benchmark families to spot strengths and weak spots.

Model	MMLU	GPQA	HumanEval	HellaSwag
GPT-5.5	92.4	93.6	94.2	—
DeepSeek V4 Pro	90.1	90.1	76.8	—
Xiaomi MiMo-V2.5-Pro	89.4	66.7	—	—
Claude 3 Opus	88.7	—	84.9	—
Claude 3.5 Sonnet	88.7	—	92.0	96.2
GPT-4o (05-13)	88.7	—	90.2	96.4
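
Because empty cells can hide blind spots, it may help to count how many models actually have a score on each benchmark before reading across columns. The sketch below encodes the heatmap rows above with `None` for each empty cell. The names `coverage_by_benchmark` and `missing_benchmarks` are illustrative, not part of any leaderboard API.

```python
BENCHMARKS = ["MMLU", "GPQA", "HumanEval", "HellaSwag"]

# Heatmap rows copied from the table above; None marks an empty cell.
HEATMAP = {
    "GPT-5.5": [92.4, 93.6, 94.2, None],
    "DeepSeek V4 Pro": [90.1, 90.1, 76.8, None],
    "Xiaomi MiMo-V2.5-Pro": [89.4, 66.7, None, None],
    "Claude 3 Opus": [88.7, None, 84.9, None],
    "Claude 3.5 Sonnet": [88.7, None, 92.0, 96.2],
    "GPT-4o (05-13)": [88.7, None, 90.2, 96.4],
}


def coverage_by_benchmark(heatmap):
    """Count how many models report a score on each benchmark."""
    counts = {name: 0 for name in BENCHMARKS}
    for row in heatmap.values():
        for name, value in zip(BENCHMARKS, row):
            if value is not None:
                counts[name] += 1
    return counts


def missing_benchmarks(heatmap):
    """Map each model to the benchmarks it has no score for."""
    return {
        model: [name for name, value in zip(BENCHMARKS, row) if value is None]
        for model, row in heatmap.items()
    }


coverage = coverage_by_benchmark(HEATMAP)
missing = missing_benchmarks(HEATMAP)
print(coverage)
for model, gaps in missing.items():
    print(model, "is missing:", ", ".join(gaps))
```

In these six rows every model has at least one empty cell, and HellaSwag is reported for only two of them, so column-wise comparisons on HellaSwag or GPQA cover a much smaller set than the MMLU ranking does.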