Benchmark Leaderboard
Top models by Massive Multitask Language Understanding (MMLU) score
Rankings are only valid within this benchmark. A score gap on this benchmark does not correspond to the same gap on another benchmark, so validate any shortlist by checking neighboring rows and each model's context-window and price fit.
| # | Model | Score | Eval setting | Source |
|---|---|---|---|---|
| 1 | GPT-5.5 | 92.4 | — | https://enter.converge.ai/page/en-US/news/gpt-5-5-benchmarks-swe-bench-hallucination-drop |
| 2 | DeepSeek V4 Pro | 90.1 | 5-shot | https://api-docs.deepseek.com/news/news260424 |
| 3 | Xiaomi MiMo-V2.5-Pro | 89.4 | — | https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro |
| 4 | Claude 3 Opus | 88.7 | 5-shot | https://crfm.stanford.edu/helm/classic/latest/ |
| 5 | Claude 3.5 Sonnet | 88.7 | 5-shot | https://crfm.stanford.edu/helm/classic/latest/ |
| 6 | GPT-4o (05-13) | 88.7 | 5-shot | https://crfm.stanford.edu/helm/classic/latest/ |
| 7 | Llama 3.1 405B | 88.6 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 8 | Llama 3.1 405B Instruct | 88.6 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 9 | DeepSeek V3 | 88.5 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 10 | Qwen2.5-72B-Instruct | 88.2 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 11 | Grok-2 | 87.5 | 5-shot | https://x.ai/blog/grok-2 |
| 12 | GPT-4 Turbo | 86.5 | 5-shot | https://openai.com/index/gpt-4-research/ |
| 13 | Qwen2.5-32B-Instruct | 86.1 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 14 | Qwen2.5-72B | 86.1 | 5-shot | https://qwenlm.github.io/blog/qwen2.5-llm/ |
| 15 | Llama 3.1 70B Instruct | 86.0 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 16 | Phi-4 14B | 84.8 | — | https://huggingface.co/microsoft/phi-4 |
| 17 | Mixtral 8x22B Instruct v0.1 | 84.5 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 18 | Mixtral 8x22B v0.1 | 84.5 | 5-shot | research |
| 19 | Falcon 180B | 84.2 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 20 | Qwen2-72B | 84.2 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
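Several rows in the table share a score (for example, three models at 88.7) but receive consecutive rank numbers. As a minimal sketch, assuming you want tied scores to share a rank, the snippet below applies standard competition ranking to a few rows copied from the table. The function name `competition_rank` is illustrative and not part of this page or any library.

```python
from typing import List, Tuple


def competition_rank(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Assign competition ranks: tied scores share a rank, the next rank skips ahead."""
    # sorted() is stable, so tied models keep their original relative order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked = []
    prev_score = None
    prev_rank = 0
    for position, (model, score) in enumerate(ordered, start=1):
        rank = prev_rank if score == prev_score else position
        ranked.append((rank, model, score))
        prev_score, prev_rank = score, rank
    return ranked


# Rows 3 to 9 of the leaderboard table, copied as (model, score) pairs.
rows = [
    ("Xiaomi MiMo-V2.5-Pro", 89.4),
    ("Claude 3 Opus", 88.7),
    ("Claude 3.5 Sonnet", 88.7),
    ("GPT-4o (05-13)", 88.7),
    ("Llama 3.1 405B", 88.6),
    ("Llama 3.1 405B Instruct", 88.6),
    ("DeepSeek V3", 88.5),
]

for rank, model, score in competition_rank(rows):
    print(rank, model, score)
```

The ranks printed are relative to this seven-row subset, not to the full leaderboard. Treating ties this way makes it clearer that the order among equally scored models carries no information.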
Interpretation
Trust when:
- You use the top rows to build a shortlist, then validate provider coverage and feature fit before selecting a model.
- A model ranks consistently well across several benchmarks; that is a stronger signal than a one-off lead on a single benchmark.
Don't trust when:
- You are comparing raw score gaps across benchmarks with different scales; the gaps are not equivalent.
- The heatmap has missing entries, which can hide benchmark blind spots.
- You need trend data; this is a snapshot, so expect scores and ranks to drift with each refresh.
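One way to avoid comparing raw score gaps across benchmarks is to compare positions within each benchmark instead. The sketch below is an illustration under assumptions, not the method this page uses: it copies the MMLU and HumanEval columns from the cross-benchmark heatmap on this page and converts each to within-benchmark ranks. The names `SCORES` and `ranks_within` are hypothetical.

```python
from typing import Dict, Optional

# Two columns copied from the cross-benchmark heatmap on this page.
# None marks a cell the heatmap leaves empty.
SCORES: Dict[str, Dict[str, Optional[float]]] = {
    "MMLU": {
        "GPT-5.5": 92.4, "DeepSeek V4 Pro": 90.1, "Xiaomi MiMo-V2.5-Pro": 89.4,
        "Claude 3 Opus": 88.7, "Claude 3.5 Sonnet": 88.7, "GPT-4o (05-13)": 88.7,
    },
    "HumanEval": {
        "GPT-5.5": 94.2, "DeepSeek V4 Pro": 76.8, "Xiaomi MiMo-V2.5-Pro": None,
        "Claude 3 Opus": 84.9, "Claude 3.5 Sonnet": 92.0, "GPT-4o (05-13)": 90.2,
    },
}


def ranks_within(benchmark: str) -> Dict[str, int]:
    """Rank models inside one benchmark; tied scores share a rank, missing scores are skipped."""
    scored = [(m, s) for m, s in SCORES[benchmark].items() if s is not None]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    ranks: Dict[str, int] = {}
    for position, (model, score) in enumerate(scored, start=1):
        if position > 1 and score == scored[position - 2][1]:
            # Same score as the previous model: reuse its rank.
            ranks[model] = ranks[scored[position - 2][0]]
        else:
            ranks[model] = position
    return ranks


mmlu = ranks_within("MMLU")
humaneval = ranks_within("HumanEval")
for model in mmlu:
    print(model, mmlu[model], humaneval.get(model, "n/a"))
```

With these six models, DeepSeek V4 Pro is second on MMLU but last of the five models with a HumanEval score, which is the kind of inconsistency a single-benchmark sort cannot show. Ranks over such a small set are coarse, so treat them as a prompt for closer inspection and not as a verdict.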
Cross-benchmark heatmap
Compare top models across selected benchmark families to spot strengths and weak spots.
| Model | MMLU | GPQA | HumanEval | HellaSwag |
|---|---|---|---|---|
| GPT-5.5 | 92.4 | 93.6 | 94.2 | — |
| DeepSeek V4 Pro | 90.1 | 90.1 | 76.8 | — |
| Xiaomi MiMo-V2.5-Pro | 89.4 | 66.7 | — | — |
| Claude 3 Opus | 88.7 | — | 84.9 | — |
| Claude 3.5 Sonnet | 88.7 | — | 92.0 | 96.2 |
| GPT-4o (05-13) | 88.7 | — | 90.2 | 96.4 |
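Because empty cells can hide blind spots, it may help to count how many models actually have a score in each column before reading the heatmap. The following is a rough sketch that parses the table above as pasted text. It assumes the missing-value marker is the dash character shown in the table, and the names `parse_cells` and `coverage` are hypothetical helpers.

```python
from typing import Dict, List, Tuple

MISSING = "\u2014"  # the dash the heatmap uses for an empty cell

HEATMAP_MD = """\
| Model | MMLU | GPQA | HumanEval | HellaSwag |
|---|---|---|---|---|
| GPT-5.5 | 92.4 | 93.6 | 94.2 | \u2014 |
| DeepSeek V4 Pro | 90.1 | 90.1 | 76.8 | \u2014 |
| Xiaomi MiMo-V2.5-Pro | 89.4 | 66.7 | \u2014 | \u2014 |
| Claude 3 Opus | 88.7 | \u2014 | 84.9 | \u2014 |
| Claude 3.5 Sonnet | 88.7 | \u2014 | 92.0 | 96.2 |
| GPT-4o (05-13) | 88.7 | \u2014 | 90.2 | 96.4 |
"""


def parse_cells(line: str) -> List[str]:
    """Split one markdown table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def coverage(markdown: str) -> Tuple[Dict[str, int], int]:
    """Return (scores present per benchmark, number of model rows)."""
    lines = [ln for ln in markdown.splitlines() if ln.strip()]
    benchmarks = parse_cells(lines[0])[1:]
    rows = lines[2:]  # skip the header row and the |---| separator
    counts = {name: 0 for name in benchmarks}
    for line in rows:
        for name, cell in zip(benchmarks, parse_cells(line)[1:]):
            if cell != MISSING:
                counts[name] += 1
    return counts, len(rows)


counts, total = coverage(HEATMAP_MD)
for name, present in counts.items():
    print(f"{name}: {present}/{total} models scored")
```

For the six rows shown, every model has an MMLU score, but only two have a HellaSwag score and three have a GPQA score, so comparisons on those columns cover only part of the shortlist.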