LLM Reference

Benchmark Leaderboard

Top models by Massive Multitask Language Understanding (MMLU) score

Rankings are only valid within this benchmark. A score gap on this benchmark does not correspond to the same gap on another benchmark, so validate any shortlist by checking peer rows and each model's context and price fit.

| # | Model | Score | Version | Source |
|---|-------|-------|---------|--------|
| 1 | GPT-5.5 | 92.4 | | https://enter.converge.ai/page/en-US/news/gpt-5-5-benchmarks-swe-bench-hallucination-drop |
| 2 | DeepSeek V4 Pro | 90.1 | 5-shot | https://api-docs.deepseek.com/news/news260424 |
| 3 | Xiaomi MiMo-V2.5-Pro | 89.4 | | https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro |
| 4 | Claude 3 Opus | 88.7 | 5-shot | https://crfm.stanford.edu/helm/classic/latest/ |
| 5 | Claude 3.5 Sonnet | 88.7 | 5-shot | https://crfm.stanford.edu/helm/classic/latest/ |
| 6 | GPT-4o (05-13) | 88.7 | 5-shot | https://crfm.stanford.edu/helm/classic/latest/ |
| 7 | Llama 3.1 405B | 88.6 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 8 | Llama 3.1 405B Instruct | 88.6 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 9 | DeepSeek V3 | 88.5 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 10 | Qwen2.5-72B-Instruct | 88.2 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 11 | Grok-2 | 87.5 | 5-shot | https://x.ai/blog/grok-2 |
| 12 | GPT-4 Turbo | 86.5 | 5-shot | https://openai.com/index/gpt-4-research/ |
| 13 | Qwen2.5-32B-Instruct | 86.1 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 14 | Qwen2.5-72B | 86.1 | 5-shot | https://qwenlm.github.io/blog/qwen2.5-llm/ |
| 15 | Llama 3.1 70B Instruct | 86 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 16 | Phi-4 14B | 84.8 | | https://huggingface.co/microsoft/phi-4 |
| 17 | Mixtral 8x22B Instruct v0.1 | 84.5 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 18 | Mixtral 8x22B v0.1 | 84.5 | 5-shot | research |
| 19 | Falcon 180B | 84.2 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 20 | Qwen2-72B | 84.2 | 5-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
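
Checking peer rows can be sketched in a few lines of Python. The snippet below is only an illustration: the rows are the top ten MMLU scores from the leaderboard above, while the `peers` helper and its 0.5-point default tolerance are assumptions made for this example, not a threshold the leaderboard defines.

```python
# Top ten rows copied from the leaderboard above: (model, MMLU score).
ROWS = [
    ("GPT-5.5", 92.4),
    ("DeepSeek V4 Pro", 90.1),
    ("Xiaomi MiMo-V2.5-Pro", 89.4),
    ("Claude 3 Opus", 88.7),
    ("Claude 3.5 Sonnet", 88.7),
    ("GPT-4o (05-13)", 88.7),
    ("Llama 3.1 405B", 88.6),
    ("Llama 3.1 405B Instruct", 88.6),
    ("DeepSeek V3", 88.5),
    ("Qwen2.5-72B-Instruct", 88.2),
]


def peers(rows, model, tolerance=0.5):
    """Return the other models scoring within `tolerance` points of `model`.

    `tolerance` is a hypothetical cut-off chosen for illustration only.
    A tiny epsilon keeps boundary cases stable under float rounding.
    """
    scores = dict(rows)
    target = scores[model]
    return [
        name
        for name, score in rows
        if name != model and abs(score - target) <= tolerance + 1e-9
    ]


if __name__ == "__main__":
    for name in peers(ROWS, "Claude 3 Opus"):
        print(name)
```

With these rows, six other models fall within half a point of Claude 3 Opus, so a shortlist built from rank alone would treat near-ties as meaningful differences.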

Interpretation

Trust when:

  • You use the top rows to build a shortlist, then validate provider coverage and feature fit before selecting a model.
  • A model ranks consistently across benchmarks, which is a stronger signal than one-off score dominance.

Don't trust when:

  • You compare raw score gaps across benchmarks with different scales as if they were equivalent.
  • Rows are missing from the heatmap, which can hide benchmark blind spots.
  • You need trend data. This page is a snapshot, so expect scores to drift with each refresh cycle.
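
One way to act on these points is to compare ranks within each benchmark rather than raw score gaps, and to report how many benchmarks each model actually covers. The sketch below assumes a simple nested-dictionary layout; the function names, benchmark names, and scores are all placeholders invented for illustration, not results from this page.

```python
def rank_within(scores):
    """Map model -> rank (1 = best) inside one benchmark; ties share a rank."""
    ordered = sorted(set(scores.values()), reverse=True)
    position = {value: i + 1 for i, value in enumerate(ordered)}
    return {model: position[value] for model, value in scores.items()}


def consistency(table):
    """Summarise each model as (mean rank, number of benchmarks covered).

    `table` maps benchmark name -> {model: score}. A model absent from a
    benchmark is skipped for that benchmark; the coverage count keeps the
    missing row visible instead of hiding it behind the average.
    """
    ranks = {}
    for scores in table.values():
        for model, rank in rank_within(scores).items():
            ranks.setdefault(model, []).append(rank)
    return {model: (sum(r) / len(r), len(r)) for model, r in ranks.items()}


# Placeholder data, NOT real benchmark results.
TABLE = {
    "bench_x": {"model_a": 90.0, "model_b": 85.0, "model_c": 80.0},
    "bench_y": {"model_a": 40.0, "model_b": 60.0},
}

summary = consistency(TABLE)

if __name__ == "__main__":
    for model, (mean_rank, covered) in sorted(summary.items()):
        print(f"{model}: mean rank {mean_rank:.1f} over {covered} benchmark(s)")
```

Ranks discard the size of each gap, which is the intent here: a 5-point lead on one scale and a 20-point lead on another both count as one position. The coverage count flags models whose average rests on a single benchmark.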

Cross-benchmark heatmap

Compare top models across selected benchmark families to spot strengths and weak spots.

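Columns: Model | MMLU | GPQA | HumanEval | HellaSwag
(Empty cells are not shown, so each model's scores after MMLU are listed in column order without their column labels.)

GPT-5.5: MMLU 92.4; then 93.6, 94.2
DeepSeek V4 Pro: MMLU 90.1; then 90.1, 76.8
Xiaomi MiMo-V2.5-Pro: MMLU 89.4; then 66.7
Claude 3 Opus: MMLU 88.7; then 84.9
Claude 3.5 Sonnet: MMLU 88.7; then 92.0, 96.2
GPT-4o (05-13): MMLU 88.7; then 90.2, 96.4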