LLM Reference

Benchmark Leaderboard

Top models by HumanEval score

Rankings are only valid within this benchmark. A score gap on this benchmark does not map to the same gap on another benchmark, so validate any shortlist by checking neighboring rows and each model's context-window and price fit.

# | Model | Score | Version | Source
1 | o3 | 96.7 | 2025-04 | https://openai.com/index/introducing-o3-and-o4-mini/
2 | Grok-3 | 94.5 | - | https://x.ai/blog/grok-3
3 | GPT-5.5 | 94.2 | - | https://openai.com/index/introducing-gpt-5-5/
4 | Gemini 2.5 Pro | 93.1 | 2025-03 | https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
5 | Claude 3.7 Sonnet | 93 | 2025-02 | https://www.anthropic.com/news/claude-3-7-sonnet
6 | GPT-4.1 | 92.9 | 2025-04 | https://openai.com/index/gpt-4-1/
7 | Qwen2.5-Coder-32B-Instruct | 92.7 | 2024-11 | https://qwenlm.github.io/blog/qwen2.5-coder-family/
8 | Qwen3-235B-A22B | 92.7 | 2025-04 | https://qwenlm.github.io/blog/qwen3/
9 | Claude 3.5 Sonnet | 92 | pass@1 | https://crfm.stanford.edu/helm/classic/latest/
10 | Kimi K2.6 | 92 | - | https://moonshotai.github.io/Kimi-K2/
11 | Mistral Large 3 675B Instruct | 92 | - | https://mistral.ai/news/mistral-large-3/
12 | Gemini 3.5 Flash | 92 | - | https://o-mega.ai/articles/gemini-3-5-flash-benchmarks-cost-and-guide
13 | GPT-4o (05-13) | 90.2 | pass@1 | https://crfm.stanford.edu/helm/classic/latest/
14 | Gemini 2.5 Flash | 90.1 | 2025-05 | https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
15 | DeepSeek R1 | 89.9 | 2025-01 | https://arxiv.org/abs/2501.12948
16 | Granite 4.1 30B | 89.63 | - | https://huggingface.co/ibm-granite/granite-4.1-30b-instruct
17 | Llama 3.1 405B | 89 | pass@1 | https://ai.meta.com/blog/meta-llama-3-1/
18 | Grok-2 | 88.4 | pass@1 | https://x.ai/blog/grok-2
19 | Qwen2.5-32B-Instruct | 88.4 | pass@1 | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
20 | Qwen2.5-72B-Instruct | 86.6 | pass@1 | https://qwenlm.github.io/blog/qwen2.5-llm/
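
Several rows in the leaderboard share a score (ranks 7 and 8 at 92.7, ranks 9 through 12 at 92, ranks 18 and 19 at 88.4), so the order inside those groups carries no information. The following is a minimal sketch of how tied rows could be grouped before shortlisting. The `rows` list is a subset of the leaderboard copied by hand, and the helper name `tie_groups` is an illustrative choice, not part of any published tooling.

```python
from itertools import groupby

# (model, score) pairs copied from the leaderboard; a subset for brevity.
rows = [
    ("GPT-4.1", 92.9),
    ("Qwen2.5-Coder-32B-Instruct", 92.7),
    ("Qwen3-235B-A22B", 92.7),
    ("Claude 3.5 Sonnet", 92.0),
    ("Kimi K2.6", 92.0),
    ("Mistral Large 3 675B Instruct", 92.0),
    ("Gemini 3.5 Flash", 92.0),
    ("GPT-4o (05-13)", 90.2),
]


def tie_groups(rows):
    """Return [(score, [models...])], highest score first, one entry per distinct score."""
    # sorted() is stable, so models with equal scores keep their input order.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    return [
        (score, [model for model, _ in group])
        for score, group in groupby(ordered, key=lambda r: r[1])
    ]


groups = tie_groups(rows)
for score, models in groups:
    if len(models) > 1:
        # Models in the same group should be treated as equivalent on this benchmark.
        print(f"{score}: {', '.join(models)}")
```

Treating each group as a single tier avoids reading meaning into positions that differ only by row order.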

Interpretation

Trust when:

  • You use the top rows for shortlisting, then validate provider coverage and feature fit before selecting a model.
  • A model scores well on several benchmarks; cross-benchmark consistency is a stronger signal than one-off score dominance.
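
One way to make "cross-benchmark consistency" concrete is to compare ranks instead of raw scores, since ranks do not depend on each benchmark's scale. The sketch below is only an illustration: the benchmark names, model names, and scores are placeholders and are not taken from this leaderboard, and using the worst rank as the consistency measure is an assumption, not a method the page prescribes.

```python
# Placeholder scores for illustration only; NOT taken from the leaderboard.
scores = {
    "bench_x": {"model_a": 96.0, "model_b": 94.0, "model_c": 90.0},
    "bench_y": {"model_a": 70.0, "model_b": 85.0, "model_c": 80.0},
}


def ranks(bench_scores):
    """Map model -> 1-based rank within one benchmark (higher score is better)."""
    ordered = sorted(bench_scores, key=bench_scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_profile(scores):
    """Map model -> list of its ranks, one per benchmark where it has a score."""
    profile = {}
    for bench_scores in scores.values():
        for model, rank in ranks(bench_scores).items():
            profile.setdefault(model, []).append(rank)
    return profile


profile = rank_profile(scores)

# Assumed consistency measure: the model whose worst rank is best.
most_consistent = min(profile, key=lambda model: max(profile[model]))
print(profile)
print(most_consistent)
```

In this placeholder data, `model_a` leads one benchmark but places last on the other, while `model_b` never ranks below second, which is the pattern the bullet above describes.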

Don't trust when:

  • You are comparing raw score gaps across benchmarks with different scales as if they were equivalent.
  • The heatmap is missing rows, which can hide benchmark blind spots.
  • You need trend data. This is a snapshot, so expect scores to drift in later refresh cycles.
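
The blind-spot warning can be checked mechanically by counting how many heatmap cells each model actually has. The sketch below uses the values listed per model in the cross-benchmark heatmap on this page and only counts them, so it does not depend on which column each value belongs to. The 0.5 threshold and the helper names are assumptions chosen for illustration.

```python
BENCHMARKS = ("MMLU", "GPQA", "HumanEval", "HellaSwag")

# Values listed for each model in the heatmap; only the count per model is used here.
populated = {
    "o3": [87.7, 96.7],
    "Grok-3": [84.6, 94.5],
    "GPT-5.5": [92.4, 93.6, 94.2],
    "Gemini 2.5 Pro": [86.4, 93.1],
    "Claude 3.7 Sonnet": [93.0],
    "GPT-4.1": [92.9],
}


def coverage(populated, n_benchmarks=len(BENCHMARKS)):
    """Map model -> fraction of heatmap benchmarks that have a score."""
    return {model: len(values) / n_benchmarks for model, values in populated.items()}


def sparse_models(populated, min_coverage=0.5):
    """Return models whose coverage is below min_coverage (an assumed cutoff)."""
    return [model for model, frac in coverage(populated).items() if frac < min_coverage]


cov = coverage(populated)
sparse = sparse_models(populated)
print(cov)
print(sparse)
```

Models flagged here have too few cells to support any cross-benchmark claim, so their heatmap rows say little beyond the HumanEval score already shown in the leaderboard.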

Cross-benchmark heatmap

Compare top models across selected benchmark families to spot strengths and weak spots.

Model | MMLU | GPQA | HumanEval | HellaSwag
o3 | - | 87.7 | 96.7 | -
Grok-3 | - | 84.6 | 94.5 | -
GPT-5.5 | 92.4 | 93.6 | 94.2 | -
Gemini 2.5 Pro | - | 86.4 | 93.1 | -
Claude 3.7 Sonnet | - | - | 93.0 | -
GPT-4.1 | - | - | 92.9 | -