Benchmark Leaderboard
Top models by HumanEval score
Rankings are valid only within this benchmark. A score gap here does not correspond to the same gap on another benchmark, so validate any shortlist by comparing neighboring rows and checking each model's context and price fit.
| # | Model | Score | Version / metric | Source |
|---|---|---|---|---|
| 1 | o3 | 96.7 | 2025-04 | https://openai.com/index/introducing-o3-and-o4-mini/ |
| 2 | Grok-3 | 94.5 | — | https://x.ai/blog/grok-3 |
| 3 | GPT-5.5 | 94.2 | — | https://openai.com/index/introducing-gpt-5-5/ |
| 4 | Gemini 2.5 Pro | 93.1 | 2025-03 | https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf |
| 5 | Claude 3.7 Sonnet | 93 | 2025-02 | https://www.anthropic.com/news/claude-3-7-sonnet |
| 6 | GPT-4.1 | 92.9 | 2025-04 | https://openai.com/index/gpt-4-1/ |
| 7 | Qwen2.5-Coder-32B-Instruct | 92.7 | 2024-11 | https://qwenlm.github.io/blog/qwen2.5-coder-family/ |
| 8 | Qwen3-235B-A22B | 92.7 | 2025-04 | https://qwenlm.github.io/blog/qwen3/ |
| 9 | Claude 3.5 Sonnet | 92 | pass@1 | https://crfm.stanford.edu/helm/classic/latest/ |
| 10 | Kimi K2.6 | 92 | — | https://moonshotai.github.io/Kimi-K2/ |
| 11 | Mistral Large 3 675B Instruct | 92 | — | https://mistral.ai/news/mistral-large-3/ |
| 12 | Gemini 3.5 Flash | 92 | — | https://o-mega.ai/articles/gemini-3-5-flash-benchmarks-cost-and-guide |
| 13 | GPT-4o (05-13) | 90.2 | pass@1 | https://crfm.stanford.edu/helm/classic/latest/ |
| 14 | Gemini 2.5 Flash | 90.1 | 2025-05 | https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf |
| 15 | DeepSeek R1 | 89.9 | 2025-01 | https://arxiv.org/abs/2501.12948 |
| 16 | Granite 4.1 30B | 89.63 | — | https://huggingface.co/ibm-granite/granite-4.1-30b-instruct |
| 17 | Llama 3.1 405B | 89 | pass@1 | https://ai.meta.com/blog/meta-llama-3-1/ |
| 18 | Grok-2 | 88.4 | pass@1 | https://x.ai/blog/grok-2 |
| 19 | Qwen2.5-32B-Instruct | 88.4 | pass@1 | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 20 | Qwen2.5-72B-Instruct | 86.6 | pass@1 | https://qwenlm.github.io/blog/qwen2.5-llm/ |
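Several rows above share a score (rows 7 and 8 at 92.7, rows 9 to 12 at 92, rows 18 and 19 at 88.4), so their relative order carries no information. One way to make that explicit is standard competition ranking, where tied scores share a rank. The sketch below is illustrative only: `competition_ranks` is a hypothetical helper, not something this leaderboard is known to use, and the sample holds the scores of rows 5 to 13.

```python
def competition_ranks(scores):
    """Return 1-based ranks in input order; equal scores share the best rank."""
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        # Keep only the first (best) position seen for each distinct score.
        first_position.setdefault(score, position)
    return [first_position[score] for score in scores]


# Scores of rows 5 to 13 in the table above.
sample = [93.0, 92.9, 92.7, 92.7, 92.0, 92.0, 92.0, 92.0, 90.2]
print(competition_ranks(sample))  # [1, 2, 3, 3, 5, 5, 5, 5, 9]
```

The ranks are relative to the sample, so the first entry corresponds to row 5 of the table.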
Interpretation
Trust when:
- You use the top rows to build a shortlist, then validate provider coverage and feature fit before selecting a model.
- A model ranks consistently well across several benchmarks; that is a stronger signal than a one-off lead on a single benchmark.
Don't trust when:
- You are comparing raw score gaps across benchmarks with different scales; the gaps are not equivalent.
- The heatmap is missing entries for a model; the gaps can hide benchmark blind spots.
- You need a trend. This page is a snapshot with no trend data, so expect scores to drift with each refresh.
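As a rough illustration of the points about cross-benchmark consistency and non-equivalent score gaps, the sketch below compares models by their rank within each benchmark instead of by raw score. It is a minimal example under stated assumptions: the scores are copied from the heatmap on this page, only models that report both GPQA and HumanEval are included, and `ranks_within` and `mean_rank` are hypothetical helper names.

```python
# Scores copied from the heatmap on this page. Models missing either
# benchmark are deliberately left out of this sketch.
SCORES = {
    "o3": {"GPQA": 87.7, "HumanEval": 96.7},
    "Grok-3": {"GPQA": 84.6, "HumanEval": 94.5},
    "GPT-5.5": {"GPQA": 93.6, "HumanEval": 94.2},
    "Gemini 2.5 Pro": {"GPQA": 86.4, "HumanEval": 93.1},
}


def ranks_within(benchmark, scores):
    """Rank models on one benchmark (1 = best). Ties are not handled."""
    ordered = sorted(scores, key=lambda m: scores[m][benchmark], reverse=True)
    return {model: rank for rank, model in enumerate(ordered, start=1)}


def mean_rank(scores, benchmarks):
    """Average each model's within-benchmark rank across the given benchmarks."""
    per_benchmark = [ranks_within(b, scores) for b in benchmarks]
    return {
        model: sum(ranks[model] for ranks in per_benchmark) / len(per_benchmark)
        for model in scores
    }


result = mean_rank(SCORES, ["GPQA", "HumanEval"])
for model, rank in sorted(result.items(), key=lambda item: item[1]):
    print(f"{model}: {rank}")
```

Ranks discard the size of each gap, which is the intent here: o3 leads GPT-5.5 by 2.5 points on HumanEval and trails it by 5.9 points on GPQA, and those two gaps are not directly comparable.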
Cross-benchmark heatmap
Compare top models across selected benchmark families to spot strengths and weak spots.
| Model | MMLU | GPQA | HumanEval | HellaSwag |
|---|---|---|---|---|
| o3 | — | 87.7 | 96.7 | — |
| Grok-3 | — | 84.6 | 94.5 | — |
| GPT-5.5 | 92.4 | 93.6 | 94.2 | — |
| Gemini 2.5 Pro | — | 86.4 | 93.1 | — |
| Claude 3.7 Sonnet | — | — | 93.0 | — |
| GPT-4.1 | — | — | 92.9 | — |
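Because most cells in this heatmap are empty, it can help to quantify coverage before reading anything into it. The sketch below is one possible way to do that, assuming the table is transcribed by hand with `None` standing in for a missing cell; `model_coverage` and `benchmark_counts` are hypothetical helper names.

```python
BENCHMARKS = ["MMLU", "GPQA", "HumanEval", "HellaSwag"]

# Rows copied from the heatmap above; None marks a missing cell.
HEATMAP = {
    "o3": [None, 87.7, 96.7, None],
    "Grok-3": [None, 84.6, 94.5, None],
    "GPT-5.5": [92.4, 93.6, 94.2, None],
    "Gemini 2.5 Pro": [None, 86.4, 93.1, None],
    "Claude 3.7 Sonnet": [None, None, 93.0, None],
    "GPT-4.1": [None, None, 92.9, None],
}


def model_coverage(heatmap):
    """Fraction of benchmarks with a reported score, per model."""
    return {
        model: sum(value is not None for value in row) / len(row)
        for model, row in heatmap.items()
    }


def benchmark_counts(heatmap, benchmarks):
    """Number of models reporting a score, per benchmark."""
    return {
        name: sum(row[i] is not None for row in heatmap.values())
        for i, name in enumerate(benchmarks)
    }


print(model_coverage(HEATMAP))
print(benchmark_counts(HEATMAP, BENCHMARKS))
```

For the six rows shown, no model reports HellaSwag and only GPT-5.5 reports MMLU, so those two columns say little about how these models compare.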