# Benchmark Leaderboard

Top models ranked by HumanEval score.
| # | Model | Score (%) | Version / Metric | Source |
|---|---|---|---|---|
| 1 | o3 | 96.7 | 2025-04 | OpenAI |
| 2 | Gemini 2.5 Pro | 93.1 | 2025-03 | Google DeepMind |
| 3 | Claude 3.7 Sonnet | 93 | 2025-02 | Anthropic |
| 4 | GPT-4.1 | 92.9 | 2025-04 | OpenAI |
| 5 | Qwen2.5 72B | 92.7 | pass@1 | research |
| 6 | Qwen2.5 72B Instruct | 92.7 | pass@1 | Open LLM Leaderboard |
| 7 | Qwen2.5 Coder 32B Instruct | 92.7 | 2024-11 | Qwen Team |
| 8 | Qwen3 235B A22B | 92.7 | 2025-04 | Qwen Team |
| 9 | Claude 3.5 Sonnet | 92 | pass@1 | HELM, official documentation |
| 10 | GPT-4o (05-13) | 90.2 | pass@1 | HELM, Open LLM Leaderboard |
| 11 | Gemini 2.5 Flash | 90.1 | 2025-05 | Google DeepMind |
| 12 | DeepSeek R1 | 89.9 | 2025-01 | DeepSeek |
| 13 | Llama 3.1 405B | 89 | pass@1 | Open LLM Leaderboard, Meta official |
| 14 | Grok-2 | 88.4 | pass@1 | Open LLM Leaderboard, xAI official |
| 15 | Qwen2.5 32B Instruct | 88.4 | pass@1 | Open LLM Leaderboard |
| 16 | Mixtral 8x22B Instruct v0.1 | 86.2 | pass@1 | Open LLM Leaderboard |
| 17 | Mixtral 8x22B v0.1 | 86.2 | pass@1 | research |
| 18 | DeepSeek V3 | 85.5 | pass@1 | Open LLM Leaderboard, DeepSeek official |
| 19 | DeepSeek V3 0324 | 85.5 | 2025-03 | DeepSeek |
| 20 | Falcon 180B | 85.1 | pass@1 | Open LLM Leaderboard |
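The pass@1 figures above follow the standard HumanEval protocol: for each problem, a model generates `n` candidate solutions, `c` of which pass the unit tests, and the score is averaged over all problems. A minimal sketch of the unbiased pass@k estimator (the `(n, c)` pairs below are hypothetical):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated for a problem
    c: samples that pass the unit tests
    k: evaluation budget
    Returns the estimated probability that at least one of
    k randomly drawn samples passes.
    """
    if n - c < k:
        return 1.0  # every draw of k samples contains a passing one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark score = mean pass@k over problems, as a percentage.
# Hypothetical (n, c) pairs for three problems:
results = [(10, 9), (10, 4), (10, 10)]
score = 100 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

For `k = 1` the estimator reduces to `c / n`, so the score is simply the average fraction of passing samples per problem.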