Benchmark Leaderboard
Top models by Massive Multitask Language Understanding (MMLU) benchmark score; all listed results use the 5-shot evaluation setting.
| Rank | Model | Score | Setting | Source |
|---|---|---|---|---|
| 1 | Claude 3 Opus | 88.7 | 5-shot | HELM |
| 2 | Claude 3.5 Sonnet | 88.7 | 5-shot | HELM, official documentation |
| 3 | GPT-4o (05-13) | 88.7 | 5-shot | HELM, Open LLM Leaderboard |
| 4 | Llama 3.1 405B | 88.6 | 5-shot | Open LLM Leaderboard |
| 5 | Llama 3.1 405B Instruct | 88.6 | 5-shot | Open LLM Leaderboard |
| 6 | DeepSeek V3 | 88.5 | 5-shot | Open LLM Leaderboard |
| 7 | Qwen2.5 72B | 88.2 | 5-shot | research |
| 8 | Qwen2.5 72B Instruct | 88.2 | 5-shot | Open LLM Leaderboard |
| 9 | Grok-2 | 87.5 | 5-shot | Open LLM Leaderboard, xAI official |
| 10 | GPT-4 Turbo | 86.5 | 5-shot | OpenAI |
| 11 | Qwen2.5 32B Instruct | 86.1 | 5-shot | Open LLM Leaderboard |
| 12 | Llama 3.1 70B Instruct | 86.0 | 5-shot | Open LLM Leaderboard |
| 13 | Mixtral 8x22B Instruct v0.1 | 84.5 | 5-shot | Open LLM Leaderboard |
| 14 | Mixtral 8x22B v0.1 | 84.5 | 5-shot | research |
| 15 | Falcon 180B | 84.2 | 5-shot | Open LLM Leaderboard |
| 16 | Qwen2 72B | 84.2 | 5-shot | Open LLM Leaderboard |
| 17 | Qwen2.5 14B Instruct | 84.2 | 5-shot | Open LLM Leaderboard |
| 18 | Mistral Large 2 | 84.0 | 5-shot | Open LLM Leaderboard |
| 19 | Mistral Medium | 82.9 | 5-shot | research |
| 20 | Gemma 2 27B Instruct | 82.3 | 5-shot | Open LLM Leaderboard |
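A leaderboard like the one above is straightforward to process programmatically. The following is a minimal sketch (not an official tool) that parses markdown table rows of this shape into records and verifies the descending-score ordering; the excerpt of rows embedded in the string is an assumption taken from the table for illustration.

```python
# Minimal sketch: parse markdown leaderboard rows into records and
# check they are sorted by descending score. The excerpt below copies
# three rows from the table above; any rows in the same format work.

TABLE = """\
| 1 | Claude 3 Opus | 88.7 | 5-shot | HELM |
| 2 | Claude 3.5 Sonnet | 88.7 | 5-shot | HELM, official documentation |
| 12 | Llama 3.1 70B Instruct | 86.0 | 5-shot | Open LLM Leaderboard |
"""

def parse_rows(text):
    """Split each '| a | b | ... |' line into a dict of typed fields."""
    rows = []
    for line in text.strip().splitlines():
        # Drop the outer pipes, then split on the inner ones.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        rank, model, score, setting, source = cells
        rows.append({
            "rank": int(rank),
            "model": model,
            "score": float(score),
            "setting": setting,
            "source": source,
        })
    return rows

rows = parse_rows(TABLE)

# The leaderboard invariant: scores never increase as rank increases.
assert rows == sorted(rows, key=lambda r: -r["score"])
print(rows[0]["model"], rows[0]["score"])
```

Note that ties (several models at 88.7) are preserved by the stable sort, so the check holds even when adjacent rows share a score.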