Benchmark Leaderboard
Top models by HellaSwag score
Rankings are only valid within this benchmark. A score gap on HellaSwag does not correspond to the same gap on another benchmark, so validate any shortlist by checking peer rows and each model's context-window and price fit.
| # | Model | Score | Version | Source |
|---|---|---|---|---|
| 1 | GPT-4o (05-13) | 96.4 | 10-shot | https://crfm.stanford.edu/helm/classic/latest/ |
| 2 | Claude 3.5 Sonnet | 96.2 | 10-shot | https://crfm.stanford.edu/helm/classic/latest/ |
| 3 | Llama 3.1 405B | 95.8 | 10-shot | https://ai.meta.com/blog/meta-llama-3-1/ |
| 4 | DeepSeek V3 | 95.7 | 10-shot | https://arxiv.org/abs/2412.19437 |
| 5 | Qwen2.5-72B-Instruct | 95.6 | standard | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 6 | Llama 3.1 70B Instruct | 94.2 | 10-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 7 | Mistral Medium | 93.9 | 10-shot | research |
| 8 | Mistral Large 2 | 93.8 | 10-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 9 | Mixtral 8x22B v0.1 | 93.8 | 10-shot | research |
| 10 | Falcon 180B | 92.7 | 10-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 11 | Gemma 2 27B | 92.6 | 10-shot | research |
| 12 | Llama 3 70B | 92.4 | 10-shot | research |
| 13 | Qwen2-7B | 92.0 | 10-shot | https://arxiv.org/abs/2407.10671 |
| 14 | Mistral NeMo Instruct (2407) | 91.8 | 10-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 15 | StarCoder2 15B | 91.7 | 10-shot | research |
| 16 | DeepSeek Coder V2 Lite | 91.4 | 10-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 17 | Llama 3 8B Instruct | 91.1 | 10-shot | research |
| 18 | Mixtral 8x7B | 90.9 | 10-shot | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard |
| 19 | Command R | 90.8 | 10-shot | research |
| 20 | Phi-3 Small 128K | 90.8 | 10-shot | https://arxiv.org/abs/2404.14219 |
Interpretation
Trust when:
- You use the top rows for shortlisting, then validate provider coverage and feature fit before selecting a model.
- A model ranks consistently well across benchmarks; that is a stronger signal than dominance on a single score.
Don't trust when:
- You are comparing raw score gaps across benchmarks with different scales; those gaps are not equivalent.
- The heatmap has missing rows or cells, which can hide benchmark blind spots.
- You need trend data. None is available, so this is a snapshot, and scores may shift with each refresh.
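To illustrate why raw gaps are not comparable across benchmarks, here is a minimal sketch that expresses a gap as a fraction of the score spread within one benchmark. The scores are copied from the heatmap on this page (only the columns with no missing cells). The `relative_gap` helper and the choice of range normalization are assumptions made for illustration, not a method this leaderboard uses.

```python
# Scores copied from this page's cross-benchmark heatmap.
# GPQA is left out because most of its cells are missing.
SCORES = {
    "MMLU": {
        "GPT-4o (05-13)": 88.7, "Claude 3.5 Sonnet": 88.7,
        "Llama 3.1 405B": 88.6, "DeepSeek V3": 88.5,
        "Qwen2.5-72B-Instruct": 88.2, "Llama 3.1 70B Instruct": 86.0,
    },
    "HumanEval": {
        "GPT-4o (05-13)": 90.2, "Claude 3.5 Sonnet": 92.0,
        "Llama 3.1 405B": 89.0, "DeepSeek V3": 85.5,
        "Qwen2.5-72B-Instruct": 86.6, "Llama 3.1 70B Instruct": 84.1,
    },
    "HellaSwag": {
        "GPT-4o (05-13)": 96.4, "Claude 3.5 Sonnet": 96.2,
        "Llama 3.1 405B": 95.8, "DeepSeek V3": 95.7,
        "Qwen2.5-72B-Instruct": 95.6, "Llama 3.1 70B Instruct": 94.2,
    },
}


def relative_gap(benchmark, model_a, model_b, scores=SCORES):
    """Gap between two models as a fraction of the benchmark's score spread.

    Hypothetical helper: range normalization is one simple option among
    several (z-scores or error-rate ratios would be others).
    """
    column = scores[benchmark]
    spread = max(column.values()) - min(column.values())
    return (column[model_a] - column[model_b]) / spread


if __name__ == "__main__":
    a, b = "Claude 3.5 Sonnet", "DeepSeek V3"
    for name in SCORES:
        raw = SCORES[name][a] - SCORES[name][b]
        print(f"{name}: raw gap {raw:.1f}, relative gap {relative_gap(name, a, b):.2f}")
```

For Claude 3.5 Sonnet versus DeepSeek V3, the 0.5-point HellaSwag gap covers roughly a quarter of the spread among these six models, while the 6.5-point HumanEval gap covers about four fifths of it. The normalized values depend on which models are in the comparison set, so treat them as a sanity check rather than a calibrated scale.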
Cross-benchmark heatmap
Compare top models across selected benchmark families to spot strengths and weak spots.
| Model | MMLU | GPQA | HumanEval | HellaSwag |
|---|---|---|---|---|
| GPT-4o (05-13) | 88.7 | — | 90.2 | 96.4 |
| Claude 3.5 Sonnet | 88.7 | — | 92.0 | 96.2 |
| Llama 3.1 405B | 88.6 | 51.5 | 89.0 | 95.8 |
| DeepSeek V3 | 88.5 | — | 85.5 | 95.7 |
| Qwen2.5-72B-Instruct | 88.2 | 38.4 | 86.6 | 95.6 |
| Llama 3.1 70B Instruct | 86.0 | — | 84.1 | 94.2 |
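One way to read the heatmap for cross-benchmark consistency is to average each model's rank over the benchmarks where every model has a score, and to list the sparse columns separately so that missing cells do not distort the result. The sketch below does this with the values from the table above. The names `competition_ranks` and `summarize` are hypothetical, and skipping incomplete columns is an assumption chosen for simplicity.

```python
from statistics import mean

BENCHMARKS = ["MMLU", "GPQA", "HumanEval", "HellaSwag"]

# Values copied from the heatmap above; None marks a missing cell.
HEATMAP = {
    "GPT-4o (05-13)": [88.7, None, 90.2, 96.4],
    "Claude 3.5 Sonnet": [88.7, None, 92.0, 96.2],
    "Llama 3.1 405B": [88.6, 51.5, 89.0, 95.8],
    "DeepSeek V3": [88.5, None, 85.5, 95.7],
    "Qwen2.5-72B-Instruct": [88.2, 38.4, 86.6, 95.6],
    "Llama 3.1 70B Instruct": [86.0, None, 84.1, 94.2],
}


def competition_ranks(column):
    """Rank 1 is the highest score; tied scores share the same rank."""
    return {
        model: 1 + sum(other > score for other in column.values())
        for model, score in column.items()
    }


def summarize(heatmap, benchmarks):
    """Return (mean rank per model over complete columns, sparse column names)."""
    complete, sparse = [], []
    for i, name in enumerate(benchmarks):
        if all(row[i] is not None for row in heatmap.values()):
            complete.append(i)
        else:
            sparse.append(name)

    ranks_per_model = {model: [] for model in heatmap}
    for i in complete:
        column = {model: row[i] for model, row in heatmap.items()}
        for model, rank in competition_ranks(column).items():
            ranks_per_model[model].append(rank)

    # Assumes at least one complete column; mean() raises on an empty list.
    mean_rank = {model: mean(ranks) for model, ranks in ranks_per_model.items()}
    return mean_rank, sparse


mean_rank, sparse = summarize(HEATMAP, BENCHMARKS)

if __name__ == "__main__":
    for model, rank in sorted(mean_rank.items(), key=lambda item: item[1]):
        print(f"{model}: mean rank {rank:.2f}")
    print("Columns skipped for missing cells:", ", ".join(sparse))
```

GPQA is reported as a skipped column here because only two of the six models have a score, which is the kind of blind spot the interpretation notes warn about. A mean rank discards the size of each gap, so it complements the raw scores rather than replacing them.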