Benchmark Leaderboard
Top models by HellaSwag score
| # | Model | Score (accuracy, %) | Eval setting | Source |
|---|---|---|---|---|
| 1 | GPT-4o (05-13) | 96.4 | 10-shot | HELM, Open LLM Leaderboard |
| 2 | Claude 3.5 Sonnet | 96.2 | 10-shot | HELM, official documentation |
| 3 | Llama 3.1 405B | 95.8 | 10-shot | Open LLM Leaderboard, Meta official |
| 4 | DeepSeek V3 | 95.7 | 10-shot | Open LLM Leaderboard, DeepSeek official |
| 5 | Qwen2.5 72B | 95.6 | 10-shot | research |
| 6 | Qwen2.5 72B Instruct | 95.6 | standard | Open LLM Leaderboard |
| 7 | Llama 3.1 70B Instruct | 94.2 | 10-shot | Open LLM Leaderboard |
| 8 | Mistral Medium | 93.9 | 10-shot | research |
| 9 | Mistral Large 2 | 93.8 | 10-shot | Open LLM Leaderboard |
| 10 | Mixtral 8x22B v0.1 | 93.8 | 10-shot | research |
| 11 | Falcon 180B | 92.7 | 10-shot | Open LLM Leaderboard |
| 12 | Gemma 2 27B | 92.6 | 10-shot | research |
| 13 | Llama 3 70B | 92.4 | 10-shot | research |
| 14 | Qwen2 7B | 92.0 | 10-shot | research |
| 15 | Mistral NeMo Instruct (2407) | 91.8 | 10-shot | Open LLM Leaderboard |
| 16 | StarCoder2 15B | 91.7 | 10-shot | research |
| 17 | DeepSeek Coder V2 Lite | 91.4 | 10-shot | Open LLM Leaderboard |
| 18 | Llama 3 8B Instruct | 91.1 | 10-shot | research |
| 19 | Mixtral 8x7B | 90.9 | 10-shot | Open LLM Leaderboard |
| 20 | Command R | 90.8 | 10-shot | research |