LLM Reference

Benchmark Leaderboard

Top models by HellaSwag score

Rankings are only valid within this benchmark. A score gap on HellaSwag does not correspond to the same gap on another benchmark, so validate any shortlist by checking neighboring rows and each model's context and price fit.

#   Model                         Score  Version   Source
1   GPT-4o (05-13)                96.4   10-shot   https://crfm.stanford.edu/helm/classic/latest/
2   Claude 3.5 Sonnet             96.2   10-shot   https://crfm.stanford.edu/helm/classic/latest/
3   Llama 3.1 405B                95.8   10-shot   https://ai.meta.com/blog/meta-llama-3-1/
4   DeepSeek V3                   95.7   10-shot   https://arxiv.org/abs/2412.19437
5   Qwen2.5-72B-Instruct          95.6   standard  https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
6   Llama 3.1 70B Instruct        94.2   10-shot   https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
7   Mistral Medium                93.9   10-shot   research
8   Mistral Large 2               93.8   10-shot   https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
9   Mixtral 8x22B v0.1            93.8   10-shot   research
10  Falcon 180B                   92.7   10-shot   https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
11  Gemma 2 27B                   92.6   10-shot   research
12  Llama 3 70B                   92.4   10-shot   research
13  Qwen2-7B                      92     10-shot   https://arxiv.org/abs/2407.10671
14  Mistral NeMo Instruct (2407)  91.8   10-shot   https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
15  StarCoder2 15B                91.7   10-shot   research
16  DeepSeek Coder V2 Lite        91.4   10-shot   https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
17  Llama 3 8B Instruct           91.1   10-shot   research
18  Mixtral 8x7B                  90.9   10-shot   https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
19  Command R                     90.8   10-shot   research
20  Phi-3 Small 128K              90.8   10-shot   https://arxiv.org/abs/2404.14219
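
One way to check neighboring rows before shortlisting is to list every model whose score falls within a small window of a candidate's score. The sketch below is illustrative only: it copies the first six rows of the table above, and the `peer_rows` helper and its `window` threshold of 0.3 points are hypothetical choices, not part of this leaderboard.

```python
# First six rows of the HellaSwag table above: (rank, model, score).
ROWS = [
    (1, "GPT-4o (05-13)", 96.4),
    (2, "Claude 3.5 Sonnet", 96.2),
    (3, "Llama 3.1 405B", 95.8),
    (4, "DeepSeek V3", 95.7),
    (5, "Qwen2.5-72B-Instruct", 95.6),
    (6, "Llama 3.1 70B Instruct", 94.2),
]


def peer_rows(rows, model, window=0.3):
    """Return the models scoring within `window` points of `model`.

    `window` is an arbitrary illustrative threshold, not a statistical
    significance bound.
    """
    target = next(score for _, name, score in rows if name == model)
    return [
        name
        for _, name, score in rows
        if name != model and abs(score - target) <= window
    ]


if __name__ == "__main__":
    for _, name, _ in ROWS:
        print(f"{name}: peers = {peer_rows(ROWS, name)}")
```

A candidate with several peers inside the window is effectively tied on this benchmark, so context, price, and feature fit are likely to matter more than the rank itself.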

Interpretation

Trust when:

  • You use the top rows to build a shortlist, then validate provider coverage and feature fit before selecting a model.
  • A model ranks consistently across several benchmarks; that is a stronger signal than a one-off lead on a single benchmark.

Don't trust when:

  • You treat raw score gaps on benchmarks with different scales as equivalent.
  • The heatmap has missing cells for a model; gaps in coverage can hide benchmark blind spots.
  • You need trend data; this page is a snapshot, so expect scores to drift between refresh cycles.
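
To illustrate why raw gaps are not comparable across benchmarks, the sketch below computes the per-benchmark gap between the two models that have a score in every column of the cross-benchmark heatmap on this page. The `raw_gaps` helper is a hypothetical name introduced for this example.

```python
# The two rows of the cross-benchmark heatmap that have all four scores.
SCORES = {
    "Llama 3.1 405B": {
        "MMLU": 88.6, "GPQA": 51.5, "HumanEval": 89.0, "HellaSwag": 95.8,
    },
    "Qwen2.5-72B-Instruct": {
        "MMLU": 88.2, "GPQA": 38.4, "HumanEval": 86.6, "HellaSwag": 95.6,
    },
}


def raw_gaps(a, b, scores):
    """Score of model `a` minus score of model `b`, per benchmark."""
    return {
        bench: round(scores[a][bench] - scores[b][bench], 1)
        for bench in scores[a]
    }


if __name__ == "__main__":
    gaps = raw_gaps("Llama 3.1 405B", "Qwen2.5-72B-Instruct", SCORES)
    for bench, gap in gaps.items():
        print(f"{bench:<10} {gap:+.1f}")
```

The same pair of models is separated by 0.2 points on HellaSwag and 13.1 points on GPQA, so a fixed-size gap cannot be read the same way on both benchmarks.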

Cross-benchmark heatmap

Compare top models across selected benchmark families to spot strengths and weak spots.

Model                     MMLU   GPQA   HumanEval  HellaSwag
GPT-4o (05-13)            88.7   -      90.2       96.4
Claude 3.5 Sonnet         88.7   -      92.0       96.2
Llama 3.1 405B            88.6   51.5   89.0       95.8
DeepSeek V3               88.5   -      85.5       95.7
Qwen2.5-72B-Instruct      88.2   38.4   86.6       95.6
Llama 3.1 70B Instruct    86.0   -      84.1       94.2
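
A rough way to read the heatmap for cross-benchmark consistency is to convert each column to ranks and to list which cells are missing. The sketch below is a minimal illustration, not the method used to build this page. It assumes that models listed with only three scores are missing the GPQA value, since their last score matches the HellaSwag table and their middle score sits in the HumanEval range. The helper names `rank_table` and `missing_cells` are hypothetical.

```python
BENCHMARKS = ["MMLU", "GPQA", "HumanEval", "HellaSwag"]

# Heatmap scores in BENCHMARKS order. None marks a missing cell.
# Assumption: where only three scores are listed, GPQA is the missing one.
HEATMAP = {
    "GPT-4o (05-13)": [88.7, None, 90.2, 96.4],
    "Claude 3.5 Sonnet": [88.7, None, 92.0, 96.2],
    "Llama 3.1 405B": [88.6, 51.5, 89.0, 95.8],
    "DeepSeek V3": [88.5, None, 85.5, 95.7],
    "Qwen2.5-72B-Instruct": [88.2, 38.4, 86.6, 95.6],
    "Llama 3.1 70B Instruct": [86.0, None, 84.1, 94.2],
}


def rank_table(heatmap):
    """Per-benchmark ranks (1 = best, ties share a rank); None if missing."""
    table = {}
    for model, scores in heatmap.items():
        ranks = []
        for col, score in enumerate(scores):
            if score is None:
                ranks.append(None)
                continue
            # Rank = 1 + number of models with a strictly higher score.
            higher = sum(
                1
                for other in heatmap.values()
                if other[col] is not None and other[col] > score
            )
            ranks.append(1 + higher)
        table[model] = ranks
    return table


def missing_cells(heatmap, benchmarks):
    """Map each model with gaps to the benchmarks it has no score for."""
    return {
        model: [b for b, s in zip(benchmarks, scores) if s is None]
        for model, scores in heatmap.items()
        if None in scores
    }


if __name__ == "__main__":
    for model, ranks in rank_table(HEATMAP).items():
        print(model, dict(zip(BENCHMARKS, ranks)))
    print("Missing:", missing_cells(HEATMAP, BENCHMARKS))
```

Ranks in a sparsely filled column, such as GPQA here, are computed over only the models that have a score, so they should not be compared directly with ranks from fully populated columns.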