Benchmark Leaderboard

Top models by HellaSwag score

Rankings are valid only within this benchmark. A score gap on HellaSwag does not correspond to the same gap on another benchmark, so validate any shortlist by checking neighboring rows and each model's context-window and price fit.


#	Model	Score	Eval setting	Source
1	GPT-4o (05-13)	96.4	10-shot	https://crfm.stanford.edu/helm/classic/latest/
2	Claude 3.5 Sonnet	96.2	10-shot	https://crfm.stanford.edu/helm/classic/latest/
3	Llama 3.1 405B	95.8	10-shot	https://ai.meta.com/blog/meta-llama-3-1/
4	DeepSeek V3	95.7	10-shot	https://arxiv.org/abs/2412.19437
5	Qwen2.5-72B-Instruct	95.6	standard	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
6	Llama 3.1 70B Instruct	94.2	10-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
7	Mistral Medium	93.9	10-shot	research
8	Mistral Large 2	93.8	10-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
9	Mixtral 8x22B v0.1	93.8	10-shot	research
10	Falcon 180B	92.7	10-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
11	Gemma 2 27B	92.6	10-shot	research
12	Llama 3 70B	92.4	10-shot	research
13	Qwen2-7B	92.0	10-shot	https://arxiv.org/abs/2407.10671
14	Mistral NeMo Instruct (2407)	91.8	10-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
15	StarCoder2 15B	91.7	10-shot	research
16	DeepSeek Coder V2 Lite	91.4	10-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
17	Llama 3 8B Instruct	91.1	10-shot	research
18	Mixtral 8x7B	90.9	10-shot	https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
19	Command R	90.8	10-shot	research
20	Phi-3 Small 128K	90.8	10-shot	https://arxiv.org/abs/2404.14219

Interpretation

Trust when:

You use the top rows to build a shortlist, then validate provider coverage and feature fit before selecting a model.
A model ranks consistently well across benchmarks, which is a stronger signal than a single standout score.

Don't trust when:

You are comparing raw score gaps across benchmarks with different scales as if they were equivalent.
Entries are missing from the heatmap, since gaps can hide benchmark blind spots.
You need trend data: this page is a snapshot, so expect scores to drift between refresh cycles.
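
One way to act on these cautions is to compare ranks within each benchmark instead of raw score gaps. The sketch below is illustrative only: the scores are copied from the cross-benchmark heatmap on this page, the function names (`ranks_within`, `fully_covered`, `mean_rank`) are hypothetical, and averaging ranks is an assumption chosen for simplicity, not the method this leaderboard uses.

```python
from typing import Optional

# Scores copied from the cross-benchmark heatmap on this page.
# None marks a cell with no listed score.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "GPT-4o (05-13)": {"MMLU": 88.7, "GPQA": None, "HumanEval": 90.2, "HellaSwag": 96.4},
    "Claude 3.5 Sonnet": {"MMLU": 88.7, "GPQA": None, "HumanEval": 92.0, "HellaSwag": 96.2},
    "Llama 3.1 405B": {"MMLU": 88.6, "GPQA": 51.5, "HumanEval": 89.0, "HellaSwag": 95.8},
    "DeepSeek V3": {"MMLU": 88.5, "GPQA": None, "HumanEval": 85.5, "HellaSwag": 95.7},
    "Qwen2.5-72B-Instruct": {"MMLU": 88.2, "GPQA": 38.4, "HumanEval": 86.6, "HellaSwag": 95.6},
    "Llama 3.1 70B Instruct": {"MMLU": 86.0, "GPQA": None, "HumanEval": 84.1, "HellaSwag": 94.2},
}


def ranks_within(benchmark: str) -> dict[str, int]:
    """Rank models on one benchmark (1 = best); tied scores share a rank.

    Models with no score on the benchmark are left out of the result.
    """
    present = {
        model: row[benchmark]
        for model, row in SCORES.items()
        if row[benchmark] is not None
    }
    return {
        model: 1 + sum(1 for other in present.values() if other > score)
        for model, score in present.items()
    }


def fully_covered() -> list[str]:
    """Benchmarks on which every model has a score."""
    benchmarks = next(iter(SCORES.values())).keys()
    return [
        b for b in benchmarks
        if all(row[b] is not None for row in SCORES.values())
    ]


def mean_rank() -> dict[str, float]:
    """Average each model's rank over the fully covered benchmarks only."""
    covered = fully_covered()
    per_benchmark = {b: ranks_within(b) for b in covered}
    return {
        model: sum(per_benchmark[b][model] for b in covered) / len(covered)
        for model in SCORES
    }


if __name__ == "__main__":
    for model, rank in sorted(mean_rank().items(), key=lambda kv: kv[1]):
        print(f"{model}: {rank:.2f}")
```

GPQA is excluded from the average because only two of the six models have a score there; including it would mix ranks drawn from pools of different sizes.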

Cross-benchmark heatmap

Compare top models across selected benchmark families to spot strengths and weaknesses. A dash in a cell means no score is listed for that model on that benchmark.

Model	MMLU	GPQA	HumanEval	HellaSwag
GPT-4o (05-13)	88.7	—	90.2	96.4
Claude 3.5 Sonnet	88.7	—	92.0	96.2
Llama 3.1 405B	88.6	51.5	89.0	95.8
DeepSeek V3	88.5	—	85.5	95.7
Qwen2.5-72B-Instruct	88.2	38.4	86.6	95.6
Llama 3.1 70B Instruct	86.0	—	84.1	94.2
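
Before reading anything into this table, it can help to count how many models actually have a score on each benchmark. The following is a minimal sketch that assumes the heatmap is available as tab-separated text with a dash for missing cells, as it appears above; `parse_heatmap` and `coverage` are hypothetical helper names, not part of any leaderboard API.

```python
from typing import Optional

MISSING = "\u2014"  # the dash the heatmap uses for "no score listed"

# The heatmap above, reproduced as tab-separated text.
HEATMAP_TSV = (
    "Model\tMMLU\tGPQA\tHumanEval\tHellaSwag\n"
    "GPT-4o (05-13)\t88.7\t\u2014\t90.2\t96.4\n"
    "Claude 3.5 Sonnet\t88.7\t\u2014\t92.0\t96.2\n"
    "Llama 3.1 405B\t88.6\t51.5\t89.0\t95.8\n"
    "DeepSeek V3\t88.5\t\u2014\t85.5\t95.7\n"
    "Qwen2.5-72B-Instruct\t88.2\t38.4\t86.6\t95.6\n"
    "Llama 3.1 70B Instruct\t86.0\t\u2014\t84.1\t94.2\n"
)


def parse_heatmap(tsv: str) -> dict[str, dict[str, Optional[float]]]:
    """Turn the tab-separated heatmap into {model: {benchmark: score or None}}."""
    lines = [line for line in tsv.splitlines() if line.strip()]
    benchmarks = lines[0].split("\t")[1:]  # drop the "Model" column header
    table: dict[str, dict[str, Optional[float]]] = {}
    for line in lines[1:]:
        model, *cells = line.split("\t")
        table[model] = {
            bench: None if cell.strip() == MISSING else float(cell)
            for bench, cell in zip(benchmarks, cells)
        }
    return table


def coverage(table: dict[str, dict[str, Optional[float]]]) -> dict[str, int]:
    """Count, per benchmark, how many models have a listed score."""
    counts: dict[str, int] = {}
    for row in table.values():
        for bench, score in row.items():
            counts[bench] = counts.get(bench, 0) + int(score is not None)
    return counts


if __name__ == "__main__":
    table = parse_heatmap(HEATMAP_TSV)
    for bench, n in coverage(table).items():
        print(f"{bench}: {n}/{len(table)} models scored")
```

On the six rows shown here, GPQA has scores for only two models, so any GPQA comparison among this group covers a third of the list.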