LLM Reference

Benchmark Leaderboard

Top models ranked by Massive Multitask Language Understanding (MMLU) score

| #  | Model                       | Score | Version | Source                             |
|----|-----------------------------|-------|---------|------------------------------------|
| 1  | Claude 3 Opus               | 88.7  | 5-shot  | HELM                               |
| 2  | Claude 3.5 Sonnet           | 88.7  | 5-shot  | HELM, official documentation       |
| 3  | GPT-4o (05-13)              | 88.7  | 5-shot  | HELM, Open LLM Leaderboard         |
| 4  | Llama 3.1 405B              | 88.6  | 5-shot  | Open LLM Leaderboard               |
| 5  | Llama 3.1 405B Instruct     | 88.6  | 5-shot  | Open LLM Leaderboard               |
| 6  | DeepSeek V3                 | 88.5  | 5-shot  | Open LLM Leaderboard               |
| 7  | Qwen2.5 72B                 | 88.2  | 5-shot  | research                           |
| 8  | Qwen2.5 72B Instruct        | 88.2  | 5-shot  | Open LLM Leaderboard               |
| 9  | Grok-2                      | 87.5  | 5-shot  | Open LLM Leaderboard, xAI official |
| 10 | GPT-4 Turbo                 | 86.5  | 5-shot  | OpenAI                             |
| 11 | Qwen2.5 32B Instruct        | 86.1  | 5-shot  | Open LLM Leaderboard               |
| 12 | Llama 3.1 70B Instruct      | 86.0  | 5-shot  | Open LLM Leaderboard               |
| 13 | Mixtral 8x22B Instruct v0.1 | 84.5  | 5-shot  | Open LLM Leaderboard               |
| 14 | Mixtral 8x22B v0.1          | 84.5  | 5-shot  | research                           |
| 15 | Falcon 180B                 | 84.2  | 5-shot  | Open LLM Leaderboard               |
| 16 | Qwen2 72B                   | 84.2  | 5-shot  | Open LLM Leaderboard               |
| 17 | Qwen2.5 14B Instruct        | 84.2  | 5-shot  | Open LLM Leaderboard               |
| 18 | Mistral Large 2             | 84.0  | 5-shot  | Open LLM Leaderboard               |
| 19 | Mistral Medium              | 82.9  | 5-shot  | research                           |
| 20 | Gemma 2 27B Instruct        | 82.3  | 5-shot  | Open LLM Leaderboard               |
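The "5-shot" column means each MMLU question is preceded by five solved example questions in the prompt, and the model is scored on the multiple-choice answer it produces for the final, unsolved one. A minimal sketch of how such a prompt is typically assembled (the question text, function names, and A–D formatting here are illustrative assumptions, not the exact format used by HELM or the Open LLM Leaderboard):

```python
def format_question(question, choices, answer=None):
    """Render one multiple-choice question; include the answer only for solved shots."""
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    # Solved examples end with "Answer: X"; the target ends with a bare "Answer:"
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(examples, target):
    """Concatenate k solved examples (the 'shots') before the target question."""
    shots = [format_question(q, c, a) for q, c, a in examples]
    return "\n\n".join(shots + [format_question(*target)])
```

With five entries in `examples`, the resulting prompt contains five worked answers followed by the open question, which is what the 5-shot scores above measure.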