Benchmarks
73 evaluation benchmarks for language models
Categories: Arena, Coding, Composite, General, Knowledge, Long context, Mathematics, Multilingual, Reasoning, Safety, Tool use, Understanding, Other
HumanEval
HumanEval
Evaluates Python code generation from docstrings, scored with unit tests
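HumanEval results are typically reported as pass@k, estimated with the unbiased estimator from the original Codex paper. A minimal sketch of that estimator (the per-problem values are then averaged over the benchmark's 164 problems; names here are illustrative):

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: n samples drawn for a problem, c of them pass the unit tests."""
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)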
MBPP
Mostly Basic Programming Problems
Tests code generation in Python
HumanEval+
HumanEval+
MBPP+
Mostly Basic Programming Problems+
Spider
Spider
Tests text-to-SQL generation over cross-domain databases
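Spider systems are commonly scored with exact-match or execution-based metrics. A rough sketch of an execution-style check on a SQLite database (the helper name and simplifications are illustrative, not the official Spider evaluator):

    import sqlite3

    def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
        # Rough execution check: do the predicted and gold queries return the same rows?
        conn = sqlite3.connect(db_path)
        try:
            pred_rows = sorted(conn.execute(predicted_sql).fetchall())
            gold_rows = sorted(conn.execute(gold_sql).fetchall())
            return pred_rows == gold_rows
        except sqlite3.Error:
            return False  # the predicted query failed to parse or execute
        finally:
            conn.close()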
ToolTalk
ToolTalk
Evaluates tool-using LLM assistants in a conversational setting
GPQA
Google-Proof Q&A
Evaluates graduate-level reasoning capabilities
DROP
Discrete Reasoning Over Paragraphs
Tests discrete reasoning, such as counting and arithmetic, over paragraphs of text
ARC
AI2 Reasoning Challenge
Tests knowledge-based question answering on grade-school science questions
HellaSwag
HellaSwag
Evaluates commonsense reasoning through sentence completion
ANLS*
Average Normalized Levenshtein Similarity*
A metric for evaluating visual question answering and related tasks on documents
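ANLS-style scoring takes, for each question, the best normalized Levenshtein similarity between the prediction and any reference answer, zeroes it when the normalized edit distance reaches a threshold (typically 0.5), and averages over questions. A minimal sketch of the per-question score for plain string answers (the full ANLS* metric also covers lists and structured outputs; function names and defaults here are illustrative):

    def levenshtein(a: str, b: str) -> int:
        # Classic edit distance via dynamic programming.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[len(b)]

    def anls_score(prediction: str, references: list[str], tau: float = 0.5) -> float:
        # Best similarity against any reference; zero if the normalized distance is >= tau.
        best = 0.0
        for ref in references:
            p, r = prediction.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        return best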
TruthfulQA
TruthfulQA
Assesses the truthfulness of model responses
WinoGrande
WinoGrande
Tests commonsense reasoning with Winograd-style fill-in-the-blank problems
ARC-E
AI2 Reasoning Challenge-Easy
BoolQ
Boolean Questions
CommonsenseQA
CommonsenseQA
MMMU
Massive Multi-discipline Multimodal Understanding
Tests college-level knowledge and reasoning on multimodal questions across many disciplines
MathVista
MathVista
Tests mathematical reasoning in visual contexts
MedQA
MedQA
OpenBookQA
OpenBookQA
PIQA
Physical Interaction: Question Answering
SocialIQA
Social Intelligence QA
COPA
Choice of Plausible Alternatives
LAMBADA
LAnguage Modeling Broadened to Account for Discourse Aspects
RACE
ReAding Comprehension Dataset From Examinations
SQuAD
Stanford Question Answering Dataset
QuAC
Question Answering in Context
NaturalQ
Natural Questions
MultiNLI
Multi-Genre Natural Language Inference
SciQ
SciQ
CRASS
Counterfactual Reasoning Assessment
ACI-BENCH
Ambient Clinical Intelligence Benchmark
MS-MARCO
MAchine Reading COmprehension Dataset
QMSum
Query-based Multi-domain Meeting Summarization
HHH
Helpfulness, Honesty, Harmlessness
RAI
Responsible AI
CodeXGLUE
CodeXGLUE
LLM Judge
LLM Judge
LLM-Eval
LLM-Eval
JudgeLM
JudgeLM
Prometheus
Prometheus
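The judge-style entries above (LLM Judge, LLM-Eval, JudgeLM, Prometheus) share a common pattern: an evaluator model is prompted with a rubric and asked to grade a response. A generic, hypothetical sketch of that pattern (the prompt wording and the call_judge callable are illustrative, not any specific benchmark's protocol):

    JUDGE_PROMPT = (
        "You are an impartial judge. Rate the assistant's answer to the user's question "
        "on a scale of 1-5 for correctness and helpfulness.\n"
        "Question: {question}\nAnswer: {answer}\n"
        "Reply with only the integer rating."
    )

    def judge_score(question: str, answer: str, call_judge) -> int:
        # call_judge is any text-in/text-out function wrapping the evaluator model.
        reply = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
        digits = [ch for ch in reply if ch.isdigit()]
        return int(digits[0]) if digits else 0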
DocVQA
DocVQA
C-Eval
Chinese Evaluation Suite
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels.
CMMLU
Chinese Massive Multitask Language Understanding
CMMLU is a comprehensive benchmark specifically designed to evaluate the knowledge and reasoning abilities of LLMs within the context of the Chinese language and culture.
GAOKAO
GAOKAO
An evaluation benchmark based on a dataset of Chinese college entrance examination questions.
GAOKAO-MM
GAOKAO-MM
A multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO).