Benchmarks
88 evaluation benchmarks for language models
Agent
LoCoMo
Long-context memory benchmark for conversational agents evaluating ability to recall and reason over extended multi-session dialogue histories.
RealTalk
Real-world dialogue evaluation benchmark for conversational agents. Limited public documentation available.
ADRS (UCB-ADRS)
UC Berkeley agentic decision-making reliability and safety benchmark testing coherence, persistence, and safety in long-horizon tasks. NOTE: slug contains parentheses — recommend renaming to 'adrs'.
Arena
Chatbot Arena
Chatbot Arena
Crowdsourced human preference leaderboard using pairwise blind comparisons of LLM responses. Operated by LMSYS/UC Berkeley; rebranded and moved to lmarena.ai in 2024.
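Leaderboards of this kind aggregate pairwise outcomes into ratings. A toy Elo-style update illustrates the idea; the K-factor of 32 is illustrative only, and LMSYS's actual pipeline fits a Bradley-Terry model rather than running online Elo:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a pairwise comparison.
    score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

With equal ratings, a win moves each model half of K: elo_update(1000, 1000, 1.0) yields (1016, 984).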
MT-Bench
MT-Bench
80 multi-turn conversation questions across 8 categories evaluated by GPT-4 as judge. Scores range from 1-10 per turn.
Coding
HumanEval
HumanEval
164 Python coding problems measuring functional correctness of code generation via pass@k metric. Released by OpenAI in 2021; HumanEval+ provides a more rigorous extension.
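The pass@k metric is conventionally computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass, and estimate the probability that at least one of k draws is correct. A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples,
    c of which pass the tests: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, 20 passing samples out of 200 gives pass@1 = 0.1.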
MBPP
Mostly Basic Programming Problems
974 Python programming problems for evaluating code generation, ranging from beginner to intermediate difficulty.
HumanEval+
HumanEval+
HumanEval+ extends HumanEval with an average of 764 test cases per problem (vs 9.6 in original), greatly increasing evaluation rigor for edge case coverage.
MBPP+
Mostly Basic Programming Problems+
MBPP+ extends MBPP to roughly 105 test cases per problem on average, about a 35x increase over the original's 3, for more rigorous code generation evaluation.
Spider
Spider
Large-scale text-to-SQL benchmark with 10,181 questions across 200 databases from 138 domains, requiring cross-domain generalization.
ToolTalk
ToolTalk
Conversational tool-use benchmark with 78 conversations requiring multi-step API calls across 28 tools including calendar, email, and messaging.
LiveCodeBench
Continuously updated coding benchmark sourcing new problems from LeetCode, AtCoder, and Codeforces post-May 2023 to eliminate contamination. Evaluates code generation, self-repair, and execution prediction.
CRUXEval
800 Python function input-output reasoning problems testing code execution understanding rather than code generation. Two tasks: input prediction and output prediction.
SWE-bench Verified
500 human-validated GitHub issue resolution tasks from SWE-bench, created with OpenAI in August 2024. The standard evaluation for agentic coding systems. Top performers (2026) exceed 78% resolved.
Aider Polyglot
Real-world code editing benchmark measuring a model's ability to apply changes to existing codebases across 8 programming languages using Exercism platform exercises.
BigCodeBench
1,140 complex Python programming tasks spanning diverse real-world domains requiring multi-library function calls. Two variants: Complete (function completion) and Instruct (natural language to code).
SWE-bench Pro
731-task multilingual real-world GitHub issue benchmark extending SWE-bench Verified with harder, more diverse tasks across Python, JavaScript, TypeScript, Java, Go, C++, and Rust.
Composite
BBH
BIG-Bench Hard
23 challenging tasks from BIG-Bench where even large models scored below human performance without chain-of-thought. Tests logical deduction, causal judgment, and formal fallacies.
WildBench
WildBench
Automated LLM evaluation using 1,024 real user queries from WildChat with LLM-as-judge scoring. Achieves 0.98 Spearman correlation with Chatbot Arena Elo. V2 with periodic updates.
AGIEval
Artificial General Intelligence Eval
Evaluation suite based on 20 official human-centric standardized exams (SAT, LSAT, GRE, bar exam) spanning English and Chinese.
General
MMLU
Massive Multitask Language Understanding
Tests LLMs on undergraduate to professional level knowledge across 57 subjects with 15,908 multiple-choice questions. Top models now saturate at 86–90%; MMLU-Pro is the recommended harder successor.
IFEval
Instruction-Following Evaluation
IFEval measures LLM ability to follow verifiable formatting instructions (e.g., output length, keyword inclusion, capitalization) with exact-match checking.
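Because IFEval instructions are verifiable, scoring reduces to programmatic checks. Two hypothetical verifiers sketch the idea; these are not the official implementation:

```python
def check_max_words(response: str, limit: int) -> bool:
    """Verify an 'answer in at most N words' instruction."""
    return len(response.split()) <= limit

def check_keyword(response: str, keyword: str) -> bool:
    """Verify a 'response must include the keyword X' instruction."""
    return keyword.lower() in response.lower()
```

A response either satisfies each check or it does not, so no LLM judge is needed.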
MMLU-Pro
MMLU-Pro
Enhanced MMLU with 10-option questions (vs 4 in the original) and harder, expert-reviewed problems that reduce shortcut exploitation. Consolidates coverage into 14 disciplines with over 12,000 questions.
Holistic
Knowledge
Long context
ZeroSCROLLS/QuALITY
ZeroSCROLLS/QuALITY
ZeroSCROLLS suite's QuALITY subset: multiple-choice QA on long-form documents (articles, books) requiring reading up to 10K tokens for answer derivation.
InfiniteBench/En.MC
InfiniteBench/En.MC
English multiple-choice subset of InfiniteBench designed for 100K+ token context evaluation, requiring retrieval and reasoning over very long documents.
NIH/Multi-needle
NIH/Multi-needle
Multi-needle retrieval benchmark extending the 'needle in a haystack' test to require simultaneous retrieval of multiple facts embedded in long contexts.
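A multi-needle haystack can be built by embedding several fact sentences at controlled positions in filler text; a minimal construction sketch (the filler and needle contents are illustrative):

```python
import random

def build_haystack(filler: list[str], needles: list[str], seed: int = 0) -> str:
    """Insert each needle sentence at a random position in the
    filler sentences, then join into one long context string."""
    rng = random.Random(seed)
    doc = list(filler)
    for needle in needles:
        doc.insert(rng.randrange(len(doc) + 1), needle)
    return " ".join(doc)
```

The model is then asked to recall all embedded facts at once; scoring counts how many needles it retrieves.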
RULER
RULER
Flexible benchmark testing effective context utilization across 13 diverse long-context tasks including retrieval, multi-hop reasoning, and aggregation at configurable lengths.
Mathematics
GSM8K
Grade School Math 8K
8,500 grade school math word problems requiring 2–8 steps of arithmetic reasoning. Widely used for evaluating chain-of-thought reasoning; top models now near 100%.
MATH
Mathematics Aptitude Test of Heuristics
12,500 challenging competition mathematics problems from AMC, AIME, and other high-school math competitions across 7 subjects and 5 difficulty levels.
Multilingual
Multimodal
Reasoning
GPQA
Google-Proof Q&A
PhD-level multiple-choice questions in biology, physics, and chemistry designed so that skilled non-experts score only about 34% even with unrestricted web access, while domain PhDs reach about 65%. The Diamond subset (198 questions) is the hardest variant and the one used in most frontier model evaluations.
DROP
Discrete Reasoning Over Paragraphs
Reading comprehension benchmark requiring numerical reasoning, counting, sorting, and set operations over paragraphs.
ARC
AI2 Reasoning Challenge
Grade-school science multiple-choice questions partitioned into Easy and Challenge (hard) sets. ARC-Challenge is the standard evaluation variant.
HellaSwag
HellaSwag
Commonsense sentence-completion benchmark using adversarially filtered wrong answers. Top LLMs now exceed 95% accuracy.
ANLS*
Average Normalized Levenshtein Similarity*
Extended ANLS metric for evaluating visual question answering on documents, supporting free-form answer evaluation beyond exact match.
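The underlying ANLS score compares a prediction against each gold answer via normalized Levenshtein distance, taking the best match and zeroing out similarities below a threshold (0.5 in the original DocVQA formulation). A minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, golds: list[str], tau: float = 0.5) -> float:
    """ANLS for one question: best normalized similarity over the
    gold answers, clipped to 0 when it falls below threshold tau."""
    best = 0.0
    for gold in golds:
        denom = max(len(prediction), len(gold)) or 1
        sim = 1.0 - levenshtein(prediction.lower(), gold.lower()) / denom
        best = max(best, sim)
    return best if best >= tau else 0.0
```

The threshold makes the metric tolerant of OCR-level typos while still rejecting answers that are mostly wrong.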
TruthfulQA
TruthfulQA
817 questions spanning 38 categories where humans commonly hold misconceptions. Measures whether models give truthful answers or reproduce popular falsehoods.
WinoGrande
WinoGrande
44,000 adversarially filtered Winograd schema problems for commonsense reasoning, created at scale to reduce dataset biases.
ARC-E
AI2 Reasoning Challenge-Easy
The easy partition of ARC with 5,197 grade-school science multiple-choice questions answerable without complex reasoning.
BoolQ
Boolean Questions
15,942 yes/no reading comprehension questions derived from real Google searches paired with Wikipedia passages.
CommonsenseQA
CommonsenseQA
12,102 commonsense reasoning multiple-choice questions crowdsourced using ConceptNet to target implicit world knowledge.
OlympiadBench
8,952 bilingual (Chinese/English) math and physics Olympiad problems from national/international competitions, testing advanced scientific reasoning with precise symbolic answers.
MathVerse
2,612 visual math problems, each rendered in 6 versions (text-dominant, text-lite, text-only, vision-intensive, vision-dominant, vision-only) to isolate how much multimodal mathematical reasoning depends on the diagram versus the text.
ARC Prize / ARC Challenge
Abstraction and Reasoning Corpus (ARC-AGI) with visual grid puzzles testing core reasoning and novel problem-solving. ARC Prize 2024-2025 competitions track annual progress. NOTE: slug contains invalid characters '/' — recommend renaming to 'arc-prize'.
DynaBench
Dynamic adversarial data collection platform generating ever-harder NLU benchmarks through human-and-model-in-the-loop annotation to prevent contamination.
Safety
ToxiGen
ToxiGen
274,000 machine-generated toxic and benign statements about 13 minority groups, using adversarial classifier-in-the-loop decoding for implicit hate speech evaluation.
RealToxicity
RealToxicityPrompts
100,000 naturally occurring web text prompts for measuring the propensity of language models to generate toxic continuations using the Perspective API.
CrowS-Pairs
CrowS-Pairs
1,508 crowdsourced sentence pairs measuring stereotypical bias in masked language models across 9 categories including race, gender, and religion.
BBQ Ambig
BBQ Ambig
Ambiguous context subset of the Bias Benchmark for QA (BBQ), measuring how models respond to social bias questions when context is underspecified.
BBQ Disambig
BBQ Disambig
Disambiguated context subset of BBQ that tests whether additional clarifying context reduces biased model predictions.
Winogender
Winogender
720 Winograd-style schemas testing gender bias in pronoun resolution, checking whether models associate professions with gender stereotypes.
Winobias 1_2
Winobias 1_2
Type 1 WinoBias coreference schemas in which syntactic cues are absent, so resolving the pronoun requires world knowledge; measures occupational gender bias.
Winobias 2_2
Winobias 2_2
Type 2 WinoBias coreference schemas in which syntactic cues disambiguate the pronoun, testing whether models nonetheless default to occupational gender stereotypes.
Tool use
BFCL
BFCL
Berkeley Function Calling Leaderboard (BFCL) evaluating LLM ability to correctly call functions/APIs with proper arguments across diverse domains. Currently on v3.
Nexus
Nexus
Function calling benchmark from NexusFlow evaluating LLM tool use across zero-shot multi-function scenarios with nested function calls.
Understanding
Other
MMMU
Massive Multi-discipline Multimodal Understanding
11,500+ vision-language questions spanning 30 subjects across six core disciplines (art & design, business, science, health & medicine, humanities & social science, tech & engineering). Evaluates college-level multimodal reasoning.
MathVista
MathVista
Mathematical reasoning benchmark combining visual contexts (charts, geometry diagrams, scientific figures) from 28 existing datasets with 6,141 total problems.
MedQA
MedQA
Medical question answering benchmark based on USMLE licensing exams with 12,723 questions across US, Mainland China, and Taiwan medical boards.
OpenBookQA
OpenBookQA
5,957 elementary science multiple-choice questions requiring both open-book facts and broader commonsense knowledge to answer.
PIQA
Physical Interaction: Question Answering
16,000 physical commonsense reasoning questions asking models to choose between two physical solutions for everyday tasks.
SocialIQA
Social Intelligence QA
38,000 multiple-choice questions about everyday social situations requiring inference about emotions, motivations, and social norms.
COPA
Choice of Plausible Alternatives
1,000 causal reasoning questions asking models to choose the more plausible cause or effect for everyday scenarios.
LAMBADA
LAnguage Modeling Broadened to Account for Discourse Aspects
Long-range language modeling benchmark predicting the last word of ~2,600 passages that require understanding broad discourse context.
RACE
ReAding Comprehension Dataset From Examinations
28,000 reading comprehension passages with multiple-choice questions from Chinese middle and high school English exams.
SQuAD
Stanford Question Answering Dataset
100,000+ reading comprehension questions on 500+ Wikipedia articles. SQuAD 2.0 (2018) added unanswerable questions. Both versions remain widely cited.
QuAC
Question Answering in Context
98,407 conversational QA turns from roughly 14,000 Wikipedia dialogues requiring multi-turn reasoning and follow-up question answering.
NaturalQ
Natural Questions
307,373 real Google search queries answered by extracting spans from Wikipedia, with short answer, long answer, and no-answer categories.
MultiNLI
Multi-Genre Natural Language Inference
433,000 hypothesis-premise pairs across 10 genres (fiction, telephone, travel, etc.) for textual entailment classification.
SciQ
SciQ
13,679 multiple-choice science questions across physics, chemistry, biology, and earth science, each paired with a supporting paragraph.
CRASS
Counterfactual Reasoning Assessment
Counterfactual reasoning benchmark testing LLMs on hypothetical scenarios and their logical implications.
ACI-BENCH
Ambient Clinical Intelligence Benchmark
Benchmark for generating clinical notes from doctor-patient conversation transcripts to evaluate ambient clinical intelligence systems.
MS-MARCO
Microsoft MAchine Reading COmprehension Dataset
Large-scale machine reading comprehension and passage ranking benchmark from real Bing search queries, widely used for information retrieval evaluation.
QMSum
Query-based Multi-domain Meeting Summarization
Query-focused meeting summarization benchmark with 1,808 query-summary pairs across parliamentary, product, and academic meeting transcripts.
HHH
Helpfulness, Honesty, Harmlessness
Anthropic's alignment evaluation covering three core dimensions: helpfulness, honesty, and harmlessness, via human preference comparisons between model responses.
RAI
Responsible AI
Microsoft's framework for automated measurement of responsible AI harms in generative AI applications across fairness, reliability, safety, and privacy dimensions.
CodeXGLUE
CodeXGLUE
10 code intelligence tasks across code completion, code translation, code summarization, and code search for multiple programming languages.
LLM Judge
LLM Judge
LLM-as-a-judge evaluation paradigm using strong models (GPT-4) to score responses, enabling scalable assessment of open-ended generation quality.
LLM-Eval
LLM-Eval
Unified multi-dimensional automatic evaluation framework for LLMs assessing helpfulness, engagement, safety, and relevance in a single pass.
JudgeLM
JudgeLM
Fine-tuned language models trained to be scalable judges for open-ended LLM evaluation, achieving high agreement with human preferences.
Prometheus
Prometheus
LLM evaluation framework using rubric-based scoring with reference materials, enabling fine-grained absolute/relative assessment without GPT-4 dependency.
DocVQA
DocVQA
Visual question answering on 12,767 document images (scanned forms, invoices, tables) requiring understanding of text layout and document structure.
C-Eval
Chinese Evaluation Suite
Comprehensive Chinese-language evaluation suite with 13,948 multiple-choice questions across 52 disciplines and 4 difficulty levels (middle school to professional).
CMMLU
Chinese Massive Multitask Language Understanding
67-task Chinese MMLU equivalent covering 11,528 questions across natural science, social science, STEM, and Chinese-specific cultural knowledge.
GAOKAO
GAOKAO
Evaluation benchmark using questions from China's college entrance examination (Gaokao) across math, language arts, science, and social studies.
GAOKAO-MM
GAOKAO-MM
Multimodal extension of GAOKAO with text+image questions from Chinese college entrance exams, evaluating vision-language model capabilities.