Benchmarks
88 evaluation benchmarks for language models
Agent
LoCoMo
Long-context memory benchmark for conversational agents evaluating ability to recall and reason over extended multi-session dialogue histories.
RealTalk
Real-world dialogue evaluation benchmark for conversational agents. Limited public documentation available.
ADRS (UCB-ADRS)
UC Berkeley agentic decision-making reliability and safety benchmark testing coherence, persistence, and safety in long-horizon tasks. NOTE: slug contains parentheses — recommend renaming to 'adrs'.
Arena
Chatbot Arena
Chatbot Arena
Crowdsourced human preference leaderboard using pairwise blind comparisons of LLM responses. Operated by LMSYS/UC Berkeley; rebranded and moved to lmarena.ai in 2024.
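Leaderboards of this kind aggregate pairwise outcomes into ratings. A toy Elo-style update illustrates the idea; the K-factor of 32 is illustrative only, and LMSYS's actual pipeline fits a Bradley-Terry model rather than running online Elo:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a pairwise comparison.
    score_a: 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

With equal ratings, a win moves each model half of K: elo_update(1000, 1000, 1.0) yields (1016, 984).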
MT-Bench
MT-Bench
80 multi-turn conversation questions across 8 categories evaluated by GPT-4 as judge. Scores range from 1-10 per turn.
Coding
HumanEval
HumanEval
164 Python coding problems measuring functional correctness of code generation via pass@k metric. Released by OpenAI in 2021; HumanEval+ provides a more rigorous extension.
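The pass@k metric is conventionally computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass, and estimate the probability that at least one of k draws is correct. A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples,
    c of which pass the tests: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, 20 passing samples out of 200 gives pass@1 = 0.1.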
MBPP
Mostly Basic Programming Problems
974 Python programming problems for evaluating code generation, ranging from beginner to intermediate difficulty.
HumanEval+
HumanEval+
HumanEval+ extends HumanEval with an average of 764 test cases per problem (vs 9.6 in original), greatly increasing evaluation rigor for edge case coverage.
MBPP+
Mostly Basic Programming Problems+
MBPP+ extends MBPP to roughly 105 test cases per problem on average, about a 35x increase over the original's 3, for more rigorous code generation evaluation.
Spider
Spider
Large-scale text-to-SQL benchmark with 10,181 questions across 200 databases from 138 domains, requiring cross-domain generalization.
ToolTalk
ToolTalk
Conversational tool-use benchmark with 78 conversations requiring multi-step API calls across 28 tools including calendar, email, and messaging.
LiveCodeBench
Continuously updated coding benchmark sourcing new problems from LeetCode, AtCoder, and Codeforces post-May 2023 to eliminate contamination. Evaluates code generation, self-repair, and execution prediction.
CRUXEval
800 Python function input-output reasoning problems testing code execution understanding rather than code generation. Two tasks: input prediction and output prediction.
SWE-bench Verified
500 human-validated GitHub issue resolution tasks from SWE-bench, created with OpenAI in August 2024. The standard evaluation for agentic coding systems. Top performers (2026) exceed 78% resolved.
Aider Polyglot
Real-world code editing benchmark measuring a model's ability to apply changes to existing codebases across 8 programming languages using Exercism platform exercises.
BigCodeBench
1,140 complex Python programming tasks spanning diverse real-world domains requiring multi-library function calls. Two variants: Complete (function completion) and Instruct (natural language to code).
SWE-bench Pro
731-task multilingual real-world GitHub issue benchmark extending SWE-bench Verified with harder, more diverse tasks across Python, JavaScript, TypeScript, Java, Go, C++, and Rust.
Composite
BBH
BIG-Bench Hard
23 challenging tasks from BIG-Bench where even large models scored below human performance without chain-of-thought. Tests logical deduction, causal judgment, and formal fallacies.
WildBench
WildBench
Automated LLM evaluation using 1,024 real user queries from WildChat with LLM-as-judge scoring. Achieves 0.98 Spearman correlation with Chatbot Arena Elo. V2 with periodic updates.
AGIEval
Artificial General Intelligence Eval
Evaluation suite based on 20 official human-centric standardized exams (SAT, LSAT, GRE, bar exam) spanning English and Chinese.
General
MMLU
Massive Multitask Language Understanding
Tests LLMs on undergraduate to professional level knowledge across 57 subjects with 15,908 multiple-choice questions. Top models now saturate at 86–90%; MMLU-Pro is the recommended harder successor.
IFEval
Instruction-Following Evaluation
IFEval measures LLM ability to follow verifiable formatting instructions (e.g., output length, keyword inclusion, capitalization) with exact-match checking.
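Because IFEval instructions are verifiable, scoring reduces to programmatic checks. Two hypothetical verifiers sketch the idea; these are not the official implementation:

```python
def check_max_words(response: str, limit: int) -> bool:
    """Verify an 'answer in at most N words' instruction."""
    return len(response.split()) <= limit

def check_keyword(response: str, keyword: str) -> bool:
    """Verify a 'response must include the keyword X' instruction."""
    return keyword.lower() in response.lower()
```

A response either satisfies each check or it does not, so no LLM judge is needed.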
MMLU-Pro
MMLU-Pro
Enhanced MMLU with 10-option questions (vs 4 in the original) and harder, expert-reviewed problems that reduce shortcut exploitation. Consolidates coverage into 14 disciplines with over 12,000 questions.
Holistic
Knowledge
Long context
ZeroSCROLLS/QuALITY
ZeroSCROLLS/QuALITY
ZeroSCROLLS suite's QuALITY subset: multiple-choice QA on long-form documents (articles, books) requiring reading up to 10K tokens for answer derivation.
InfiniteBench/En.MC
InfiniteBench/En.MC
English multiple-choice subset of InfiniteBench designed for 100K+ token context evaluation, requiring retrieval and reasoning over very long documents.
NIH/Multi-needle
NIH/Multi-needle
Multi-needle retrieval benchmark extending the 'needle in a haystack' test to require simultaneous retrieval of multiple facts embedded in long contexts.
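A multi-needle haystack can be built by embedding several fact sentences at controlled positions in filler text; a minimal construction sketch (the filler and needle contents are illustrative):

```python
import random

def build_haystack(filler: list[str], needles: list[str], seed: int = 0) -> str:
    """Insert each needle sentence at a random position in the
    filler sentences, then join into one long context string."""
    rng = random.Random(seed)
    doc = list(filler)
    for needle in needles:
        doc.insert(rng.randrange(len(doc) + 1), needle)
    return " ".join(doc)
```

The model is then asked to recall all embedded facts at once; scoring counts how many needles it retrieves.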
RULER
RULER
Flexible benchmark testing effective context utilization across 13 diverse long-context tasks including retrieval, multi-hop reasoning, and aggregation at configurable lengths.
Mathematics
GSM8K
Grade School Math 8K
8,500 grade school math word problems requiring 2–8 steps of arithmetic reasoning. Widely used for evaluating chain-of-thought reasoning; top models now near 100%.
MATH
Mathematics Aptitude Test of Heuristics
12,500 challenging competition mathematics problems from AMC, AIME, and other high-school math competitions across 7 subjects and 5 difficulty levels.
Multilingual
Multimodal
Reasoning
GPQA
Google-Proof Q&A
PhD-level multiple-choice questions in biology, physics, and chemistry designed so that skilled non-experts score only about 34% even with unrestricted web access, while domain PhDs reach about 65%. The Diamond subset (198 questions) is the hardest variant and the one used in most frontier model evaluations.
DROP
Discrete Reasoning Over Paragraphs
Reading comprehension benchmark requiring numerical reasoning, counting, sorting, and set operations over paragraphs.
ARC
AI2 Reasoning Challenge
Grade-school science multiple-choice questions partitioned into Easy and Challenge (hard) sets. ARC-Challenge is the standard evaluation variant.
HellaSwag
HellaSwag
Commonsense sentence-completion benchmark using adversarially filtered wrong answers. Top LLMs now exceed 95% accuracy.
ANLS*
Average Normalized Levenshtein Similarity*
Extended ANLS metric for evaluating visual question answering on documents, supporting free-form answer evaluation beyond exact match.
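The underlying ANLS score compares a prediction against each gold answer via normalized Levenshtein distance, taking the best match and zeroing out similarities below a threshold (0.5 in the original DocVQA formulation). A minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, golds: list[str], tau: float = 0.5) -> float:
    """ANLS for one question: best normalized similarity over the
    gold answers, clipped to 0 when it falls below threshold tau."""
    best = 0.0
    for gold in golds:
        denom = max(len(prediction), len(gold)) or 1
        sim = 1.0 - levenshtein(prediction.lower(), gold.lower()) / denom
        best = max(best, sim)
    return best if best >= tau else 0.0
```

The threshold makes the metric tolerant of OCR-level typos while still rejecting answers that are mostly wrong.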
TruthfulQA
TruthfulQA
817 questions spanning 38 categories where humans commonly hold misconceptions. Measures whether models give truthful answers or reproduce popular falsehoods.
WinoGrande
WinoGrande
44,000 adversarially filtered Winograd schema problems for commonsense reasoning, created at scale to reduce dataset biases.
ARC-E
AI2 Reasoning Challenge-Easy
The easy partition of ARC with 5,197 grade-school science multiple-choice questions answerable without complex reasoning.
BoolQ
Boolean Questions
15,942 yes/no reading comprehension questions derived from real Google searches paired with Wikipedia passages.
CommonsenseQA
CommonsenseQA
12,102 commonsense reasoning multiple-choice questions crowdsourced using ConceptNet to target implicit world knowledge.
OlympiadBench
8,952 bilingual (Chinese/English) math and physics Olympiad problems from national/international competitions, testing advanced scientific reasoning with precise symbolic answers.
MathVerse
2,612 visual math problems, each rendered in 6 versions (text-dominant, text-lite, text-only, vision-intensive, vision-dominant, vision-only) to isolate how much multimodal mathematical reasoning depends on the diagram versus the text.
ARC Prize / ARC Challenge
Abstraction and Reasoning Corpus (ARC-AGI) with visual grid puzzles testing core reasoning and novel problem-solving. ARC Prize 2024-2025 competitions track annual progress. NOTE: slug contains invalid characters '/' — recommend renaming to 'arc-prize'.
DynaBench
Dynamic adversarial data collection platform generating ever-harder NLU benchmarks through human-and-model-in-the-loop annotation to prevent contamination.
Safety
ToxiGen
ToxiGen
274,000 machine-generated toxic and benign statements about 13 minority groups, using adversarial classifier-in-the-loop decoding for implicit hate speech evaluation.
RealToxicity
RealToxicityPrompts
100,000 naturally occurring web text prompts for measuring the propensity of language models to generate toxic continuations using the Perspective API.
CrowS-Pairs
CrowS-Pairs
1,508 crowdsourced sentence pairs measuring stereotypical bias in masked language models across 9 categories including race, gender, and religion.
BBQ Ambig
BBQ Ambig
Ambiguous context subset of the Bias Benchmark for QA (BBQ), measuring how models respond to social bias questions when context is underspecified.
BBQ Disambig
BBQ Disambig
Disambiguated context subset of BBQ that tests whether additional clarifying context reduces biased model predictions.
Winogender
Winogender
720 Winograd-style schemas testing gender bias in pronoun resolution, checking whether models associate professions with gender stereotypes.
Winobias 1_2
Winobias 1_2
Type 1 WinoBias coreference schemas in which syntactic cues are absent, so resolving the pronoun requires world knowledge; measures occupational gender bias.
Winobias 2_2
Winobias 2_2
Type 2 WinoBias coreference schemas in which syntactic cues disambiguate the pronoun, testing whether models nonetheless default to occupational gender stereotypes.
Tool use
BFCL
BFCL
Berkeley Function Calling Leaderboard (BFCL) evaluating LLM ability to correctly call functions/APIs with proper arguments across diverse domains. Currently on v3.
Nexus
Nexus
Function calling benchmark from NexusFlow evaluating LLM tool use across zero-shot multi-function scenarios with nested function calls.
Understanding
Other
MMMU
Massive Multi-discipline Multimodal Understanding
11,500+ vision-language questions spanning 30 subjects across six core disciplines (art & design, business, science, health & medicine, humanities & social science, tech & engineering). Evaluates college-level multimodal reasoning.
MathVista
MathVista
Mathematical reasoning benchmark combining visual contexts (charts, geometry diagrams, scientific figures) from 28 existing datasets with 6,141 total problems.
MedQA
MedQA
Medical question answering benchmark based on USMLE licensing exams with 12,723 questions across US, Mainland China, and Taiwan medical boards.
OpenBookQA
OpenBookQA
5,957 elementary science multiple-choice questions requiring both open-book facts and broader commonsense knowledge to answer.
PIQA
Physical Interaction: Question Answering
16,000 physical commonsense reasoning questions asking models to choose between two physical solutions for everyday tasks.
SocialIQA
Social Intelligence QA
38,000 multiple-choice questions about everyday social situations requiring inference about emotions, motivations, and social norms.
COPA
Choice of Plausible Alternatives
1,000 causal reasoning questions asking models to choose the more plausible cause or effect for everyday scenarios.
LAMBADA
LAnguage Modeling Broadened to Account for Discourse Aspects
Long-range language modeling benchmark predicting the last word of ~2,600 passages that require understanding broad discourse context.
RACE
ReAding Comprehension Dataset From Examinations
28,000 reading comprehension passages with multiple-choice questions from Chinese middle and high school English exams.
SQuAD
Stanford Question Answering Dataset
100,000+ reading comprehension questions on 500+ Wikipedia articles. SQuAD 2.0 (2018) added unanswerable questions. Both versions remain widely cited.
QuAC
Question Answering in Context
98,407 conversational QA turns from roughly 14,000 Wikipedia dialogues requiring multi-turn reasoning and follow-up question answering.
NaturalQ
Natural Questions
307,373 real Google search queries answered by extracting spans from Wikipedia, with short answer, long answer, and no-answer categories.
MultiNLI
Multi-Genre Natural Language Inference
433,000 hypothesis-premise pairs across 10 genres (fiction, telephone, travel, etc.) for textual entailment classification.
SciQ
SciQ
13,679 multiple-choice science questions across physics, chemistry, biology, and earth science, each paired with a supporting paragraph.
CRASS
Counterfactual Reasoning Assessment
Counterfactual reasoning benchmark testing LLMs on hypothetical scenarios and their logical implications.
ACI-BENCH
Ambient Clinical Intelligence Benchmark
Benchmark for generating clinical notes from doctor-patient conversation transcripts to evaluate ambient clinical intelligence systems.
MS-MARCO
Microsoft MAchine Reading COmprehension Dataset
Large-scale machine reading comprehension and passage ranking benchmark from real Bing search queries, widely used for information retrieval evaluation.
QMSum
Query-based Multi-domain Meeting Summarization
Query-focused meeting summarization benchmark with 1,808 query-summary pairs across parliamentary, product, and academic meeting transcripts.
HHH
Helpfulness, Honesty, Harmlessness
Anthropic's alignment evaluation covering three core dimensions: helpfulness, honesty, and harmlessness, via human preference comparisons between model responses.
RAI
Responsible AI
Microsoft's framework for automated measurement of responsible AI harms in generative AI applications across fairness, reliability, safety, and privacy dimensions.
CodeXGLUE
CodeXGLUE
10 code intelligence tasks across code completion, code translation, code summarization, and code search for multiple programming languages.
LLM Judge
LLM Judge
LLM-as-a-judge evaluation paradigm using strong models (GPT-4) to score responses, enabling scalable assessment of open-ended generation quality.
LLM-Eval
LLM-Eval
Unified multi-dimensional automatic evaluation framework for LLMs assessing helpfulness, engagement, safety, and relevance in a single pass.
JudgeLM
JudgeLM
Fine-tuned language models trained to be scalable judges for open-ended LLM evaluation, achieving high agreement with human preferences.
Prometheus
Prometheus
LLM evaluation framework using rubric-based scoring with reference materials, enabling fine-grained absolute/relative assessment without GPT-4 dependency.
DocVQA
DocVQA
Visual question answering on 12,767 document images (scanned forms, invoices, tables) requiring understanding of text layout and document structure.
C-Eval
Chinese Evaluation Suite
Comprehensive Chinese-language evaluation suite with 13,948 multiple-choice questions across 52 disciplines and 4 difficulty levels (middle school to professional).
CMMLU
Chinese Massive Multitask Language Understanding
67-task Chinese MMLU equivalent covering 11,528 questions across natural science, social science, STEM, and Chinese-specific cultural knowledge.
GAOKAO
GAOKAO
Evaluation benchmark using questions from China's college entrance examination (Gaokao) across math, language arts, science, and social studies.
GAOKAO-MM
GAOKAO-MM
Multimodal extension of GAOKAO with text+image questions from Chinese college entrance exams, evaluating vision-language model capabilities.