LLM Reference

Benchmarks

73 evaluation benchmarks for language models

Categories: Arena, Coding, Composite, General, Knowledge, Long context, Mathematics, Multilingual, Reasoning, Safety, Tool use, Understanding, Other

MMMU

Massive Multi-discipline Multimodal Understanding

Tests college-level subject knowledge and deliberate reasoning on multimodal problems (images, diagrams, charts) spanning many disciplines

MathVista

Tests mathematical reasoning in visual contexts such as charts, diagrams, and figures

MedQA

Multiple-choice questions drawn from medical licensing exams such as the USMLE

OpenBookQA

Open-book science questions that combine core science facts with broad common knowledge

PIQA

Physical Interaction: Question Answering

SocialIQA

Social Interaction QA

COPA

Choice of Plausible Alternatives

LAMBADA

LAnguage Modeling Broadened to Account for Discourse Aspects

RACE

ReAding Comprehension Dataset From Examinations

SQuAD

Stanford Question Answering Dataset

QuAC

Question Answering in Context

NaturalQ

Natural Questions

MultiNLI

Multi-Genre Natural Language Inference

SciQ

Crowdsourced multiple-choice science exam questions

CRASS

Counterfactual Reasoning Assessment

ACI-BENCH

Ambient Clinical Intelligence Benchmark

MS-MARCO

Microsoft MAchine Reading COmprehension Dataset

QMSum

Query-based Multi-domain Meeting Summarization

HHH

Helpfulness, Honesty, Harmlessness

RAI

Responsible AI

CodeXGLUE

A benchmark suite covering code understanding and generation tasks

LLM Judge

Uses a strong LLM as a judge to grade or compare model responses

LLM-Eval

Unified multi-dimensional automatic evaluation of open-domain conversations using LLMs

JudgeLM

Fine-tuned LLM judges for scalable evaluation of open-ended answers

Prometheus

An open-source evaluator language model for fine-grained, rubric-based assessment
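
The LLM Judge, LLM-Eval, JudgeLM, and Prometheus entries above all describe model-graded evaluation. Below is a minimal sketch of pointwise LLM-as-judge scoring in that spirit; the judge model name, the 1-10 scale, and the prompt wording are illustrative assumptions, not the protocol of any one of these benchmarks.

```python
# Minimal pointwise LLM-as-judge sketch. The model name, scale, and prompt
# are placeholders, not the protocol of any specific benchmark above.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
user's question on a scale of 1-10 for helpfulness and correctness.
Reply with the number only.

[Question]
{question}

[Answer]
{answer}"""

def judge(question: str, answer: str, model: str = "gpt-4o") -> int:
    """Ask a strong LLM to grade one response; returns the 1-10 score."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # make grading as deterministic as the API allows
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())

print(judge("What causes ocean tides?",
            "Mainly the Moon's gravity, with a smaller contribution from the Sun."))
```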

DocVQA

Document Visual Question Answering

C-Eval

Chinese Evaluation Suite

C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels.
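
Because C-Eval is a standard four-option multiple-choice suite, scoring reduces to comparing predicted option letters against gold answers. A minimal sketch follows, assuming the Hugging Face mirror ceval/ceval-exam with per-subject configs and columns question, A, B, C, D, and answer (verify against the dataset card before relying on it).

```python
# Hedged sketch: score letter predictions on one C-Eval subject.
# Assumes the Hugging Face mirror "ceval/ceval-exam" with per-subject configs
# and columns question/A/B/C/D/answer; verify against the dataset card.
from datasets import load_dataset

val = load_dataset("ceval/ceval-exam", "computer_network", split="val")

def format_question(row: dict) -> str:
    """Render one item as a standard four-option multiple-choice prompt."""
    return (f"{row['question']}\n"
            f"A. {row['A']}\nB. {row['B']}\nC. {row['C']}\nD. {row['D']}\n"
            "Answer:")

def accuracy(predict) -> float:
    """`predict` maps a formatted prompt to a single letter 'A'-'D'."""
    correct = sum(predict(format_question(row)) == row["answer"] for row in val)
    return correct / len(val)

# Trivial baseline: always answer "A" (~25% expected over four options).
print(accuracy(lambda prompt: "A"))
```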

CMMLU

Chinese Massive Multitask Language Understanding

CMMLU is a comprehensive benchmark designed to assess the knowledge and reasoning abilities of LLMs within the context of Chinese language and culture.

GAOKAO

An evaluation benchmark built from Chinese college entrance examination questions.

GAOKAO-MM

A multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO).