LLM Reference
GPQA | active | Reasoning

Graduate-Level Google-Proof Q&A

Metric: Accuracy (higher is better)
Introduced: 2023

About

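PhD-level multiple-choice questions in biology, physics, and chemistry, written by domain experts and designed to be hard to answer by searching the web. In the original paper, experts who hold or are pursuing a PhD in the relevant field reach about 65% accuracy, while highly skilled non-experts with unrestricted web access reach only about 34%. The Diamond subset (198 questions) is the hardest variant and the one used in most frontier model evaluations.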
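To make the metric concrete, here is a minimal sketch of how accuracy on a multiple-choice benchmark like this one could be computed. The function name `accuracy`, the letter labels, and the toy answer lists are illustrative assumptions, not part of any official GPQA evaluation harness. It also assumes four answer options per question, which puts random guessing at 25%.

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Toy data (hypothetical): four questions, one wrong prediction.
answers = ["A", "C", "B", "D"]
predictions = ["A", "C", "D", "D"]
score = accuracy(predictions, answers)

# Assumption: four options per question, so uniform guessing gives 1/4.
chance_baseline = 1 / 4

# On the 198-question Diamond subset, one question moves the score
# by 100 / 198 percentage points (roughly half a point).
diamond_step_pct = 100 / 198

print(f"accuracy: {score:.2%}, chance: {chance_baseline:.0%}")
print(f"one Diamond question = {diamond_step_pct:.2f} percentage points")
```

Because Diamond has only 198 questions, small differences between reported scores can correspond to just one or two questions, which is worth keeping in mind when comparing models.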