BBH · active · Composite
BIG-Bench Hard
Metric: Accuracy (higher is better)
Introduced: 2022
About
A suite of 23 challenging tasks from BIG-Bench on which even large language models scored below human performance when prompted without chain-of-thought. The tasks include logical deduction, causal judgment, and formal fallacies.
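Because the benchmark is a composite of many tasks scored by accuracy, a reported figure depends on how per-task results are combined. The sketch below is one plausible way to compute it, not the official scoring script: it assumes exact-match comparison after light normalization and an unweighted (macro) average across tasks. The helper names, the normalization rule, the task labels, and the toy records are all hypothetical and for illustration only.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Strip whitespace and a trailing period, then lowercase (assumed rule)."""
    return answer.strip().rstrip(".").strip().lower()


def accuracy_by_task(records):
    """Return {task: accuracy} for an iterable of (task, prediction, target)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, target in records:
        total[task] += 1
        if normalize(prediction) == normalize(target):
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def macro_average(per_task):
    """Unweighted mean of per-task accuracies (assumed aggregation)."""
    return sum(per_task.values()) / len(per_task)


# Hypothetical toy records, not real benchmark data.
records = [
    ("logical_deduction", "(A)", "(A)"),
    ("logical_deduction", "(B)", "(C)"),
    ("causal_judgment", "Yes", "yes"),
    ("formal_fallacies", "invalid.", "valid"),
]

per_task = accuracy_by_task(records)
overall = macro_average(per_task)
print(per_task)
print(overall)
```

A macro average gives every task equal weight regardless of how many examples it contains; averaging over all examples instead (a micro average) can yield a different number, so it is worth checking which convention a given leaderboard uses.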