tau-bench | active | Agents | Tool use
τ-bench
Metric: % Task Success (higher is better)
Introduced: 2024
About
Multi-turn tool-use benchmark for agent behavior in realistic retail and airline customer-service tasks, measuring whether models complete tasks through APIs while following policy constraints.
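As a rough illustration of how a "% Task Success" figure of this kind could be computed, the sketch below counts an episode as successful only if the task goal was reached and no policy rule was broken. The `EpisodeResult` record, its field names, and the sample task IDs are hypothetical and are not taken from the benchmark's own harness, whose scoring procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    # Hypothetical record: field names are illustrative, not from the benchmark.
    task_id: str
    goal_reached: bool     # the required end state was achieved through API calls
    policy_violated: bool  # the agent broke a domain policy rule along the way


def task_success_rate(results: list[EpisodeResult]) -> float:
    """Return the percentage of episodes that reached the goal without a policy violation."""
    if not results:
        raise ValueError("no episodes to score")
    successes = sum(1 for r in results if r.goal_reached and not r.policy_violated)
    return 100.0 * successes / len(results)


# Made-up sample data covering the two domains named in the description.
results = [
    EpisodeResult("retail-001", goal_reached=True, policy_violated=False),
    EpisodeResult("retail-002", goal_reached=True, policy_violated=True),
    EpisodeResult("airline-001", goal_reached=False, policy_violated=False),
    EpisodeResult("airline-002", goal_reached=True, policy_violated=False),
]
print(f"{task_success_rate(results):.1f}% task success")  # prints "50.0% task success"
```

Under this assumed definition, the second episode does not count as a success even though the goal was reached, because completing a task while violating policy is treated as a failure.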