LLM Reference
tau-bench | active | Agents | Tool use

τ-bench

Metric: % Task Success (higher is better)
Introduced: 2024

About

Multi-turn tool-use benchmark for agent behavior in realistic retail and airline customer-service tasks, measuring whether models complete tasks through APIs while following policy constraints.
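
As a rough illustration of the headline metric, the sketch below computes a task success percentage from per-task pass/fail outcomes. The function name `task_success_rate` and the sample outcomes are hypothetical and are not taken from the benchmark's own code. How a single task is judged successful is not covered here, so each outcome is simply assumed to be a boolean.

```python
from typing import Sequence


def task_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks marked successful (hypothetical helper)."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up outcomes for illustration only: True is assumed to mean the agent
# completed the task through the APIs without violating policy.
results = [True, False, True, True]
print(f"{task_success_rate(results):.1f}% task success")
```

Higher values are better, matching the metric direction stated above.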