
τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan

2024-06-21


Summary

This paper introduces tau-bench, a new benchmark that evaluates how well language agents can interact with simulated human users and follow domain-specific rules in realistic scenarios.

What's the problem?

Most existing benchmarks for language agents test language understanding or isolated task completion, but they rarely test how well agents handle back-and-forth interaction with real users or comply with the domain-specific rules that matter in practical applications. Without such tests, agents can look capable in evaluation yet prove unreliable when deployed in real-world settings.

What's the solution?

The researchers created tau-bench, which simulates dynamic conversations between a user (played by a language model) and a language agent that has access to domain-specific API tools and policy guidelines for the task at hand. Instead of grading the conversation itself, the evaluation compares the final state of the task database after the conversation to an annotated goal state. They also introduce a new metric, pass^k, which measures reliability as the chance that the agent succeeds on all k independent trials of the same task. Together, these choices allow a more realistic assessment of how well an agent follows rules and interacts with users.
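To make the pass^k idea concrete, here is a minimal Python sketch (not the authors' code) that assumes the standard combinatorial estimator analogous to pass@k: if an agent succeeds on c of n observed trials of a task, the chance that k randomly chosen trials all succeed is C(c, k) / C(n, k).

```python
from math import comb

def pass_hat_k(n_trials: int, n_successes: int, k: int) -> float:
    """Estimate pass^k for one task: the probability that k independent
    trials are all successful, given n_successes out of n_trials observed."""
    if not 0 < k <= n_trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(n_successes, k) / comb(n_trials, k)

# An agent that solves a task on 6 of 8 attempts looks decent at k = 1,
# but much weaker once all of several trials must succeed.
print(pass_hat_k(8, 6, 1))  # 0.75
print(pass_hat_k(8, 6, 4))  # ~0.21
```

Averaging this per-task estimate over all tasks gives a benchmark-level pass^k; the steep drop as k grows is what the paper flags as inconsistency.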

Why it matters?

This research matters because it addresses a gap in how language agents are evaluated: whether they can communicate effectively with users and follow rules in real-world applications. By improving how we assess these agents, tau-bench can help drive the development of more reliable AI systems for tasks like customer service, where understanding user requests and adhering to guidelines is essential.

Abstract

Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose tau-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.
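As a toy illustration of the outcome-based check the abstract describes, the sketch below assumes each domain's writable state can be snapshotted as a plain dictionary; the benchmark itself operates over structured domain databases, and the example order is hypothetical.

```python
def task_reward(final_db: dict, goal_db: dict) -> int:
    """Per the abstract's description, a trial counts as a success when the
    database state left behind by the agent matches the annotated goal state."""
    return int(final_db == goal_db)

# Hypothetical retail example: the goal state expects order "W123" cancelled.
goal = {"orders": {"W123": {"status": "cancelled"}}}
after_agent = {"orders": {"W123": {"status": "cancelled"}}}
print(task_reward(after_agent, goal))  # 1 -> this trial counts as a success
```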