How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench
Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral
2025-09-02
Summary
This paper investigates how to make AI agents, powered by large language models, better at completing tasks that require multiple steps and the use of different tools during a conversation with a user.
What's the problem?
Currently, these AI agents struggle to reason consistently throughout a long conversation, to follow specific rules for how they should act, and to accurately remember and use information gathered over many interactions with tools. Essentially, they get confused and make mistakes when tasks become complex and require a lot of back-and-forth.
What's the solution?
The researchers carefully analyzed the types of errors these agents make during conversations. They then experimented with ways of rewording the user's requests to the agent to give it extra guidance. Finally, they created a new system called IRMA, which automatically rewrites user questions, adding relevant rules and suggesting helpful tools, so the agent stays focused and makes better decisions.
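To make the idea concrete, here is a minimal sketch of this kind of input reformulation. All names (`retrieve_rules`, `suggest_tools`, `reformulate`) and the keyword-matching retrieval are hypothetical illustrations, not the paper's actual prompts or retrieval logic:

```python
def retrieve_rules(query, rulebook):
    """Pick domain rules whose trigger keyword appears in the query (toy keyword match)."""
    return [rule for kw, rule in rulebook.items() if kw in query.lower()]

def suggest_tools(query, tool_index):
    """Pick tools whose trigger keyword appears in the query (toy keyword match)."""
    return [tool for kw, tool in tool_index.items() if kw in query.lower()]

def reformulate(query, rulebook, tool_index):
    """Augment the raw user query with relevant rules and tool hints
    before it reaches the tool-calling agent."""
    rules = retrieve_rules(query, rulebook)
    tools = suggest_tools(query, tool_index)
    parts = [f"User query: {query}"]
    if rules:
        parts.append("Relevant domain rules:\n" + "\n".join(f"- {r}" for r in rules))
    if tools:
        parts.append("Suggested tools: " + ", ".join(tools))
    return "\n\n".join(parts)

# Toy domain knowledge for an airline/retail-style environment.
rulebook = {"refund": "Refunds require an order ID and manager approval over $100."}
tool_index = {"refund": "process_refund", "order": "lookup_order"}
print(reformulate("I want a refund for my order", rulebook, tool_index))
```

The reformulated string, rather than the raw query, is what the tool-calling agent would receive, keeping the policy and tool context in front of it at every turn.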
Why it matters?
IRMA performs significantly better than existing methods, meaning it's more reliable and consistent at completing complex tasks in dynamic environments. This is a big step towards creating AI agents that can truly assist us with complicated, real-world problems that require ongoing interaction and tool use.
Abstract
Recent advances in the reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like tau-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extraction of correct information over a long horizon of tool calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of inputs to the tool-calling agent to improve agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries augmented with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
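The pass^5 metric counts a task as solved only if the agent succeeds on all five i.i.d. trials, which is why it rewards consistency. A minimal sketch of one common unbiased estimator for such an all-trials-succeed metric, C(c, k) / C(n, k) given c successes out of n trials (assumed here; mirroring the combinatorial pass@k estimator):

```python
from math import comb

def pass_hat_k(n, c, k):
    """Unbiased estimate of the probability that k i.i.d. trials all succeed,
    given c observed successes out of n trials: C(c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k cannot exceed the number of trials n")
    return comb(c, k) / comb(n, k)

def aggregate_pass_hat_k(results, k):
    """Average pass^k across tasks; `results` maps task id -> (n, c)."""
    return sum(pass_hat_k(n, c, k) for n, c in results.values()) / len(results)

# A single failed trial out of five drives pass^5 for that task to zero,
# while a 4/5 success rate would still look strong under pass@1.
print(pass_hat_k(5, 4, 5))  # -> 0.0
print(pass_hat_k(5, 5, 5))  # -> 1.0
```

This illustrates why a method that is merely often right scores poorly on pass^5, while a consistent one like IRMA is rewarded.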