Scaling Autonomous Agents via Automatic Reward Modeling And Planning
Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan
2025-02-19
Summary
This paper introduces a new way to make AI language models better at solving complex problems that require multiple steps and interaction with the environment. It's like teaching a smart computer to learn from its own experiences without needing constant human guidance.
What's the problem?
While AI language models are great at tasks like writing and answering questions, they struggle with more complex problems that need step-by-step decision-making, like online shopping or solving math problems. It's hard to collect enough data to teach them these skills, and many powerful AI models are too expensive or complicated to customize for specific tasks.
What's the solution?
The researchers created a system where one AI explores an environment randomly, like a curious child trying different things. Another AI watches and figures out what works and what doesn't, creating examples of good and bad decisions. These examples are then used to train a 'reward model' that can score different actions. This reward model helps guide the AI in making better decisions when faced with complex tasks.
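The pipeline described above can be sketched in a few lines of Python. This is a minimal, illustrative mock-up: the function names, the random-action stand-in for the exploring agent, and the truncated-trajectory stand-in for the synthesized "bad" response are all assumptions for demonstration, not the paper's actual implementation (a real system would call two LLMs here).

```python
import random

def explorer_agent(env_actions, steps=3):
    """Stand-in for the first LLM agent: navigate the environment
    randomly, producing a diverse action trajectory."""
    return [random.choice(env_actions) for _ in range(steps)]

def labeler_llm(trajectory):
    """Stand-in for the second LLM: assign a task intent to the
    observed trajectory and synthesize a flawed (negative) response
    alongside the correct (positive) one."""
    intent = f"accomplish the goal implied by starting with {trajectory[0]!r}"
    positive = trajectory
    negative = trajectory[:-1]  # e.g. an incomplete attempt at the same task
    return (intent, positive, negative)

def build_triplets(env_actions, n=5):
    """Collect (intent, positive, negative) triplets, the training
    data for the reward model."""
    return [labeler_llm(explorer_agent(env_actions)) for _ in range(n)]

triplets = build_triplets(["search", "click", "add_to_cart", "checkout"])
```

The key design idea this sketch captures is that no human labels appear anywhere: the exploration produces the positives, and the second model manufactures both the task descriptions and the negatives.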
Why it matters?
This matters because it could make AI much more capable of handling real-world problems that require multiple steps and interaction with the environment. By teaching AI to learn from its own experiences, we can create smarter systems without needing constant human input. This could lead to AI that's better at tasks like online shopping, scientific research, or even solving complex math problems, potentially making these AIs much more useful in our daily lives and in various industries.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text data, collecting large-scale decision-making data is challenging. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address LLM agents' limitations, we propose a framework that can automatically learn a reward model from the environment without human annotations. This model can be used to evaluate the action trajectories of LLM agents and provide heuristics for task planning. Specifically, our approach involves employing one LLM-based agent to navigate an environment randomly, generating diverse action trajectories. Subsequently, a separate LLM is leveraged to assign a task intent and synthesize a negative response alongside the correct response for each trajectory. These triplets (task intent, positive response, and negative response) are then utilized as training data to optimize a reward model capable of scoring action trajectories. The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In conclusion, our proposed framework represents a significant advancement in enhancing LLM agents' decision-making capabilities. By automating the learning of reward models, we overcome the challenges of data scarcity and API limitations, potentially revolutionizing the application of LLMs in complex and interactive environments. This research paves the way for more sophisticated AI agents capable of tackling a wide range of real-world problems requiring multi-step decision-making.
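The abstract says the triplets are "utilized as training data to optimize a reward model capable of scoring action trajectories" but does not specify the objective. A standard choice for training a scorer from positive/negative pairs is a Bradley-Terry style pairwise loss, sketched below; treat this as one plausible instantiation, not the paper's confirmed loss function.

```python
import math

def pairwise_loss(score_pos, score_neg):
    """Bradley-Terry style objective: -log sigmoid(r_pos - r_neg).
    Minimizing it pushes the reward model to score the positive
    (correct) trajectory above the synthesized negative one."""
    margin = score_pos - score_neg
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward model separates the pair more widely.
loss_small_margin = pairwise_loss(0.1, 0.0)
loss_large_margin = pairwise_loss(2.0, 0.0)
```

At inference time the trained scorer can then rank candidate action sequences, which is how the abstract's "heuristics for task planning" would be supplied.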