Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su

2024-11-21

Summary

This paper introduces a new approach called WebDreamer, which enhances language agents by using large language models (LLMs) to plan and simulate actions when navigating the web, making them more effective at completing tasks.

What's the problem?

Current language agents can automate tasks on the web, but they mostly react to the page in front of them rather than plan ahead. As a result, they still perform worse than humans, especially on tasks that require thinking several steps ahead. Planning methods such as tree search could help, but running them directly on live websites is risky and impractical because some actions, such as confirming a purchase, cannot be undone.

What's the solution?

WebDreamer addresses these problems by using LLMs as 'world models' that simulate what would happen if the agent took a given action on a website. Instead of trying each action on the live site, WebDreamer has the LLM describe the likely outcome of each candidate action in natural language (for example, "what would happen if I click this button?"). It then evaluates these imagined outcomes and picks the best action before anything is actually executed. On two web agent benchmarks with online interaction, VisualWebArena and Mind2Web-live, the paper reports substantial improvements over reactive baselines, while avoiding the risk of exploring irreversible actions on real sites.
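The simulate-then-evaluate step described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the paper's implementation: the names `plan_step`, `simulate`, and `score` are hypothetical, and the toy functions stand in for what would be LLM calls in a real agent.

```python
from typing import Callable, Sequence

# Hypothetical signatures: in a real agent both would wrap LLM calls.
Simulate = Callable[[str, str], str]   # (page description, action) -> imagined outcome
Score = Callable[[str, str], float]    # (task, imagined outcome) -> estimated progress


def plan_step(task: str, page: str, candidates: Sequence[str],
              simulate: Simulate, score: Score) -> str:
    """Return the candidate action whose imagined outcome scores highest.

    Nothing is executed on the website here; only the returned action
    would later be sent to the browser.
    """
    if not candidates:
        raise ValueError("need at least one candidate action")
    best_action, best_value = candidates[0], float("-inf")
    for action in candidates:
        imagined = simulate(page, action)   # "what would happen if ...?"
        value = score(task, imagined)       # how much does that help the task?
        if value > best_value:
            best_action, best_value = action, value
    return best_action


# Toy stand-ins so the sketch runs without an LLM (illustrative only).
def toy_simulate(page: str, action: str) -> str:
    outcomes = {
        "click 'Add to cart'": "The cart now contains the item.",
        "click 'Home'": "The home page with featured products is shown.",
    }
    return outcomes.get(action, "Nothing noticeable changes.")


def toy_score(task: str, imagined: str) -> float:
    return 1.0 if "cart" in imagined.lower() else 0.0


chosen = plan_step(
    task="Put the blue mug in the cart",
    page="Product page for a blue mug",
    candidates=["click 'Home'", "click 'Add to cart'", "scroll down"],
    simulate=toy_simulate,
    score=toy_score,
)
print(chosen)  # click 'Add to cart'
```

Passing the simulator and scorer in as plain callables keeps the planning step separate from whichever model produces the imagined outcomes, so either piece can be swapped without touching the loop.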

Why does it matter?

This research is important because it represents a shift in how we can use AI to interact with complex online environments. By enabling agents to plan their actions more effectively, WebDreamer could lead to smarter web automation tools that work better in real-world applications, such as booking travel or shopping online.

Abstract

Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still underperform largely compared to humans. While incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, implementing tree search directly on live websites poses significant safety risks and practical constraints due to irreversible actions such as confirming a purchase. In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the innovative use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate outcomes for each candidate action (e.g., "what would happen if I click this button?") using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction -- VisualWebArena and Mind2Web-live -- demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.