Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, Jinyoung Yeo

2024-10-21

Summary

This paper introduces World-model-augmented (WMA) web agents, an approach that helps AI agents make better decisions while navigating the web by simulating the outcomes of their actions before taking them.

What's the problem?

Current AI agents, especially those built on large language models (LLMs), often struggle with long-horizon web tasks, making mistakes like repeatedly buying a non-refundable flight ticket. This happens because these models lack an awareness of the potential consequences of their actions, which humans naturally consider. Without that awareness, they can easily commit irreversible errors.

What's the solution?

To solve this problem, the authors developed a new type of web agent called a World-model-augmented (WMA) agent. This agent simulates the results of candidate actions before taking them, which helps it choose better ones. Because full web-page observations are long and contain many repeated elements, they also introduced transition-focused observation abstraction: instead of predicting the entire next page, the world model predicts a free-form natural-language description of only the important differences between states. On the WebArena and Mind2Web benchmarks, these world models improved the agents' action selection without any additional training of the policy model, while being more cost- and time-efficient than recent tree-search-based agents.
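The decision loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy world model, and the hand-written value function below are all assumptions made for clarity.

```python
def simulate_transition(world_model, observation, action):
    """Ask the world model to describe the next state. Following the paper's
    transition-focused observation abstraction, the prediction is a short
    natural-language description of the important state differences, not
    the full HTML of the next page."""
    return world_model(observation, action)

def select_action(world_model, value_fn, observation, candidate_actions):
    """Score each candidate action by the predicted outcome of taking it,
    then pick the highest-value one; the policy model itself is not retrained."""
    best_action, best_value = None, float("-inf")
    for action in candidate_actions:
        predicted_change = simulate_transition(world_model, observation, action)
        value = value_fn(observation, action, predicted_change)
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Toy stand-ins for the LLM-based world model and value function (hypothetical):
def toy_world_model(obs, action):
    outcomes = {
        "click_buy": "Non-refundable ticket purchased; money deducted.",
        "click_details": "Fare rules page opens, showing the refund policy.",
    }
    return outcomes.get(action, "No visible change.")

def toy_value_fn(obs, action, predicted_change):
    # Penalize irreversible outcomes, reward information gathering.
    if "Non-refundable" in predicted_change:
        return -1.0
    if "refund policy" in predicted_change:
        return 1.0
    return 0.0

chosen = select_action(toy_world_model, toy_value_fn,
                       "flight search results page",
                       ["click_buy", "click_details", "scroll"])
print(chosen)  # -> click_details
```

Note how the simulation step lets the agent foresee (and avoid) the irreversible "buy" action before it is ever executed, which is exactly the behavior the paper's motivating example is about.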

Why it matters?

This research is important because it makes AI web agents more capable of avoiding costly, irreversible mistakes. By incorporating world models into their decision-making, these agents can reason about consequences more like humans do, leading to better performance on tasks such as online shopping, information gathering, and other long-horizon web navigation activities.

Abstract

Large language models (LLMs) have recently gained much attention in building autonomous agents. However, the performance of current LLM-based web agents in long-horizon tasks is far from optimal, often yielding errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such an irreversible mistake, as we have an awareness of the potential outcomes (e.g., losing money) of our actions, also known as the "world model". Motivated by this, our study first starts with preliminary analyses, confirming the absence of world models in current LLMs (e.g., GPT-4o, Claude-3.5-Sonnet, etc.). Then, we present a World-model-augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. To overcome the challenges in training LLMs as world models predicting next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, where the prediction objectives are free-form natural language descriptions exclusively highlighting important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training and demonstrate our agents' cost- and time-efficiency compared to recent tree-search-based agents.