OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, Dong Yu

2024-10-30

Summary

This paper introduces OpenWebVoyager, an open-source framework for building multimodal web agents that explore the real web on their own and learn from that experience to improve their performance over time.

What's the problem?

Most existing open-source web agents are text-only and are trained in synthetic, controlled environments where the rules are simple and the reward signals are clearly defined. The real web is different: an agent must interpret both page text and screenshots, and there is no ground-truth reward to say whether a task succeeded. Agents built for controlled settings therefore struggle to learn effectively and to generalize to these more diverse, realistic scenarios.

What's the solution?

To solve this problem, the authors first train a base agent with imitation learning so it acquires basic web-navigation skills. The agent then explores the open web on its own, and a separate general-purpose model judges which of the resulting trajectories actually completed their tasks. The agent is fine-tuned on these well-performing trajectories, and the whole exploration-feedback-optimization cycle repeats for several iterations, letting the agent keep improving from real-world experience. A rough sketch of this loop appears below.
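To make the cycle concrete, here is a minimal Python sketch of one plausible way to structure it. Everything in it (the Trajectory type and the run_task, judge, and finetune callables) is a hypothetical placeholder, not the paper's actual code or API.

```python
# A minimal sketch of the exploration-feedback-optimization loop, assuming
# hypothetical helpers throughout; nothing here is the paper's real code.
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One interaction step: what the agent observed (page screenshot plus
# accessibility-tree text) and the browser action it chose in response.
Observation = Tuple[bytes, str]
Action = str


@dataclass
class Trajectory:
    """A task attempt: the instruction and the full sequence of steps."""
    task: str
    steps: List[Tuple[Observation, Action]]


def improve_agent(
    agent,                                          # multimodal policy, pre-trained via imitation learning
    run_task: Callable[[object, str], Trajectory],  # rolls the agent out on one web task
    judge: Callable[[Trajectory], bool],            # general-purpose model deciding success
    finetune: Callable[[object, List[Trajectory]], object],  # supervised fine-tuning step
    tasks: List[str],
    num_iterations: int = 3,
):
    """Run several exploration-feedback-optimization cycles."""
    for _ in range(num_iterations):
        # 1. Exploration: attempt real web tasks with the current policy.
        trajectories = [run_task(agent, task) for task in tasks]

        # 2. Feedback: an external judge labels which rollouts succeeded,
        #    standing in for the ground-truth rewards the open web lacks.
        good = [t for t in trajectories if judge(t)]

        # 3. Optimization: fine-tune the policy to imitate its own
        #    well-performing trajectories, then repeat with the new agent.
        agent = finetune(agent, good)
    return agent
```

The key design choice is that a judge model, rather than a hand-coded reward function, decides which rollouts are good enough to imitate, which is what lets the loop run on the open web where no ground-truth reward exists.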

Why it matters?

This research is significant because it demonstrates that autonomous web agents can learn and adapt in real-world environments rather than only in simulators, making them more useful for tasks like web navigation and information retrieval. Because the framework is open source and the agent improves with each iteration, OpenWebVoyager could lead to more capable AI systems that assist users in navigating complex online spaces.

Abstract

The rapid development of large language and multimodal models has sparked significant interest in using proprietary models, such as GPT-4o, to develop autonomous agents capable of handling real-world scenarios like web navigation. Although recent open-source efforts have tried to equip agents with the ability to explore environments and continuously improve over time, they are building text-only agents in synthetic environments where the reward signals are clearly defined. Such agents struggle to generalize to realistic settings that require multimodal perception abilities and lack ground-truth signals. In this paper, we introduce an open-source framework designed to facilitate the development of a multimodal web agent that can autonomously conduct real-world exploration and improve itself. We first train the base model with imitation learning to gain basic abilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, it further improves its policy by learning from well-performing trajectories judged by another general-purpose model. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent successfully improves itself after each iteration, demonstrating strong performance across multiple test sets.
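The abstract's feedback step, where "another general-purpose model" judges trajectories, can be approximated with an off-the-shelf vision-capable model. The sketch below uses the OpenAI Python SDK and GPT-4o as one plausible judge; the prompt, the success criterion, and the judge_trajectory helper are illustrative assumptions, not the paper's actual setup.

```python
# A hedged sketch of the feedback step: asking a vision-capable model whether
# a finished trajectory completed its task. GPT-4o and this prompt are
# illustrative assumptions; the paper's actual judging setup may differ.
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_trajectory(task: str, action_log: str, final_screenshot: bytes) -> bool:
    """Return True if the judge model says the task was completed."""
    image_b64 = base64.b64encode(final_screenshot).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        f"Task: {task}\n"
                        f"Actions taken by the agent:\n{action_log}\n\n"
                        "Based on the actions and the final page shown below, "
                        "was the task completed successfully? Answer YES or NO."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }],
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("YES")
```

Only trajectories that pass a check like this would be added to the next round's fine-tuning set, which is what makes the cycle self-improving without any hand-crafted reward.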