Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents

Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, Ahmed Awadallah

2025-02-18

Summary

This paper introduces Explorer, a system that synthesizes a large dataset of web browsing examples to train AI agents that can navigate websites and complete tasks the way humans do.

What's the problem?

Current AI web agents aren't as good as humans at navigating real websites because they don't have enough diverse, real-world examples to learn from. Creating these examples manually is expensive and time-consuming.

What's the solution?

The researchers developed Explorer, a scalable pipeline that automatically generates a large and varied set of web browsing examples: over 94,000 successful web navigation trajectories spanning 49,000 unique URLs, along with 720,000 screenshots and 33 million web elements. They then used this data to train a multimodal web agent, also called Explorer, which performed well on several benchmarks of web navigation skill.

Why it matters?

This matters because it could lead to better AI assistants that help people use websites more effectively. By making it cheaper and easier to build large trajectory datasets, the approach lowers the barrier for researchers training web agents, which could speed up the development of more capable AI systems that understand and interact with the web like humans do.

Abstract

Recent success in large multimodal models (LMMs) has sparked promising applications of agents capable of autonomously completing complex web tasks. While open-source LMM agents have made significant advances in offline evaluation benchmarks, their performance still falls substantially short of human-level capabilities in more realistic online settings. A key bottleneck is the lack of diverse and large-scale trajectory-level datasets across various domains, which are expensive to collect. In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date, containing over 94K successful multimodal web trajectories, spanning 49K unique URLs, 720K screenshots, and 33M web elements. In particular, we leverage extensive web exploration and refinement to obtain diverse task intents. The average cost is 28 cents per successful trajectory, making it affordable to a wide range of users in the community. Leveraging this dataset, we train Explorer, a multimodal web agent, and demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++. Additionally, our experiments highlight data scaling as a key driver for improving web agent capabilities. We hope this study makes state-of-the-art LMM-based agent research at a larger scale more accessible.
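A quick back-of-the-envelope check of the numbers quoted in the abstract: at roughly 28 cents per successful trajectory, the full 94K-trajectory dataset would cost on the order of $26K to synthesize. This sketch just reproduces that arithmetic from the abstract's figures; the exact totals are approximations since the paper says "over 94K" and gives an average cost.

```python
# Back-of-the-envelope dataset cost from the abstract's figures
# (approximate: "over 94K" trajectories, average $0.28 each).
num_trajectories = 94_000      # "over 94K successful multimodal web trajectories"
cost_per_trajectory = 0.28     # "average cost is 28 cents per successful trajectory"

total_cost = num_trajectories * cost_per_trajectory
cost_per_screenshot = total_cost / 720_000  # 720K screenshots in the dataset

print(f"Approximate total synthesis cost: ${total_cost:,.0f}")
print(f"Approximate cost per screenshot: ${cost_per_screenshot:.4f}")
```

This puts the scale of the "affordable to a wide range of users" claim in perspective: tens of thousands of dollars for the full dataset, but fractions of a cent per screenshot, and a proportionally small cost for anyone synthesizing a smaller domain-specific subset with the same recipe.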