WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
2025-09-17
Summary
This paper focuses on improving how well large language models (LLMs) can find information online, particularly on highly complex tasks where there is substantial uncertainty about where to look.
What's the problem?
Current open-source LLMs struggle with complex information-seeking tasks, such as researching a difficult topic on the internet. Unlike proprietary, more advanced systems that can perform at a 'superhuman' level on these tasks, they do not handle uncertainty well when confronted with a vast amount of information. The core issue is that these models lack a systematic way to narrow down possibilities and to cope with not knowing where to start.
What's the solution?
The researchers developed a new post-training method called WebSailor. It creates challenging tasks with intentionally obscured information, forcing the LLM to learn to operate under uncertainty. Training then uses a technique called Duplicating Sampling Policy Optimization (DUPO), a reinforcement learning algorithm that lets the model learn efficiently through trial and error while acting as an agent exploring the web. The whole process is designed to teach the model to strategically reduce uncertainty as it searches.
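The duplicating idea in DUPO can be illustrated with a minimal sketch: rollout groups whose rewards are all identical carry no learning signal under group-relative RL, so they are dropped and the informative groups are duplicated to refill the batch instead of generating costly new rollouts. This is our illustrative reconstruction; the function and variable names are ours, not from the paper.

```python
import random

def dupo_fill_batch(groups, batch_size, seed=0):
    """Hypothetical sketch of DUPO-style batch construction.

    Each element of `groups` is a list of per-rollout rewards for one task.
    Groups whose rollouts all share the same reward (all success or all
    failure) provide zero advantage signal, so they are filtered out;
    informative groups are duplicated to fill the batch.
    """
    rng = random.Random(seed)
    # Keep only groups with reward variance (a mix of outcomes).
    informative = [g for g in groups if len(set(g)) > 1]
    if not informative:
        return []
    batch = list(informative)
    # Duplicate informative groups until the batch is full,
    # avoiding the cost of rolling out fresh trajectories.
    while len(batch) < batch_size:
        batch.append(rng.choice(informative))
    return batch[:batch_size]

# Example: the all-zero group is dropped, the rest are duplicated.
groups = [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
batch = dupo_fill_batch(groups, batch_size=4)
```

The practical point of the duplication step is throughput: agentic rollouts (multi-step web browsing) are expensive, so reusing informative groups keeps every gradient update full-sized without extra environment interaction.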
Why it matters?
This work is important because it significantly closes the performance gap between open-source and proprietary LLMs in complex online research. By giving open-source models the ability to handle uncertainty effectively, it makes powerful information-seeking capabilities more widely available and allows anyone to build systems that can tackle genuinely difficult research questions.
Abstract
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.