BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
2025-08-12
Summary
This paper introduces BrowseComp-Plus, a benchmark for fairly judging how well deep research agents, AI systems that combine a language model with search tools, can find and explain complex information using a fixed set of web documents.
What's the problem?
Earlier benchmarks relied on searching the live internet, which changes constantly and is hard to reproduce. This made it difficult to compare AI agents fairly, or to tell whether their search ability or their language ability contributed more to their results.
What's the solution?
The paper builds BrowseComp-Plus, which pairs challenging questions with a carefully curated, unchanging collection of web pages containing verified supporting documents. This controlled setting lets researchers compare agents fairly and reproducibly, and lets them separate an agent's searching ability from its language understanding; a minimal sketch of that separation follows below.
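As a rough illustration of why a fixed corpus makes this separation possible, here is a minimal sketch with hypothetical interfaces (the function and class names are assumptions for illustration, not the paper's actual evaluation code):

```python
# Minimal sketch, assuming hypothetical interfaces; not the paper's actual code.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Query:
    question: str
    answer: str  # verified gold answer
    evidence_ids: List[str] = field(default_factory=list)  # supporting docs in the fixed corpus

def make_agent(retrieve: Callable[[str, int], List[str]],
               generate: Callable[[str, List[str]], str]) -> Callable[[str], str]:
    """Compose a retriever over the fixed corpus with a language model."""
    def agent(question: str) -> str:
        docs = retrieve(question, 5)     # top-5 documents from the fixed corpus
        return generate(question, docs)  # answer grounded in those documents
    return agent

def accuracy(agent: Callable[[str], str], queries: List[Query]) -> float:
    """Score an agent's answers against the verified gold answers."""
    correct = sum(
        agent(q.question).strip().lower() == q.answer.strip().lower()
        for q in queries
    )
    return correct / len(queries)
```

Because the document collection never changes, the same harness can compare one language model paired with different retrievers (for example, a keyword retriever versus a dense retriever), or different language models paired with the same retriever, so each component's contribution can be measured on its own.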
Why it matters?
A fair, reproducible way to test AI research agents helps scientists understand where these systems do well and where they fall short, which in turn guides the development of AI that can more reliably help people find and make sense of complex information.
Abstract
BrowseComp-Plus is a curated benchmark that enables controlled, reproducible evaluation of deep research agents and retrieval methods, providing insights into their strengths and limitations.