DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking
Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
2025-10-23
Summary
This paper introduces DeepWideSearch, a new benchmark for AI search agents, motivated by the observation that current agents struggle with tasks requiring both detailed investigation of individual leads and broad coverage of many different information sources at once.
What's the problem?
Existing search agents are good at either digging deep into a specific topic by following links and reasoning through information, or quickly scanning a wide range of sources, but they cannot reliably do both at once. This is a serious limitation for real-world tasks like market research, where understanding complex trends means combining information from many places.
What's the solution?
The researchers created DeepWideSearch, a set of 220 challenging questions across 15 different areas, designed to test an agent's ability to handle both 'deep' reasoning and 'wide' information gathering. They then tested several existing AI agents on this benchmark and analyzed where they failed.
Why it matters?
The results showed that even the best AI agents performed poorly on DeepWideSearch, revealing a significant gap in current AI capabilities. By releasing this benchmark, the researchers hope to encourage the development of more powerful and versatile search agents that can tackle complex, real-world information-seeking problems.
Abstract
Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data items, each requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only a 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.