How Far Are We from Genuinely Useful Deep Research Agents?
Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, Wangchunshu Zhou
2025-12-02
Summary
This paper examines the challenge of building computer programs, called Deep Research Agents, that can automatically produce in-depth research reports the way a human analyst would. It identifies weaknesses in how these agents are currently tested and proposes new tools to evaluate and improve them.
What's the problem?
Currently, most tests for these research agents only check whether they can answer questions, not whether they can build a complete, well-supported report. Existing benchmarks for report generation rely on overly complex tasks and subjective judgments of quality, making it hard to tell whether the agents are actually useful. There is also little clear understanding of *where* these agents fail when attempting real research.
What's the solution?
The researchers created a new, more detailed benchmark called FINDER, which includes 100 human-curated research tasks broken down into 419 specific checklist items that check whether reports are well structured, analytically thorough, and factually grounded. They then used this benchmark to test existing agents and, drawing on roughly 1,000 generated reports, developed a 'failure taxonomy' called DEFT. DEFT identifies 14 specific ways these agents go wrong during research, covering problems with retrieving information, reasoning over it, and generating the final report.
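To make the checklist idea concrete, here is a minimal sketch of how checklist-based report evaluation could be represented and scored. The class names, the three dimensions, and the fraction-of-items-satisfied scoring rule are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of FINDER-style checklist scoring.
# Field names, dimensions, and the scoring rule are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    description: str      # e.g. "Every quantitative claim is backed by a cited source"
    dimension: str        # e.g. "structure", "analytical depth", or "factual grounding"
    satisfied: bool = False

@dataclass
class ResearchTask:
    prompt: str
    checklist: list[ChecklistItem]

    def score(self) -> float:
        """Fraction of checklist items a generated report satisfies."""
        if not self.checklist:
            return 0.0
        return sum(item.satisfied for item in self.checklist) / len(self.checklist)

# Toy usage: one task with three checklist items, two judged satisfied.
task = ResearchTask(
    prompt="Survey recent approaches to long-horizon web research agents.",
    checklist=[
        ChecklistItem("Has a clearly delineated methodology section", "structure", True),
        ChecklistItem("Compares at least three agent frameworks", "analytical depth", True),
        ChecklistItem("Every quantitative claim cites a source", "factual grounding", False),
    ],
)
print(task.score())  # 0.666...
```

Scoring against explicit checklist items, rather than a single holistic quality rating, is what makes the evaluation less dependent on subjective judgments.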
Why it matters?
This work matters because it provides a more realistic and reliable way to assess the capabilities of these research agents. By pinpointing exactly where they struggle, namely in integrating evidence, verifying facts, and keeping research plans on track, it points developers toward building more effective tools that can genuinely assist with complex research tasks. It moves the field beyond simply answering questions to actually *doing* research.
Abstract
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics -- this fails to reflect user demands and limits the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotating and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
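The abstract notes that DEFT was built with human-LLM co-annotation and inter-annotator reliability validation, but it does not say which agreement metric was used. As a purely illustrative sketch, a standard choice for categorical labels is Cohen's kappa; the code below assumes two annotators assigning top-level DEFT-style category labels to the same failure instances.

```python
# Hypothetical sketch: chance-corrected agreement between two annotators.
# The paper mentions inter-annotator reliability validation without naming a metric;
# Cohen's kappa is assumed here for illustration only.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    return (observed - expected) / (1 - expected)

# Toy example with DEFT-style top-level categories.
annotator_1 = ["retrieval", "reasoning", "generation", "retrieval", "reasoning"]
annotator_2 = ["retrieval", "reasoning", "generation", "reasoning", "reasoning"]
print(cohens_kappa(annotator_1, annotator_2))  # ~0.69 for this toy data
```

Values well above chance-level agreement are what make a taxonomy like DEFT usable as a shared annotation scheme rather than one annotator's opinion.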