DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu

2026-04-17

Summary

This paper introduces a new way to test 'Deep Research Agents,' which are AI systems designed to perform complex research tasks like writing reports. The core problem is that it has been hard to reliably evaluate how well these agents actually perform, because the internet is constantly changing and research questions can be open to interpretation.

What's the problem?

Evaluating these AI research agents is genuinely difficult. The real-world web is messy and dynamic, so tests can break or give different results over time. Research tasks are also not always clearly defined, making it hard to know whether an agent's findings are truly 'correct.' Existing methods struggle to measure an agent's abilities consistently and fairly in a realistic setting.

What's the solution?

The researchers created a benchmark called DR^{3}-Eval. It uses authentic research materials provided by real users, but places them in a controlled 'sandbox' environment that mimics the open web without depending on the live internet, which makes the tests repeatable and verifiable. They also developed a detailed scoring system that measures whether the agent found the right information, whether its claims are accurate, how well it cites sources and follows instructions, and the overall quality of the generated report. They also built and tested their own AI agent, DR^{3}-Agent, using this benchmark.
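To make the multi-dimensional scoring concrete, here is a minimal sketch of how per-dimension scores might be combined into one overall result. The five dimension names come from the paper's abstract, but the equal weighting, score ranges, and aggregation function are illustrative assumptions, not the authors' actual method.

```python
# Hypothetical scorer in the spirit of DR^{3}-Eval's evaluation framework.
# Dimension names follow the paper; the aggregation itself is an assumption.

DIMENSIONS = [
    "information_recall",
    "factual_accuracy",
    "citation_coverage",
    "instruction_following",
    "depth_quality",
]

def aggregate_score(scores, weights=None):
    """Combine per-dimension scores (each in [0, 1]) into a weighted mean."""
    if weights is None:
        weights = {d: 1.0 for d in DIMENSIONS}  # equal weights by default
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight

# Example: one report's (made-up) per-dimension scores.
report_scores = {
    "information_recall": 0.72,
    "factual_accuracy": 0.85,
    "citation_coverage": 0.60,
    "instruction_following": 0.90,
    "depth_quality": 0.55,
}
print(round(aggregate_score(report_scores), 3))  # equal-weight mean: 0.724
```

A weighted variant (e.g. upweighting factual accuracy to penalize hallucination) drops in by passing a `weights` dict; whether the real benchmark weights dimensions this way is not specified here.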

Why it matters?

This work is important because it provides a standardized and reliable way to measure the progress of AI research agents. By having a good benchmark, researchers can more easily compare different agents, identify their weaknesses (like struggling to find reliable information or making things up), and ultimately build better AI systems that can assist with complex research tasks.

Abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR^{3}-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR^{3}-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR^{3}-Agent based on multiple state-of-the-art language models demonstrate that DR^{3}-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.