Understanding DeepResearch via Reports

Tianyu Fan, Xinyao Niu, Yuxiang Zheng, Fengji Zhang, Chengen Huang, Bei Chen, Junyang Lin, Chao Huang

2025-10-13

Summary

This paper introduces a new way to test AI systems designed to do in-depth research, called DeepResearch agents. These agents are meant to be able to conduct research like an expert, but it's been hard to figure out how well they *actually* perform.

What's the problem?

Evaluating these DeepResearch agents is really difficult because they aren't just answering simple questions. They're supposed to gather information from many sources, come up with new ideas, and write up a complete report. Existing tests for AI don't really measure these complex skills – they usually focus on one thing at a time, like just answering a question or summarizing a text. It's hard to tell if a DeepResearch agent's report is actually good, accurate, and insightful.

What's the solution?

The researchers created a framework called DeepResearch-ReportEval to specifically assess these agents. They focused on evaluating the research *reports* the agents produce, looking at three key things: the overall quality of the report, whether it repeats information unnecessarily (redundancy), and how factually accurate it is. They used another AI model to act as a judge, and verified that this 'judge' AI's ratings agreed closely with those of human experts. They also built a set of 100 research questions covering 12 different real-world categories to use as a standard test.
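The judge-based scoring described above can be sketched as a simple loop: for each evaluation dimension, a prompt is built around the report and sent to a judge model, and the returned score is collected. This is a minimal illustration only; the function names (`build_prompt`, `evaluate_report`), the 1-to-5 scale, and the stubbed-out judge are assumptions for the sketch, not the paper's actual prompts or rubric (those are in the linked repository).

```python
# Minimal sketch of an LLM-as-a-Judge evaluation loop.
# Assumption: a real system would replace `stub_judge` with a call
# to an actual LLM; the prompt wording and 1-5 scale here are
# illustrative, not DeepResearch-ReportEval's real rubric.

DIMENSIONS = ["quality", "redundancy", "factuality"]

def build_prompt(dimension: str, query: str, report: str) -> str:
    """Assemble a judging prompt for one evaluation dimension."""
    return (
        f"Rate the following research report on {dimension}, "
        f"from 1 (worst) to 5 (best).\n"
        f"Research query: {query}\n"
        f"Report:\n{report}\n"
        f"Score:"
    )

def stub_judge(prompt: str) -> str:
    """Placeholder judge; a real implementation would query an LLM."""
    return "4"

def evaluate_report(query: str, report: str, judge=stub_judge) -> dict:
    """Score one report on every dimension using the judge model."""
    scores = {}
    for dim in DIMENSIONS:
        raw = judge(build_prompt(dim, query, report))
        scores[dim] = int(raw.strip())
    return scores

if __name__ == "__main__":
    result = evaluate_report(
        "What are the main drivers of lithium-ion battery degradation?",
        "Example report text ...",
    )
    print(result)
```

In the paper's setup, the interesting part is validating that such judge scores agree with human experts; the loop itself stays this simple, with the complexity living in the prompts and the concordance analysis.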

Why it matters?

This work is important because it provides a standardized way to compare different DeepResearch systems and see how they stack up. As these AI agents become more advanced and are expected to help with real research, we need a reliable way to measure their abilities and understand their strengths and weaknesses. This research helps move these systems from being simple information finders to true research partners.

Abstract

DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, which are capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions: quality, redundancy, and factuality, using an innovative LLM-as-a-Judge methodology achieving strong expert concordance. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: https://github.com/HKUDS/DeepResearch-Eval.