
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia

2025-08-27

Summary

This paper introduces ReportBench, a new way to test how well AI research assistants actually do research. These AI tools are getting good at quickly putting together reports, but we need to be sure the information they provide is accurate and complete before we rely on them.

What's the problem?

AI is now being used to speed up research, but that raises a big question: how do we *know* the research these AIs produce is trustworthy? Being fast isn't enough; the reports need to be factually correct, well-supported by evidence, and cover the important aspects of a topic. Until now, there hasn't been a good, standardized way to check these things.

What's the solution?

The researchers created ReportBench, which takes existing, high-quality survey papers from arXiv and uses them as a 'gold standard'. From each survey they derive a prompt on the same topic (a process they call reverse prompt engineering) and ask the AI research assistants to produce a report. ReportBench then automatically checks the AI's work in two main ways: first, it verifies that the sources the AI cites actually support the claims they are attached to, and second, it checks whether claims *not* backed by citations hold up against information found on the web. Together, these checks give a detailed picture of the quality of the AI's research.
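A minimal sketch of what this two-way check could look like in code, assuming hypothetical helpers for fetching a cited source, searching the web, and an LLM judge (the function names and structure are illustrative, not the paper's actual implementation):

```python
# Hypothetical sketch of ReportBench's two-way statement check.
# `fetch_source`, `web_search`, and `judge` are assumed callables, not real APIs.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Statement:
    text: str                 # a claim extracted from the generated report
    citation: Optional[str]   # identifier of the cited source (e.g. an arXiv ID), or None


def evaluate_report(
    statements: List[Statement],
    fetch_source: Callable[[str], str],   # returns the cited paper's text
    web_search: Callable[[str], str],     # returns web evidence for a claim
    judge: Callable[[str, str], bool],    # LLM judge: is the claim supported by the evidence?
) -> dict:
    """Check cited statements against their sources and uncited ones against the web."""
    cited_supported = cited_total = 0
    web_supported = web_total = 0

    for stmt in statements:
        if stmt.citation is not None:
            # Faithfulness: does the cited source actually support the claim?
            cited_total += 1
            if judge(stmt.text, fetch_source(stmt.citation)):
                cited_supported += 1
        else:
            # Veracity: is the uncited claim backed by evidence found on the web?
            web_total += 1
            if judge(stmt.text, web_search(stmt.text)):
                web_supported += 1

    return {
        "citation_faithfulness": cited_supported / cited_total if cited_total else None,
        "web_veracity": web_supported / web_total if web_total else None,
    }
```

The two returned scores correspond to the paper's two evaluation dimensions: whether cited statements are faithful to their sources, and whether uncited statements are actually true.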

Why it matters?

This work matters because it provides a reliable way to evaluate and improve AI research assistants. The results show that commercial Deep Research agents produce more comprehensive and reliable reports than standalone language models augmented with search or browsing tools, but they still fall short on the breadth and depth of their coverage and on factual consistency. By pinpointing these weaknesses, the researchers hope to push the development of more dependable and useful AI tools for researchers.

Abstract

The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: https://github.com/ByteDance-BandAI/ReportBench
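The abstract's corpus-construction step (reverse prompt engineering from published surveys) could be sketched as follows; the prompt wording and the `llm` callable are assumptions for illustration, not the authors' actual prompts:

```python
# Minimal sketch of reverse prompt engineering: recovering a research prompt
# from a published survey's title and abstract. `llm` is an assumed
# text-completion callable; the instruction text is illustrative only.

from typing import Callable


def derive_research_prompt(survey_title: str, survey_abstract: str,
                           llm: Callable[[str], str]) -> str:
    """Ask an LLM to reconstruct the kind of user prompt that would elicit this survey."""
    instruction = (
        "You are given the title and abstract of a published survey paper.\n"
        "Write the research prompt a user might give a Deep Research agent so that\n"
        "a comprehensive report on the same topic would be produced.\n\n"
        f"Title: {survey_title}\n"
        f"Abstract: {survey_abstract}\n\n"
        "Research prompt:"
    )
    return llm(instruction).strip()
```

Each derived prompt is then paired with its source survey, whose content and bibliography serve as the gold-standard reference for scoring the agent-generated report.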