DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks

Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, Dongzhan Zhou

2025-09-05

Summary

This paper introduces a new way to test 'deep research agents', AI systems designed to do things like conduct literature reviews and come up with new research ideas. The authors created a challenging benchmark called DeepResearch Arena to see how well these agents can actually perform research.

What's the problem?

Evaluating how good these AI research agents are is hard. Existing tests often use questions that are too simple, or questions that may have leaked into the data the AI was trained on, giving it an unfair advantage. It's also difficult to collect genuinely new and interesting research questions that would truly test an agent's abilities and reflect the kind of thinking researchers actually do.

What's the solution?

The researchers built DeepResearch Arena by analyzing transcripts from real academic seminars. They developed a Multi-Agent Hierarchical Task Generation (MAHTG) system that automatically finds promising research ideas in these seminar discussions and turns them into concrete research tasks, keeping each task traceable back to the original seminar while filtering out irrelevant information. The result is over 10,000 tasks drawn from more than 200 seminars across 12 academic disciplines. A rough sketch of how such a pipeline could be structured is shown below.
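To make the extract-filter-formulate flow concrete, here is a minimal Python sketch of a MAHTG-style pipeline. Everything in it is an assumption for illustration: the paper's actual agents, prompts, output formats, and filtering criteria are not described in this summary, and `call_llm` is a hypothetical placeholder for whatever LLM client you use.

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder; wire this to a real chat-completion client."""
    raise NotImplementedError("replace with a call to your LLM provider")

@dataclass
class Inspiration:
    idea: str         # candidate research-worthy idea, in the extractor's words
    source_span: str  # verbatim transcript excerpt, kept for traceability

def extract_inspirations(transcript: str) -> list[Inspiration]:
    """Stage 1: an extractor agent mines candidate ideas from a seminar."""
    reply = call_llm(
        "List research-worthy ideas raised in this seminar transcript. "
        "For each, quote the exact passage that inspired it, "
        "one 'idea ||| quote' pair per line.\n\n" + transcript
    )
    # 'idea ||| quote' is a made-up output format, not the paper's.
    pairs = [line.split("|||") for line in reply.splitlines() if "|||" in line]
    return [Inspiration(idea.strip(), quote.strip()) for idea, quote in pairs]

def is_research_worthy(candidate: Inspiration) -> bool:
    """Stage 2: a filter agent rejects small talk, logistics, and other noise."""
    verdict = call_llm(
        "Answer YES or NO: is this a substantive, novel research idea?\n"
        f"Idea: {candidate.idea}\nContext: {candidate.source_span}"
    )
    return verdict.strip().upper().startswith("YES")

def formulate_task(candidate: Inspiration) -> dict:
    """Stage 3: a task-writer agent turns an idea into a concrete research task."""
    task = call_llm(
        "Rewrite this idea as a self-contained research task with a clear "
        f"deliverable (e.g. a literature review or study design):\n{candidate.idea}"
    )
    # Keep the transcript quote alongside the task so it stays traceable.
    return {"task": task, "evidence": candidate.source_span}

def build_benchmark(transcripts: list[str]) -> list[dict]:
    """Run extract -> filter -> formulate over all seminar transcripts."""
    return [
        formulate_task(c)
        for t in transcripts
        for c in extract_inspirations(t)
        if is_research_worthy(c)
    ]
```

The key design point this sketch illustrates is traceability: every generated task carries the transcript excerpt that inspired it, so a task can always be audited against the original seminar discussion.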

Why it matters?

This new benchmark matters because it provides a more realistic and rigorous way to evaluate AI research agents. Because the questions are derived from recent expert discussions, it's less likely the AI has already 'seen' the answers during training, so strong performance is more likely to reflect genuine research ability. The evaluation shows these tasks remain substantially challenging even for current state-of-the-art agents, with clear performance gaps between models.

Abstract

Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers' attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts. The MAHTG system further translates research-worthy inspirations into high-quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.