DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin
2025-08-28
Summary
This paper introduces a new way to test AI systems that automatically conduct research and write cited summaries, much like a student writing the literature review for a paper.
What's the problem?
Right now, it is hard to tell how good these AI research systems are. Existing tests either ask simple factual questions or rely on expert-curated datasets that go stale or leak into the AI's training data, effectively 'teaching' it the answers. Neither reflects the complex, real-world task of understanding and summarizing a body of research that is constantly changing.
What's the solution?
The researchers created a benchmark called DeepScholar-bench. It draws research questions from recent scientific papers and asks the AI to write a 'related work' section: a cited summary of the prior research on the topic. They also built DeepScholar-base, a reference system that other AI systems can be compared against, and an automated evaluation that scores each system on how well it synthesizes knowledge, retrieves relevant sources, and verifies its claims.
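To make the evaluation idea concrete, below is a minimal, hypothetical sketch of how two of the three dimensions (retrieval quality and verifiability) could be scored automatically. The dataclass, function names, and the substring-matching verifiability proxy are illustrative assumptions, not the benchmark's actual metrics; the real framework is defined in the paper and its code repository.

```python
# Hypothetical evaluation sketch in the spirit of DeepScholar-bench's dimensions.
# All names and scoring rules here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Example:
    query: str                 # research topic drawn from a recent paper
    gold_references: set[str]  # papers cited in that paper's real related-work section


def retrieval_recall(cited_ids: set[str], gold_references: set[str]) -> float:
    """Fraction of ground-truth references the system actually retrieved and cited."""
    if not gold_references:
        return 0.0
    return len(cited_ids & gold_references) / len(gold_references)


def citation_precision(claims: list[tuple[str, str]], sources: dict[str, str]) -> float:
    """Toy verifiability proxy: share of (claim, source_id) pairs whose claim text
    appears verbatim in the cited source. A real framework would use a stronger
    check (e.g., an entailment model or LLM judge) rather than substring matching."""
    if not claims:
        return 0.0
    verified = sum(1 for claim, src in claims if claim.lower() in sources.get(src, "").lower())
    return verified / len(claims)


# Usage with dummy data:
example = Example(query="retrieval-augmented generation", gold_references={"p1", "p2", "p3"})
print(retrieval_recall({"p1", "p3", "p9"}, example.gold_references))  # ~0.67
print(citation_precision([("transformers use attention", "p1")],
                         {"p1": "Transformers use attention mechanisms."}))  # 1.0
```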
Why it matters?
This work matters because it provides a more realistic and challenging way to evaluate an AI system's ability to perform research. The results show that current systems still fall well short of human-level research synthesis, highlighting the need for continued development in this area.
Abstract
The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these exciting capabilities through generative research synthesis, performing retrieval over the live web and synthesizing discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work section of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining performance competitive with or higher than each other method. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of 19% across all metrics. These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at https://github.com/guestrin-lab/deepscholar-bench.
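The abstract notes that DeepScholar-base is built on the LOTUS API, which exposes semantic operators over pandas DataFrames. The sketch below shows, under stated assumptions, what a retrieval-then-synthesis pipeline of that flavor can look like; it is not the DeepScholar-base implementation, and the corpus, prompts, and model name are placeholders (operator names and signatures may differ across LOTUS versions).

```python
# Minimal sketch of a retrieval-then-synthesis pipeline in the style of LOTUS's
# semantic operators. NOT the DeepScholar-base implementation; the corpus,
# prompts, and model choice below are illustrative assumptions.
import pandas as pd
import lotus
from lotus.models import LM

# Configure the language model backing the semantic operators (assumed model name).
lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

# A tiny stand-in corpus of candidate papers; a real system would retrieve these
# from the live web or an ArXiv index.
papers = pd.DataFrame({
    "title": ["Dense Passage Retrieval", "LoRA", "RAG for Knowledge-Intensive NLP"],
    "abstract": [
        "We present a dense retriever for open-domain question answering...",
        "We propose low-rank adaptation of large language models...",
        "We combine parametric and non-parametric memory for generation...",
    ],
})

query = "retrieval-augmented generation"

# 1) Filter to papers relevant to the query, 2) rank the survivors,
# 3) aggregate them into a short, related-work-style summary that names its sources.
relevant = papers.sem_filter(f"{{abstract}} is relevant to the topic: {query}")
ranked = relevant.sem_topk(f"Which {{abstract}} is most relevant to: {query}?", K=2)
summary = ranked.sem_agg(
    f"Write a brief related-work paragraph on '{query}' citing each {{title}}"
)
print(summary)
```

The appeal of this operator style is that retrieval, ranking, and synthesis compose as ordinary DataFrame transformations, which is presumably what the abstract means by an efficient pipeline built on the LOTUS API.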