
SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu

2025-10-06


Summary

This paper tackles the challenge of automatically writing good literature reviews, a task that normally requires experts to spend a great deal of time reading and summarizing research. It points out that current AI systems still do this poorly, and that until now there has been no reliable way to measure exactly *where* they fall short.

What's the problem?

Writing a good survey paper, a comprehensive overview of research on a topic, is hard work. New AI tools try to automate it, but the results aren't as good as what a human expert would produce. The bigger issue is that there was no reliable way to pinpoint the specific weaknesses of these AI-generated surveys or to compare them with human-written ones in detail.

What's the solution?

The researchers created a new evaluation framework called SurveyBench. It draws real survey topics from 11,343 recent arXiv papers and 4,947 existing high-quality surveys. It doesn't just check whether the AI survey *says* the right things; it also uses quiz questions to test whether a reader could actually *understand* the topic better after reading the AI-generated survey. It assesses things like how broadly the survey covers the topic, how logically it is organized, and how clearly it explains complex ideas.
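To make the quiz-based idea concrete, here is a minimal sketch of how such an answerability test could be scored. This is not the paper's implementation: the quiz format, `answer_fn`, and `grade_fn` are placeholders for whatever reader model and grading method an evaluator plugs in.

```python
# Minimal sketch (not the authors' code) of quiz-based answerability:
# a reader model answers topic quiz questions using only the generated
# survey, and we score the fraction answered correctly.

from typing import Callable, List, Tuple


def quiz_answerability_score(
    survey_text: str,
    quiz: List[Tuple[str, str]],            # (question, reference answer) pairs
    answer_fn: Callable[[str, str], str],   # maps (survey, question) -> answer
    grade_fn: Callable[[str, str], bool],   # judges an answer against the reference
) -> float:
    """Fraction of quiz questions answerable from the survey alone."""
    correct = 0
    for question, reference in quiz:
        prediction = answer_fn(survey_text, question)
        if grade_fn(prediction, reference):
            correct += 1
    return correct / len(quiz) if quiz else 0.0


if __name__ == "__main__":
    # Toy stand-ins: a keyword-lookup answerer and an exact-match grader.
    survey = "Retrieval-augmented generation (RAG) grounds LLM outputs in retrieved documents."
    quiz = [("What does RAG ground LLM outputs in?", "retrieved documents")]
    answer = lambda text, q: "retrieved documents" if "retrieved documents" in text else "unknown"
    grade = lambda pred, ref: pred.strip().lower() == ref.strip().lower()
    print(quiz_answerability_score(survey, quiz, answer, grade))  # 1.0
```

In practice the answerer and grader would be model-based judges rather than string matching; the point of the sketch is only the scoring loop, which ties the survey's quality to a reader's ability to answer questions from it.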

Why it matters?

This work is important because it provides a much more thorough way to judge the quality of AI-generated literature reviews. By showing that current AI systems still fall well short of human performance (scoring about 21% lower on average in content-based evaluation), it highlights where further research is needed to improve these tools and, eventually, to help researchers keep up with the ever-growing volume of scientific literature.

Abstract

Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards, and there is no rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies. To fill this gap, we propose SurveyBench, a fine-grained, quiz-driven evaluation framework featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers' informational needs. Results show that SurveyBench effectively challenges existing LLM4Survey approaches (e.g., their content-based evaluation scores are on average 21% lower than humans').
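For readers who want a concrete picture of the metric hierarchy, the sketch below shows one way the outline, content, and non-textual dimensions named in the abstract could be rolled up into a single score. The sub-metric names follow the abstract, but the equal weighting and the aggregation scheme are illustrative assumptions, not the paper's actual formula.

```python
# Minimal sketch, not SurveyBench's scoring code, of rolling a multifaceted
# metric hierarchy up into one overall survey score.

from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DimensionScores:
    """Scores in [0, 1] for one evaluation dimension's sub-metrics."""
    sub_scores: Dict[str, float]

    def aggregate(self) -> float:
        # Simple mean of the sub-metrics within a dimension (an assumption).
        return sum(self.sub_scores.values()) / len(self.sub_scores)


def overall_score(outline: DimensionScores,
                  content: DimensionScores,
                  non_textual: DimensionScores,
                  weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted roll-up of the three top-level dimensions."""
    parts = (outline.aggregate(), content.aggregate(), non_textual.aggregate())
    return sum(w * p for w, p in zip(weights, parts))


if __name__ == "__main__":
    outline = DimensionScores({"coverage_breadth": 0.80, "logical_coherence": 0.70})
    content = DimensionScores({"synthesis_granularity": 0.60, "clarity_of_insights": 0.65})
    non_textual = DimensionScores({"non_textual_richness": 0.50})
    print(round(overall_score(outline, content, non_textual), 3))
```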