LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild

Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty

2025-10-17

Summary

This paper introduces a new way to test how well AI systems perform in-depth research: finding information across many live websites and synthesizing it into comprehensive, well-cited reports, much as a human researcher would.

What's the problem?

There are currently no good benchmarks for measuring how well AI performs this kind of 'deep research'. Existing tests are often too simple, focus on narrow domains, or ask ambiguous questions that different users interpret differently, making it hard to compare systems fairly. They also rarely force the AI to retrieve *current* information from the real web, so models can often answer from memorized knowledge alone.

What's the solution?

The researchers built LiveResearchBench, a benchmark of 100 expert-curated research tasks spanning daily life, enterprise, and academic settings. Each task is designed to be realistic and unambiguous, and to require up-to-date information gathered from many different websites. They also developed DeepEval, an evaluation suite that automatically scores the generated reports on qualities such as coverage of the topic, presentation, citation accuracy, consistency, and depth of analysis. Using these tools, they evaluated 17 different AI systems.
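To make this concrete, here is a minimal sketch of what a rubric-style report evaluator could look like. The five dimension names come from the paper's description of DeepEval; the `llm_judge` stub, the prompt format, and the 1-5 scale are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a rubric-style report evaluator in the spirit of
# DeepEval. Dimension names follow the paper's abstract; everything else
# (judge function, prompts, scale) is invented for illustration.

DIMENSIONS = [
    "coverage",            # does the report address all facets of the task?
    "presentation",        # is the report well organized and readable?
    "citation_accuracy",   # does each cited source actually support its claim?
    "consistency",         # are statements internally consistent?
    "depth_of_analysis",   # does the report go beyond surface-level summary?
]

def llm_judge(prompt: str) -> int:
    """Stub judge: replace with a real LLM call. Returns a fixed score here
    so the sketch runs end to end."""
    return 3  # placeholder score

def evaluate_report(task: str, report: str) -> dict[str, int]:
    """Score one report on every rubric dimension, one judge call each."""
    scores = {}
    for dim in DIMENSIONS:
        prompt = (
            f"Task: {task}\n\nReport:\n{report}\n\n"
            f"Rate the report's {dim.replace('_', ' ')} on a 1-5 scale. "
            "Answer with a single integer."
        )
        scores[dim] = llm_judge(prompt)
    return scores
```

In practice the paper combines four complementary evaluation protocols rather than a single judge call per dimension, but the per-dimension rubric structure above captures the basic shape of content- and report-level scoring.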

Why it matters?

This work is important because it provides a more reliable way to evaluate and improve AI systems that are designed to do complex research. As AI gets better at accessing and processing information, it could become a powerful tool for tasks like writing reports, conducting investigations, or even assisting with scientific discovery, but only if we can accurately measure and improve its abilities.

Abstract

Deep research -- producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources -- marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
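As an illustration of the 'citation accuracy and association' criterion mentioned in the abstract, the sketch below pairs each sentence of a report with the sources it cites, so that a verifier (human or model) can check whether each source actually supports its claim. The `[n]` citation format, the sentence splitter, and the verification step are all assumptions made for illustration, not the paper's exact protocol.

```python
import re

# Hypothetical sketch of a citation-association check. The [n] citation
# convention and the downstream verification step are assumptions.

CITATION = re.compile(r"\[(\d+)\]")

def pair_claims_with_sources(report: str, sources: dict[int, str]) -> list[tuple[str, list[str]]]:
    """Split a report into sentences and attach each one's cited URLs."""
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        ids = [int(n) for n in CITATION.findall(sentence)]
        urls = [sources[i] for i in ids if i in sources]
        if urls:  # keep only claims that carry at least one citation
            pairs.append((sentence, urls))
    return pairs

# Example: each (claim, urls) pair would then go to a verifier that checks
# whether the source text entails the claim.
report = "The market grew 12% in 2024 [1]. Analysts expect growth to slow [2]."
sources = {1: "https://example.com/market-report", 2: "https://example.com/forecast"}
for claim, urls in pair_claims_with_sources(report, sources):
    print(claim, "->", urls)
```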