DREAM: Deep Research Evaluation with Agentic Metrics
Elad Ben Avraham, Changhao Li, Ron Dorfman, Roy Ganz, Oren Nuriel, Amir Dudai, Aviad Aberdam, Noah Flynn, Elman Mansimov, Adi Kalyanpur, Ron Litman
2026-02-25
Summary
This paper focuses on the difficulty of accurately judging the quality of reports created by AI 'research agents', programs designed to do in-depth research and write analyst-style reports. It argues that current methods for evaluating these agents are flawed because they can be fooled by reports that *sound* good and cite sources correctly, even when the underlying information is wrong or outdated.
What's the problem?
The main problem is that we lack a good way to tell whether an AI research agent is actually doing good research. Existing evaluation methods often fall for what the authors call the 'Mirage of Synthesis': a report can appear impressive on the surface while containing factual errors or outdated information. This happens because the tools used to evaluate these reports are 'static'. Unlike a human researcher, they cannot actively check facts or account for how information changes over time.
What's the solution?
To fix this, the researchers created a new evaluation framework called DREAM (Deep Research Evaluation with Agentic Metrics). DREAM uses AI to evaluate AI, giving the evaluator the same tool-use capabilities as the research agent it judges, a principle the paper calls capability parity. It combines standard query-agnostic checks with adaptive metrics from a tool-calling AI agent that can actively verify facts, check temporal validity (whether the information is still current), and probe the agent's reasoning. This allows for a more thorough and reliable assessment of the research report.
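To make the idea concrete, here is a minimal toy sketch of what such an agentic evaluation loop might look like. This is not the paper's implementation: the `Claim`, `Verdict`, and `agentic_evaluate` names, the string-matching "verification", and the two-year freshness window are all illustrative assumptions, and the search tool is a stub standing in for a real web-search or retrieval call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical data types; the paper does not specify an API.
@dataclass
class Claim:
    text: str
    as_of: int  # year the claim's information refers to

@dataclass
class Verdict:
    claim: Claim
    supported: bool  # did the tool-retrieved evidence back the claim?
    current: bool    # is the claim's information still fresh?

def agentic_evaluate(claims: List[Claim],
                     search_tool: Callable[[str], str],
                     current_year: int) -> List[Verdict]:
    """Toy agentic evaluator: for each extracted claim, call an external
    tool to fetch evidence, check support, and flag stale information."""
    verdicts = []
    for c in claims:
        evidence = search_tool(c.text)  # tool call: stands in for web search
        # Naive support check; a real evaluator would use an LLM judge here.
        supported = c.text.lower() in evidence.lower()
        # Arbitrary freshness window of two years, purely for illustration.
        current = (current_year - c.as_of) <= 2
        verdicts.append(Verdict(c, supported, current))
    return verdicts

if __name__ == "__main__":
    # Stubbed "web" so the example is self-contained.
    kb = "Model X was released in 2021. Model X was deprecated in 2024."
    claims = [Claim("Model X was released in 2021", as_of=2021)]
    for v in agentic_evaluate(claims, lambda q: kb, current_year=2026):
        print(v.claim.text, v.supported, v.current)
```

The point of the sketch is the structure, not the checks themselves: the evaluator makes tool calls per claim rather than scoring the report's text statically, which is what lets it catch a fluent report whose facts no longer hold.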
Why it matters?
This work is important because as AI research agents become more common, we need reliable ways to ensure they produce trustworthy information. DREAM offers a more robust and scalable method for evaluating these agents, moving beyond checks of good writing and correct citations to verifying the accuracy and timeliness of the research itself. This is crucial for building confidence in AI-generated research and preventing the spread of misinformation.
Abstract
Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.