Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics
Maojia Song, Renhang Liu, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Soujanya Poria, Jingren Zhou
2025-10-08
Summary
This paper introduces a new way to test how well AI systems can answer complex questions that require searching for information online and then using that information to form an answer, a process called Retrieval-Augmented Generation (RAG). It highlights weaknesses in current systems and proposes a new benchmark and a better approach to building these AI systems.
What's the problem?
Current tests for these AI systems are flawed because the questions often contain clues that point toward the answer, making it easy for the AI to follow surface patterns instead of truly reasoning. Also, these tests usually only check whether the final answer is right or wrong, without revealing *why* a system failed: was the problem finding the right information, using it correctly, or recognizing that it lacks enough information to answer? This makes it hard to improve these systems effectively.
What's the solution?
The researchers created a new benchmark called WebDetective, which asks multi-hop questions without hinting at the reasoning path needed to answer them and pairs each question with a controlled Wikipedia sandbox, so every page the AI accesses can be traced. They also developed an evaluation method that breaks performance into three parts: how well the AI searches for the needed information, how well it uses the information it finds, and how well it refuses to answer when the evidence it has gathered is not enough. Finally, they built a new agentic workflow called EvidenceLoop that verifies information and keeps explicit track of evidence, which improves both searching and answering.
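To make the three-part breakdown concrete, here is a minimal sketch of how such factorised metrics could be computed from per-question traces. The field and function names (`visited_pages`, `gold_evidence_pages`, `factorised_metrics`, and so on) are illustrative assumptions, not the paper's actual implementation or data schema.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """One model run on one question (illustrative fields, not the paper's schema)."""
    visited_pages: set[str]        # pages the model actually opened in the sandbox
    gold_evidence_pages: set[str]  # pages that together suffice to answer
    answered: bool                 # did the model commit to an answer?
    correct: bool                  # was the committed answer right?

def factorised_metrics(traces: list[Trace]) -> dict[str, float]:
    """Decompose end-to-end accuracy into search, knowledge use, and refusal."""
    sufficient = [t for t in traces if t.gold_evidence_pages <= t.visited_pages]
    insufficient = [t for t in traces if not t.gold_evidence_pages <= t.visited_pages]

    # Search sufficiency: how often did the model retrieve all needed evidence?
    search_sufficiency = len(sufficient) / len(traces) if traces else 0.0
    # Knowledge utilisation: given sufficient evidence, how often is the answer correct?
    knowledge_utilisation = (
        sum(t.correct for t in sufficient) / len(sufficient) if sufficient else 0.0
    )
    # Appropriate refusal: given missing evidence, how often does the model decline
    # to answer instead of guessing?
    appropriate_refusal = (
        sum(not t.answered for t in insufficient) / len(insufficient) if insufficient else 0.0
    )
    return {
        "search_sufficiency": search_sufficiency,
        "knowledge_utilisation": knowledge_utilisation,
        "appropriate_refusal": appropriate_refusal,
    }
```

Conditioning the last two metrics on whether the needed evidence was actually retrieved is what separates "couldn't find it" failures from "found it but couldn't use it" and "should have refused" failures, which a single pass rate cannot do.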
Why it matters?
This work is important because it shows that current AI systems are good at executing a reasoning path that is laid out for them but struggle to discover one on their own. By identifying these weaknesses and providing a better way to test and build these systems, the researchers are helping to move the field toward AI that can reason through complex problems autonomously, rather than just mimicking surface patterns.
Abstract
RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.
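As a rough illustration of the kind of verify-and-track loop the abstract describes for EvidenceLoop, the sketch below assumes hypothetical helper callables (`search`, `extract_evidence`, `verify`, `synthesise_answer`); it is a schematic under those assumptions, not the authors' implementation.

```python
def evidence_loop(question, search, extract_evidence, verify, synthesise_answer,
                  max_iterations=5):
    """Schematic verification loop with explicit evidence tracking.

    All callables are hypothetical stand-ins for the components the paper
    describes: searching, extracting evidence, verifying it, and synthesising
    an answer from the verified record.
    """
    evidence = []  # running record of verified evidence snippets
    for _ in range(max_iterations):
        # Search for what is still missing, conditioned on evidence gathered so far.
        results = search(question, evidence)
        # Extract candidate facts and keep only those that pass verification.
        for candidate in extract_evidence(question, results):
            if verify(candidate, results):
                evidence.append(candidate)
        answer, supported = synthesise_answer(question, evidence)
        if supported:  # every reasoning step is backed by tracked evidence
            return answer
    # If the loop ends without sufficient verified evidence, refuse rather than guess.
    return "I don't have enough evidence to answer."
```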