Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta

2026-03-13

Summary

This research investigates whether AI agents that can work with different types of documents, such as PDFs, actually think strategically when completing tasks, or whether they simply try many different approaches until they stumble upon the right answer.

What's the problem?

Currently, it's unclear whether advanced AI agents truly *understand* how to find information within documents efficiently. They might reach the correct answer, but that doesn't mean they got there in a smart way. The challenge is distinguishing genuine reasoning skill from a large number of random attempts. Until now there was no test designed for this, and existing benchmarks didn't clearly separate a clever agent from a lucky one.

What's the solution?

The researchers created a new benchmark called MADQA, which includes 2,250 questions based on 800 different PDF documents. The questions were designed to be challenging and to highlight differences in how well agents perform. They also developed a new evaluation method that looks at how accurately agents answer questions *relative to* how much effort they expend doing so. This 'accuracy-effort trade-off' helps determine whether an agent is being efficient or just brute-forcing its way to a solution. They then tested several of the best AI agents on this new benchmark.
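To make the accuracy-effort trade-off concrete, here is a minimal sketch of how one might compare two agents on it. This is purely illustrative: the paper's exact protocol is not described here, so the field names (`correct`, `num_actions`) and the example numbers are assumptions, not the authors' actual metric.

```python
def accuracy_effort(results):
    """Summarize per-question results for one agent.

    Each result is a dict with:
      - "correct": bool, whether the final answer matched the reference
      - "num_actions": int, actions taken (e.g., searches, pages opened)
    Returns (accuracy, mean_effort).
    """
    n = len(results)
    accuracy = sum(r["correct"] for r in results) / n
    mean_effort = sum(r["num_actions"] for r in results) / n
    return accuracy, mean_effort

# Two hypothetical agents with equal accuracy but very different effort:
strategic = [{"correct": True, "num_actions": 4},
             {"correct": False, "num_actions": 6},
             {"correct": True, "num_actions": 5}]
brute_force = [{"correct": True, "num_actions": 30},
               {"correct": False, "num_actions": 45},
               {"correct": True, "num_actions": 28}]

acc_s, eff_s = accuracy_effort(strategic)
acc_b, eff_b = accuracy_effort(brute_force)
assert acc_s == acc_b   # same raw accuracy...
assert eff_s < eff_b    # ...but far less effort per question
```

The point of the example: raw accuracy alone cannot separate the two agents, while the effort axis makes the brute-force behaviour visible.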

Why it matters?

This work is important because it shows that even the best AI agents still struggle with strategic thinking when dealing with documents. While they can sometimes match human accuracy, they do so by taking far more actions and often get stuck in unproductive loops. This research highlights the need to move beyond simply getting the right answer and toward building AI that can reason effectively and efficiently, which is crucial for automating complex tasks that depend on understanding and using information from documents.

Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.