Over-Searching in Search-Augmented Large Language Models

Roy Xie, Deepak Gopinath, David Qiu, Dong Lin, Haitian Sun, Saloni Potdar, Bhuwan Dhingra

2026-01-12

Summary

This paper investigates a problem that arises when large language models (LLMs) are paired with search engines: these systems often search for information even when searching doesn't help, and the extra searching can make their answers worse.

What's the problem?

When LLMs are combined with search tools to answer questions, they frequently 'over-search'. This means they call the search engine even when the model already knows the answer or when the search results are irrelevant. This wastes computing power and can lead the LLM to generate incorrect or misleading answers, because it incorporates bad information from the search. The researchers wanted to understand *when* and *why* over-searching happens, looking at different types of questions, different LLMs, different retrieval conditions, and how the behavior changes over the course of a conversation.

What's the solution?

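The researchers systematically tested how often LLMs over-search under various conditions. They found that search generally improves accuracy on questions that can be answered, but makes models worse at declining to answer questions that can't be. Over-searching is more pronounced in complex reasoning models and deep research systems, gets worse when the search results are noisy, and compounds as a conversation goes on. Interestingly, they discovered that when the retrieved results include negative evidence (evidence that a question *can't* be answered), the LLM is better at abstaining instead of making things up. To measure the problem, they created a new metric called 'Tokens Per Correctness' (TPC), which captures the trade-off between how well the LLM answers and how much it costs to get there. Finally, they explored ways to reduce over-searching at both the query and retrieval levels, and released a dataset, OverSearchQA, to help other researchers work on this issue.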

Why does it matter?

Making search-augmented LLMs more efficient matters for both cost and quality. Over-searching isn't just a waste of resources; it actively makes the LLM less reliable. By understanding *why* it happens and developing ways to prevent it, we can build AI systems that are both more efficient and more trustworthy, especially as we rely on them for more complex tasks and longer interactions.

Abstract

Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search -- unnecessarily invoking the search tool even when it does not improve response quality, which leads to computational inefficiency and hallucinations by incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our findings show: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release OverSearchQA to foster continued research into efficient search-augmented LLMs.
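
The abstract names Tokens Per Correctness (TPC) but does not give its formula here. As a rough illustration only, the sketch below assumes one plausible reading: total tokens consumed across a set of queries divided by the number of correct responses, where a correct abstention on an unanswerable query counts as correct. The `Response` type, its fields, and the function name are hypothetical, and the paper's actual definition may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Response:
    """One model response to one query (hypothetical record type)."""
    tokens: int    # assumed: all tokens consumed, including retrieved search context
    correct: bool  # assumed: correct answer, or correct abstention if unanswerable


def tokens_per_correctness(responses: Iterable[Response]) -> float:
    """Assumed TPC: total tokens spent divided by number of correct responses.

    Lower is better. Returns infinity when nothing is correct, since no
    amount of spending bought a correct response.
    """
    responses = list(responses)
    total_tokens = sum(r.tokens for r in responses)
    num_correct = sum(1 for r in responses if r.correct)
    if num_correct == 0:
        return float("inf")
    return total_tokens / num_correct


# Toy numbers, invented purely for illustration (not results from the paper).
no_search = [Response(100, True), Response(120, False), Response(80, True)]
with_search = [Response(900, True), Response(1100, True), Response(1000, False)]

tpc_no_search = tokens_per_correctness(no_search)
tpc_with_search = tokens_per_correctness(with_search)
```

In this toy setup both configurations get two of three responses right, but the search-heavy one spends ten times as many tokens per correct response. That is the kind of performance-cost gap a metric like TPC is meant to expose, which plain accuracy would hide.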
