Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
James Xu Zhao, Bryan Hooi, See-Kiong Ng
2025-09-09
Summary
This research investigates whether letting AI models 'think' for longer at inference time, a technique called test-time scaling, actually improves their performance on knowledge-intensive tasks, such as answering questions that depend on facts the model has learned.
What's the problem?
While letting AI models reason for a longer time often helps with general tasks, this paper shows it doesn't automatically make them better at knowledge-intensive tasks. In fact, simply increasing the 'thinking time' can sometimes *increase* errors and cause the model to confidently state incorrect information, known as 'hallucinations'. The core issue is that more computation doesn't guarantee more accuracy when dealing with facts.
What's the solution?
The researchers tested 12 different reasoning models on two challenging benchmarks that demand strong factual knowledge, comparing performance under different amounts of 'thinking time'. They did not look only at accuracy; they also analyzed *why* the models made mistakes, asking whether longer reasoning led to fewer hallucinations or to better factual recall. They found that models often decline to answer when they think for longer, so hallucinations drop simply because the model abstains, not because its knowledge actually improves.
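To make the abstention effect concrete, here is a minimal illustrative sketch in Python. It is not the authors' released evaluation code, and the response labels and toy numbers are assumptions; it simply splits responses into correct, hallucinated, and abstained, showing how the hallucination rate can fall while accuracy stays flat.

# Minimal sketch (illustrative, not the paper's evaluation code): decompose
# model responses into correct answers, hallucinations, and abstentions, so a
# drop in hallucination rate can be traced to abstention rather than recall.
from collections import Counter

def response_rates(labels):
    """Return the fraction of correct, hallucinated, and abstained responses."""
    counts = Counter(labels)
    total = len(labels)
    return {k: counts.get(k, 0) / total for k in ("correct", "hallucinated", "abstained")}

# Toy numbers (assumed): with a short reasoning budget the model attempts every
# question; with a longer budget it abstains on some it previously got wrong.
short_budget = ["correct"] * 40 + ["hallucinated"] * 60
long_budget  = ["correct"] * 40 + ["hallucinated"] * 30 + ["abstained"] * 30

print(response_rates(short_budget))  # hallucination rate 0.60, accuracy 0.40
print(response_rates(long_budget))   # hallucination rate 0.30, accuracy still 0.40

In this toy case the longer reasoning budget halves the hallucination rate, yet the model answers no more questions correctly, which is the pattern the paper attributes to abstention rather than improved factual recall.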
Why it matters?
This work is important because it shows that simply scaling up the reasoning process isn't a guaranteed solution for improving AI reliability, especially when factual correctness is crucial. It highlights the need to develop better methods for ensuring AI models are truthful and avoid making things up, even when given more time to think. It suggests that focusing on *how* models reason, rather than just *how long* they reason, is key to building trustworthy AI systems.
Abstract
Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that compared to non-thinking, enabling thinking remains beneficial. Code and data are available at https://github.com/XuZhao0/tts-knowledge