Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu
2025-02-19
Summary
This paper examines how well certain AI language models, such as QwQ, Deepseek-R1, and LIMO, improve their performance when given more computation to think during inference, a process called test-time scaling.
What's the problem?
Many people assume that giving these AI models more time to think (by generating longer chains of thought) always leads to better answers. The researchers found that this isn't the case: longer thinking processes sometimes produced worse answers, especially when the model repeatedly tried to revise its own work.
What's the solution?
The researchers compared different ways of spending extra thinking time and found that having the AI generate multiple independent answers in parallel worked better than having it extend a single long chain of thought. They also created a new method called 'Shortest Majority Vote', which combines parallel sampling with a preference for shorter answers, since shorter solutions often turned out to be more accurate.
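To make the idea concrete, here is a minimal sketch of a vote that prefers shorter solutions. It is an illustration of the general technique, not the paper's exact scoring rule: samples are grouped by final answer, the answer with the most votes wins, and ties are broken in favor of the group with the shorter average chain-of-thought.

```python
from collections import defaultdict

def shortest_majority_vote(solutions):
    """Pick a final answer from parallel samples.

    `solutions` is a list of (answer, cot_length) pairs.
    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for answer, cot_length in solutions:
        groups[answer].append(cot_length)
    # Rank by (vote count descending, mean CoT length ascending).
    return max(
        groups.items(),
        key=lambda kv: (len(kv[1]), -sum(kv[1]) / len(kv[1])),
    )[0]

# Five parallel samples: "42" has more votes and shorter chains.
samples = [("42", 1200), ("42", 900), ("17", 3000),
           ("42", 1100), ("17", 2800)]
print(shortest_majority_vote(samples))  # prints "42"
```

In a tie on vote count, the shorter-chain answer wins, reflecting the paper's observation that correct solutions tend to be shorter than incorrect ones.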
Why it matters?
This matters because it challenges what we thought we knew about how to get the best performance out of AI language models. By showing that longer isn't always better and proposing a new way to use AI's thinking time, this research could help make AI systems more accurate and efficient in real-world applications, without needing to make the models themselves bigger or more complex.
Abstract
The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI's o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors like QwQ, Deepseek-R1 (R1) and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study found that longer CoTs of these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows this phenomenon is closely related to models' self-revision capabilities - longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1 and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models' test-time scalability compared to conventional majority voting approaches.