Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

C. Opus, A. Lawsen

2025-06-15

Summary

This paper argues that when advanced reasoning AI models seem to fail at solving complex planning puzzles, the failures are often caused not by an inability to think, but by how the experiments used to test them were set up.

What's the problem?

The benchmarks used to measure reasoning ability impose hidden constraints, such as limits on how many tokens a model can output, and some even include puzzles that are impossible to solve. Models that hit these limits, or that deliberately cut their answers short to stay within them, get scored as failures even when their reasoning is sound.
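To see why token limits alone can make a capable model look like it failed, consider the Tower of Hanoi, one of the planning puzzles involved: the optimal solution for n disks requires 2^n - 1 moves, so writing out every move grows exponentially. The sketch below is illustrative only; the tokens-per-move cost and the output budget are assumed numbers, not figures from the paper.

```python
# Illustrative sketch: why an exhaustive move list can blow past a token budget.
# The optimal Tower of Hanoi solution for n disks takes 2**n - 1 moves.

TOKENS_PER_MOVE = 7    # assumed rough cost of printing one move, e.g. "[3, 0, 2], "
TOKEN_BUDGET = 64_000  # assumed model output limit

for n in (5, 10, 15, 20):
    moves = 2**n - 1
    tokens = moves * TOKENS_PER_MOVE
    verdict = "fits" if tokens <= TOKEN_BUDGET else "exceeds budget"
    print(f"n={n:2d}: {moves:>9,} moves, about {tokens:>9,} tokens ({verdict})")
```

Under these assumptions a 10-disk puzzle fits comfortably, while a 15-disk puzzle already demands several times the budget, so a model that stops early is reacting to a real constraint rather than failing to reason.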

What's the solution?

The authors carefully examined these test designs and found that many reported failures occur because the models recognize their output limits and deliberately stop producing long answers. When the tests are adjusted, for example by asking models to generate a function that produces the solution instead of an exhaustive move list, and by removing unsolvable puzzle instances, the models demonstrate much stronger reasoning than previously reported.
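The idea of asking for a function instead of a move list can be made concrete with Tower of Hanoi: a few lines of code fully specify the optimal solution for any number of disks, while listing the moves takes 2^n - 1 entries. This is a minimal sketch of such a function; the name and move format are illustrative, not taken from the paper.

```python
def hanoi(n, source=0, target=2, spare=1):
    """Yield (disk, from_peg, to_peg) moves for the optimal n-disk solution."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, move the largest disk,
    # then move the smaller disks on top of it.
    yield from hanoi(n - 1, source, spare, target)
    yield (n, source, target)
    yield from hanoi(n - 1, spare, target, source)

moves = list(hanoi(3))
print(len(moves))  # prints 7, i.e. 2**3 - 1 moves for 3 disks
```

A model that emits this short program has demonstrated the same planning ability as one that writes out every move, but in constant output length, which is why the function-based test format sidesteps the token-limit artifact.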

Why it matters?

This matters because reasoning AI may be more capable than current benchmarks suggest, and it shows we need evaluation methods that distinguish genuine reasoning failures from artifacts of the test setup. Better evaluations can lead to AI systems we can more confidently trust with complex planning and problem-solving tasks.

Abstract

Evaluation artifacts, particularly token limits and unsolvable puzzle instances in benchmarks, lead to misreported failures of Large Reasoning Models on planning puzzles.