When Reasoning Meets Its Laws
Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, Huan Zhang
2025-12-22
Summary
This paper investigates why large reasoning models, despite being powerful, often behave in counterintuitive ways while working through problems, which holds back their reasoning ability. It proposes a set of 'Laws of Reasoning' describing how these models *should* behave, and then tests whether current models actually follow those laws.
What's the problem?
Large reasoning models (LRMs) often don't reason in a way that makes intuitive sense, even though they perform well on many tasks. This means we don't fully understand *how* they arrive at their answers, which limits how far we can improve them. Specifically, the paper focuses on two expectations: a more complex question should require more 'computational effort' from the model, and breaking a complex question into simpler parts should lead to the same result as tackling it all at once. The difficulty is that how complex a question truly is, and how much 'effort' a model is spending, are hard to measure directly.
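For readers who prefer symbols, here is one rough way to write down these two expectations. This is an illustrative formalization under assumed notation (C for compute, k for complexity), not the paper's own definitions.

```latex
% Illustrative formalization only -- the notation ($C$, $k$, $\alpha$, $\beta$)
% is assumed for exposition and is not taken verbatim from the paper.
% $C(q)$: reasoning compute spent on question $q$;  $k(q)$: complexity of $q$.

% Compute law (hypothesis): compute scales roughly linearly with complexity.
\[ C(q) \approx \alpha\, k(q) + \beta, \qquad \alpha > 0 \]

% Monotonicity: a more complex question should not receive less compute.
\[ k(q_1) \le k(q_2) \;\Longrightarrow\; C(q_1) \le C(q_2) \]

% Compositionality: a question that decomposes into sub-questions
% $q_1, \dots, q_n$ should cost about as much as the parts combined.
\[ C(q) \approx \sum_{i=1}^{n} C(q_i) \]
```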
What's the solution?
The researchers created a framework called the 'Laws of Reasoning' (LoRe), which includes a 'compute law' hypothesizing that reasoning effort should scale roughly linearly with question complexity, plus a supplementary 'accuracy law' built on top of it. Because question complexity is hard to measure directly, they instead test two tractable properties of these laws: 'monotonicity' (a more complex question should never demand *less* effort or be *easier* to get right) and 'compositionality' (working through a question in parts should take roughly the same effort, and reach the same answer, as tackling it whole). They built a new benchmark, LoRe-Bench, to measure these properties in existing models, and found that most models show reasonable monotonicity but fall short on compositionality. In response, they developed a finetuning method that pushes models toward compute-law compositionality.
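As a concrete illustration, the toy sketch below shows how these two properties could be checked once per-question reasoning compute has been collected, using reasoning-token counts as an assumed proxy for compute. The `Question` class, the pairwise monotonicity score, and the relative compositionality gap are illustrative choices, not LoRe-Bench's actual metrics.

```python
# Illustrative sketch only -- not the LoRe-Bench implementation.
# Assumes reasoning "compute" is proxied by reasoning-token counts that
# have already been collected for each question.

from dataclasses import dataclass
from itertools import combinations
from typing import Sequence


@dataclass
class Question:
    text: str
    complexity: int        # e.g., number of reasoning steps (assumed known)
    reasoning_tokens: int  # measured compute proxy for the model's answer


def monotonicity_score(questions: Sequence[Question]) -> float:
    """Fraction of question pairs where the more complex question did not
    get *less* compute. 1.0 means compute is perfectly monotone in complexity."""
    consistent, total = 0, 0
    for a, b in combinations(questions, 2):
        if a.complexity == b.complexity:
            continue
        lo, hi = (a, b) if a.complexity < b.complexity else (b, a)
        total += 1
        if hi.reasoning_tokens >= lo.reasoning_tokens:
            consistent += 1
    return consistent / total if total else 1.0


def compositionality_gap(composite: Question, parts: Sequence[Question]) -> float:
    """Relative gap between compute spent on a composite question and the
    summed compute of its sub-questions (0.0 = perfectly compositional)."""
    parts_total = sum(p.reasoning_tokens for p in parts)
    return abs(composite.reasoning_tokens - parts_total) / max(parts_total, 1)


if __name__ == "__main__":
    qs = [
        Question("2-step arithmetic", complexity=2, reasoning_tokens=180),
        Question("4-step arithmetic", complexity=4, reasoning_tokens=390),
        Question("6-step arithmetic", complexity=6, reasoning_tokens=610),
    ]
    print("monotonicity:", monotonicity_score(qs))
    print("compositionality gap:",
          compositionality_gap(qs[2], parts=[qs[0], qs[1]]))
```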
Why it matters?
Understanding how reasoning models work is crucial for improving them. By identifying these 'Laws of Reasoning' and showing that models which follow them perform better, this research provides a path towards building more reliable and capable AI systems. It suggests that focusing on *how* a model reasons, not just *what* answer it gives, is key to unlocking its full potential and making it more trustworthy.
Abstract
Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose the compute law with the hypothesis that reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since question complexity is difficult to quantify in practice, we examine these hypotheses through two properties of the laws: monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncover synergistic effects across properties and laws. Project page: https://lore-project.github.io/