Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization

Mohammad Mahdi Samiei Paqaleh, Arash Marioriyad, Arman Tahmasebi-Zadeh, Mohamadreza Fereydooni, Mahdi Ghaznavai, Mahdieh Soleymani Baghshah

2025-10-13

Summary

This paper tackles the challenge of defining and measuring reasoning ability in AI, particularly in large language models. While AI is getting better at complex tasks, we lack a good way to determine if it's *actually* reasoning, or just recognizing patterns. The paper proposes a new framework called 'Complexity Out of Distribution' (Complexity OoD) generalization to address this.

What's the problem?

Currently, it's hard to tell if an AI is truly reasoning or simply memorizing and applying patterns it's seen before. Existing methods for evaluating AI, like testing how well it performs on new data, don't specifically target reasoning skills. The core issue is that we need a way to test an AI's ability to handle problems that require more complex thinking – problems that go beyond what it was directly trained on, not just different examples of the same type of problem. It's like testing if a student can apply concepts to a completely new situation, not just solve similar practice problems.

What's the solution?

The authors suggest evaluating AI by testing it on problems that require a higher level of 'complexity' than anything it saw during training. This complexity can be measured in a couple of ways: how intricate the solution needs to be (representational complexity) or how many steps of reasoning are required to solve it (computational complexity). They connect this idea to 'Kolmogorov complexity,' which is a theoretical measure of how concise a solution can be, and propose practical ways to estimate complexity, like counting objects, relationships, or reasoning steps. Essentially, they want to see if an AI can handle problems that demand more 'brainpower' than it's used to.
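The idea of an operational complexity proxy can be made concrete with a small sketch. This is an illustrative toy example, not code from the paper: it uses reasoning-step count as the proxy for a chained-addition task (the function names and the split construction are assumptions for illustration).

```python
# Toy sketch of a Complexity-OoD evaluation split. The proxy here is
# computational complexity: the number of reasoning steps needed to
# sum a list of integers (one step per pairwise addition).

def complexity(problem: list[int]) -> int:
    """Operational proxy: one reasoning step per pairwise addition."""
    return len(problem) - 1

def make_complexity_ood_split(problems, max_train_complexity):
    """Train on low-complexity instances; reserve for testing only the
    instances whose minimal solution complexity exceeds everything
    seen during training."""
    train = [p for p in problems if complexity(p) <= max_train_complexity]
    held_out = [p for p in problems if complexity(p) > max_train_complexity]
    return train, held_out

problems = [[1, 2], [3, 4, 5], [1, 1, 1, 1], [2, 2, 2, 2, 2]]
train, held_out = make_complexity_ood_split(problems, max_train_complexity=2)

# Every held-out instance demands more reasoning steps than any
# training instance, which is the defining property of Complexity OoD.
assert max(map(complexity, train)) < min(map(complexity, held_out))
```

A model that merely memorizes two- and three-term patterns from the training split would have to generalize over solution structure, not just surface form, to handle the held-out set.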

Why it matters?

This work is important because it provides a new way to push AI development beyond simply scaling up models and datasets. If we want AI to truly reason, we need to specifically test and train for that ability. The paper suggests changes to how we design benchmarks, how we train AI (focusing on the *process* of solving problems, not just the answer), and even how we build the AI's underlying architecture to better handle complex reasoning tasks. It highlights that simply giving an AI more data won't automatically make it a better reasoner; we need to focus on building systems that can actively manage and allocate their 'thinking' resources.
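The shift from answer-only to process-focused supervision can be sketched as follows. The target format and scoring function below are hypothetical illustrations, not the paper's method: they simply contrast grading a final answer with grading each step of a solution trace.

```python
# Hypothetical contrast between answer-only and trace (process)
# supervision for a simple arithmetic question. Supervising the trace
# lets every intermediate step be checked, not just the final answer.

question = "(2 + 3) * 4 = ?"

# Answer-only supervision: a single signal at the very end.
answer_only_target = "20"

# Trace supervision: one target per reasoning step.
trace_target = [
    "2 + 3 = 5",   # step 1
    "5 * 4 = 20",  # step 2
    "20",          # final answer
]

def trace_score(prediction_steps, target_steps):
    """Fraction of steps in the predicted trace that match the target."""
    matches = sum(p == t for p, t in zip(prediction_steps, target_steps))
    return matches / len(target_steps)

# A prediction that reaches the right answer via a wrong intermediate
# step is still penalized under process supervision.
score = trace_score(["2 + 3 = 6", "5 * 4 = 20", "20"], trace_target)
```

Under answer-only supervision the faulty trace above would score perfectly; under trace supervision it scores 2/3, exposing the spurious shortcut.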

Abstract

Recent progress has pushed AI frontiers from pattern-recognition tasks toward problems that require step-by-step, System-2-style reasoning, especially with large language models. Yet, unlike learning, where generalization and out-of-distribution (OoD) evaluation concepts are well formalized, there is no clear, consistent definition or metric for reasoning ability. We propose Complexity Out of Distribution (Complexity OoD) generalization as a framework and problem setting to define and measure reasoning. A model exhibits Complexity OoD generalization when it maintains performance on test instances whose minimal required solution complexity, either representational (richer solution structure) or computational (more reasoning steps/program length), exceeds that of all training examples. We formalize complexity via solution-description Kolmogorov complexity and operational proxies (e.g., object/relation counts; reasoning-step counts), clarifying how Complexity OoD differs from length and compositional OoD. This lens unifies learning and reasoning: many cases solvable with System-1-like processing at low complexity become System-2-like under complexity pressure, while System 2 can be viewed as generalization over solution structures. We translate this perspective into practice with recommendations for operationalizing Complexity OoD across the stack: incorporating complexity into benchmark and evaluation-metric design, rethinking supervision to target solution traces, seeking and designing inductive biases for Complexity OoD generalization, and addressing learning-to-reason spillovers such as spurious shortcuts, semantic robustness, catastrophic forgetting, and step-wise calibration. Because Complexity OoD cannot be solved by scaling data alone, progress toward robust reasoning will require architectures and training regimes that explicitly model and allocate computation with respect to complexity.