Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Priyanka Kargupta, Shuyue Stella Li, Haocheng Wang, Jinu Lee, Shan Chen, Orevaoghene Ahia, Dean Light, Thomas L. Griffiths, Max Kleiman-Weiner, Jiawei Han, Asli Celikyilmaz, Yulia Tsvetkov
2025-11-26
Summary
This research explores why large language models (LLMs), despite being good at complicated tasks, often struggle with simpler ones, suggesting they 'think' differently than humans. The study aims to bridge the gap between how LLMs and people reason by looking at the specific mental steps involved in problem-solving.
What's the problem?
LLMs can get the right answer, but it's not clear *how* they're doing it. They often fail on problems that seem easy for humans, hinting that they aren't using the same reasoning processes we do. The core issue is a lack of understanding of which cognitive abilities LLMs possess and whether they actually deploy them when solving problems. Current research tends to focus on easily measurable aspects of reasoning, potentially overlooking crucial elements like meta-cognitive monitoring.
What's the solution?
Researchers created a detailed framework based on cognitive science, identifying 28 different mental 'building blocks' involved in reasoning. They then analyzed over 192,000 reasoning traces from 18 different LLMs across text, images, and audio, comparing them against how people actually work through problems (gathered from 54 'think-aloud' sessions). They found that LLMs tend to rely on a rigid, step-by-step approach and don't use the more flexible, abstract thinking that humans do. Finally, they developed a method to 'guide' the LLMs during problem-solving, prompting them to use more effective reasoning strategies, which significantly improved their results.
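To make the idea of test-time guidance concrete, here is a minimal, purely illustrative sketch of how scaffolding prompts for under-used cognitive elements might be attached to a problem before it is sent to a model. The element names, prompt wording, and function names below are assumptions for illustration; they are not the paper's actual method or taxonomy labels.

```python
# Illustrative sketch only: element names and prompt wording are invented
# placeholders, not the paper's actual guidance method.

SCAFFOLD_PROMPTS = {
    "self-awareness": (
        "Before answering, state what you do and do not know "
        "about this problem."
    ),
    "representation": (
        "Restate the problem in at least two different forms "
        "(e.g., a table, a step list)."
    ),
    "monitoring": (
        "After each step, check that it is consistent with the "
        "problem's constraints."
    ),
}

def guided_prompt(problem: str, elements: list[str]) -> str:
    """Prepend scaffolding instructions for the chosen elements."""
    instructions = [SCAFFOLD_PROMPTS[e] for e in elements if e in SCAFFOLD_PROMPTS]
    return "\n".join(instructions + ["Problem: " + problem])

def solve_with_guidance(problem: str, elements: list[str], model) -> str:
    """`model` is any callable mapping a prompt string to a completion."""
    return model(guided_prompt(problem, elements))
```

In this framing, the guidance step is just prompt construction: the model itself is unchanged, and the scaffold nudges it toward the meta-cognitive behaviors that the study found correlate with success but are rarely deployed spontaneously.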
Why it matters?
This work is important because it provides a way to systematically understand and improve how LLMs reason. By connecting AI research to established principles of human cognition, it moves beyond simply getting correct answers and focuses on *how* those answers are reached. This could lead to the development of more robust and reliable AI systems that aren't just good at mimicking patterns, but truly understand and solve problems like humans do, and also provides a new way to test theories about how *we* think.
Abstract
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-utilize cognitive elements correlated with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. Meta-analysis of 1.6K LLM reasoning papers reveals that the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) while neglecting meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while providing tools to test theories of human cognition at scale.