Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
Ming Li, Chenrui Fan, Yize Cheng, Soheil Feizi, Tianyi Zhou
2025-12-26
Summary
This paper investigates how large language models 'think' when solving problems, moving beyond just looking at the words they produce to understand the actual steps they take in their reasoning process.
What's the problem?
While large language models can now show the steps they take to arrive at an answer, it's still hard to pin down what they're actually doing internally: which specific thought processes are at work. Looking only at the sequence of words (tokens) doesn't give a clear picture of their reasoning structure, and it makes it hard to compare how different models approach the same problem.
What's the solution?
The researchers used a framework inspired by how humans solve problems, called Schoenfeld's Episode Theory. They built a system called ThinkARM (Anatomy of Reasoning in Models) that breaks a model's reasoning down into distinct steps such as analyzing the problem, exploring possible solutions, implementing a plan, and verifying the answer. By applying ThinkARM to different models solving math problems, they could identify common patterns in how reasoning-focused models think compared with non-reasoning models, and trace how these steps unfold over the course of a solution.
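To make the idea concrete, here is a minimal sketch (not the authors' implementation) of what tagging a reasoning trace with Schoenfeld-style episode labels could look like. The label names follow the steps described above; the keyword-based tagger, data structures, and example trace are hypothetical stand-ins for whatever classifier ThinkARM actually uses.

```python
from dataclasses import dataclass

# Episode labels drawn from Schoenfeld's Episode Theory, as named in the paper.
EPISODES = ["Analysis", "Explore", "Plan", "Implement", "Verify"]

@dataclass
class LabeledStep:
    episode: str  # one of EPISODES
    text: str     # the span of the reasoning trace covered by this step

def label_trace(sentences: list[str]) -> list[LabeledStep]:
    """Hypothetical keyword-based tagger: assigns each sentence of a reasoning
    trace to an episode. A real system would use an LLM or trained classifier;
    this heuristic only illustrates the output format."""
    keywords = {
        "Analysis": ["the problem asks", "we are given", "note that"],
        "Explore": ["what if", "alternatively", "let's try"],
        "Plan": ["the plan is", "first we will", "strategy"],
        "Implement": ["compute", "substitute", "solve for"],
        "Verify": ["check", "confirm", "does this satisfy"],
    }
    labeled = []
    for s in sentences:
        low = s.lower()
        episode = next(
            (ep for ep, kws in keywords.items() if any(k in low for k in kws)),
            "Implement",  # default bucket for unmatched sentences
        )
        labeled.append(LabeledStep(episode, s))
    return labeled

if __name__ == "__main__":
    trace = [
        "We are given a quadratic with integer roots.",
        "What if we factor it directly?",
        "Compute the discriminant and solve for x.",
        "Check that both roots satisfy the original equation.",
    ]
    for step in label_trace(trace):
        print(f"[{step.episode}] {step.text}")
```

The point is the output shape: once every span of a trace carries an episode label, traces from different models become directly comparable at the level of reasoning steps rather than raw tokens.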
Why it matters?
This work is important because it gives us a way to systematically study the inner workings of these powerful AI models. By making reasoning steps explicit, we can better understand how to improve models' reasoning abilities, identify weaknesses, and even see how techniques meant to make a model respond faster can affect its ability to check and evaluate its own work.
Abstract
Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
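As an illustration of the kind of episode-level analysis the abstract describes, the sketch below (hypothetical, not from the paper) estimates transition frequencies between episode labels. Comparing such transition matrices is one simple way to surface the "thinking dynamics" differences between reasoning and non-reasoning models; the function name and toy label sequences are assumptions for illustration only.

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

def transition_matrix(label_sequences: list[list[str]]) -> dict[tuple[str, str], float]:
    """Estimate P(next episode | current episode) from labeled traces.
    Each inner list is the sequence of episode labels for one model response."""
    counts: Counter = Counter()
    for seq in label_sequences:
        counts.update(pairwise(seq))
    totals: Counter = Counter()
    for (src, _), n in counts.items():
        totals[src] += n
    return {(src, dst): n / totals[src] for (src, dst), n in counts.items()}

# Toy comparison: a model with Explore/Verify loops vs. one that goes
# straight from reading the problem to implementation.
reasoning_model = [["Analysis", "Explore", "Implement", "Verify", "Explore", "Implement", "Verify"]]
direct_model = [["Analysis", "Implement"]]

print(transition_matrix(reasoning_model))
print(transition_matrix(direct_model))
```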