Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
Zhenyu Zhang, Shujian Zhang, John Lambert, Wenxuan Zhou, Zhangyang Wang, Mingqing Chen, Andrew Hard, Rajiv Mathews, Lun Wang
2026-01-01
Summary
This paper investigates how large language models (LLMs) actually *think* when they're solving problems, focusing on the internal processes rather than just the final answer.
What's the problem?
Currently, understanding how LLMs reason is difficult. Existing methods rely on humans to define what 'good' reasoning looks like (things like 'reflecting' or 'overthinking') and then search for those patterns in the model's internal activations. This is limiting because there are likely many reasoning strategies we haven't even thought of, and it's hard to capture the nuances of reasoning just by looking at individual words.
What's the solution?
The researchers developed a new method called RISE, which stands for Reasoning behavior Interpretability via Sparse auto-Encoder. It automatically discovers 'reasoning vectors' inside the LLM: directions in the model's activation space that correspond to specific ways of thinking. They split the model's chain-of-thought into sentence-level steps, collect the activations for each step, and train a sparse auto-encoder on them to find recurring, disentangled patterns. This lets them identify behaviors like reflection or backtracking without predefining them. They can even steer the model by amplifying or suppressing these vectors during inference, changing how it reasons without any retraining.
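To make the core idea concrete, here is a minimal PyTorch sketch of the kind of sparse auto-encoder involved: an overcomplete ReLU encoder plus a linear decoder trained on step-level activations, whose decoder columns can be read off as candidate reasoning vectors. This is not the authors' code; the `train_sae` helper, the hyperparameters, and the placeholder activations are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): train a sparse auto-encoder on
# step-level activations, assuming one activation vector per sentence-level
# reasoning step has already been extracted from a chosen transformer layer.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class SparseAutoEncoder(nn.Module):
    """Single-layer SAE: overcomplete ReLU encoder + linear decoder.

    Each decoder column is a direction in activation space and can be
    inspected as a candidate 'reasoning vector'.
    """

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))  # sparse latent code
        x_hat = self.decoder(z)          # reconstruction of the activation
        return x_hat, z


def train_sae(step_activations: torch.Tensor,
              d_latent: int = 4096,
              l1_coef: float = 1e-3,
              epochs: int = 10,
              lr: float = 1e-3) -> SparseAutoEncoder:
    """step_activations: (num_steps, d_model) hidden states, one row per step."""
    d_model = step_activations.shape[-1]
    sae = SparseAutoEncoder(d_model, d_latent)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(step_activations), batch_size=256, shuffle=True)
    for _ in range(epochs):
        for (batch,) in loader:
            x_hat, z = sae(batch)
            # Reconstruction loss + L1 penalty that encourages sparse codes.
            loss = ((x_hat - batch) ** 2).mean() + l1_coef * z.abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return sae


# Hypothetical usage: real step activations would come from a chosen layer,
# e.g. averaged over the tokens of each chain-of-thought sentence.
# acts = torch.randn(10_000, 4096)          # placeholder, not real data
# sae = train_sae(acts)
# reasoning_vectors = sae.decoder.weight.T  # (d_latent, d_model) candidate directions
```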
Why it matters?
This work is important because it provides a way to peek inside the 'black box' of LLMs and understand *how* they arrive at their answers. It’s not just about whether the answer is correct, but about the reasoning process itself. This understanding could help us build more reliable, controllable, and even more intelligent AI systems, and potentially discover reasoning strategies we hadn't considered before.
Abstract
Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
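To illustrate the intervention idea mentioned in the abstract, here is a hedged sketch of one standard way to steer a model along a fixed direction at inference time: adding a scaled vector (for example, an SAE decoder column) to a layer's hidden states through a forward hook. The `add_steering_hook` helper, the module path `model.model.layers[layer_idx]`, and the scale value are assumptions for a HuggingFace-style LLaMA model, not the paper's exact procedure.

```python
# Minimal steering sketch: shift a decoder layer's hidden states along a
# fixed direction (e.g., one SAE decoder column) during generation.
import torch


def add_steering_hook(model, direction: torch.Tensor, layer_idx: int, scale: float):
    """Register a forward hook that adds `scale * direction` to the layer's
    output hidden states. A positive scale amplifies the associated behavior,
    a negative scale suppresses it; removing the hook restores the model."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Decoder layers typically return a tuple whose first element is the
        # hidden states of shape (batch, seq_len, d_model).
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    # Module path assumes a LLaMA-style transformers model; adjust as needed.
    return model.model.layers[layer_idx].register_forward_hook(hook)


# Hypothetical usage (names are placeholders, not from the paper):
# handle = add_steering_hook(model, reasoning_vectors[feature_id], layer_idx=20, scale=4.0)
# output = model.generate(**inputs, max_new_tokens=512)
# handle.remove()  # undo the intervention; no retraining involved
```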