
Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks

Abhranil Chandra, Ayush Agrawal, Arian Hosseini, Sebastian Fischmeister, Rishabh Agarwal, Navin Goyal, Aaron Courville

2025-12-30


Summary

This paper explores a surprising way to improve how well language models can reason. It turns out that training these models on examples of *how* to think, even if those examples ultimately lead to the wrong answer, can actually make them better at solving problems.

What's the problem?

Language models are getting better at many things, but they still struggle with complex reasoning tasks. Traditionally, researchers have tried to improve reasoning by giving models examples of questions *and* the correct answers, often created by humans. However, getting enough high-quality, human-labeled data is expensive and time-consuming. The question is, can we find a more efficient way to teach models to reason effectively?

What's the solution?

The researchers found that training models on reasoning examples generated by *other* language models works surprisingly well, even when those examples contain mistakes. They attribute this to distribution: the way another language model 'thinks' is closer to how the model being trained 'thinks' than the way a human writes out reasoning. They also discovered that flawed reasoning traces can still be useful, as long as some of the intermediate steps are correct. To probe the distribution hypothesis directly, they rewrote human-annotated reasoning examples so they read more like a language model's output, and found that this alone improved performance. They tested the approach on different models and different types of reasoning problems, including math, coding, and logic puzzles.
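In data-curation terms, the recipe amounts to turning model-generated chain-of-thought traces into fine-tuning pairs *without* filtering out the ones whose final answer is wrong. The sketch below is a minimal illustration of that idea, not the authors' actual code; the function name, record fields, and toy traces are all hypothetical:

```python
# Hypothetical sketch: build supervised fine-tuning (SFT) pairs from
# chain-of-thought (CoT) traces produced by a more capable model.
# Crucially, traces are kept even when their final answer is wrong --
# the paper's finding is that such traces still help, because their
# distribution matches the student model's and their intermediate
# steps are often partially valid.

def build_sft_pairs(traces):
    """Turn (question, cot, final_answer, gold_answer) records into
    prompt/completion pairs. Note the absence of any correctness
    filter against gold_answer."""
    pairs = []
    for t in traces:
        prompt = f"Question: {t['question']}\nLet's think step by step."
        completion = f"{t['cot']}\nAnswer: {t['final_answer']}"
        pairs.append({"prompt": prompt, "completion": completion})
    return pairs

# Toy example: one fully correct trace, and one whose intermediate
# steps are sound but whose final answer is wrong. Both are kept.
traces = [
    {"question": "What is 12 * 7?",
     "cot": "12 * 7 = 12 * 5 + 12 * 2 = 60 + 24 = 84.",
     "final_answer": "84", "gold_answer": "84"},
    {"question": "What is 13 * 7?",
     "cot": "13 * 7 = 13 * 5 + 13 * 2 = 65 + 26 = 91.",
     "final_answer": "90", "gold_answer": "91"},  # wrong final answer
]

pairs = build_sft_pairs(traces)
print(len(pairs))  # both traces become training pairs
```

A real pipeline would feed these pairs to a standard SFT trainer; the point of the sketch is only that the selection criterion is distributional (who generated the trace), not final-answer correctness.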

Why it matters?

This research suggests that we don't always need perfect, human-created data to improve a language model's reasoning abilities. Using synthetic data – data created by other AI systems – can be a powerful and cost-effective alternative. It also highlights that the *process* of reasoning is important, and a correct final answer doesn't guarantee that the reasoning was sound. This could change how we create training datasets for AI in the future, focusing more on mimicking the model's internal thought processes.

Abstract

We present the surprising finding that a language model's reasoning capabilities can be improved by training on synthetic datasets of chain-of-thought (CoT) traces from more capable models, even when all of those traces lead to an incorrect final answer. Our experiments show this approach can yield better performance on reasoning tasks than training on human-annotated datasets. We hypothesize that two key factors explain this phenomenon: first, the distribution of synthetic data is inherently closer to the language model's own distribution, making it more amenable to learning. Second, these 'incorrect' traces are often only partially flawed and contain valid reasoning steps from which the model can learn. To further test the first hypothesis, we use a language model to paraphrase human-annotated traces -- shifting their distribution closer to the model's own distribution -- and show that this improves performance. For the second hypothesis, we introduce increasingly flawed CoT traces and study to what extent models are tolerant to these flaws. We demonstrate our findings across various reasoning domains like math, algorithmic reasoning and code generation using MATH, GSM8K, Countdown and MBPP datasets on various language models ranging from 1.5B to 9B across Qwen, Llama, and Gemma models. Our study shows that curating datasets that are closer to the model's distribution is a critical aspect to consider. We also show that a correct final answer is not always a reliable indicator of a faithful reasoning process.