Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu

2025-08-07

Summary

This paper examines Chain-of-Thought (CoT) reasoning in large language models, a technique where these AI systems try to solve problems step by step by breaking complex questions into smaller parts. It finds that the models' ability to do this kind of reasoning is limited: it works mainly when the problems they face during testing resemble the examples they saw during training.

What's the problem?

The problem is that while Chain-of-Thought reasoning helps large language models answer tricky questions better, this improvement doesn't hold up when the test problems differ substantially from what the models saw before. This gap between the problems the models train on and the ones they actually have to solve makes their reasoning unreliable and fragile.

What's the solution?

The paper studies this issue through a data distribution lens, analyzing how differences between training and testing data explain why the reasoning breaks down. It shows that the success of Chain-of-Thought depends heavily on how similar new problems are to the training ones, meaning CoT is not a robust, general way for models to reason independently of their training data.
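The core idea can be illustrated with a toy sketch (this is not the authors' actual experimental setup, and the class name `LookupReasoner` is hypothetical): a "model" that only reproduces patterns from its training data looks competent on in-distribution problems but collapses when the test distribution shifts, here simulated by changing the number of digits in addition problems.

```python
import random

def make_problems(digit_len, n, seed):
    # Generate addition problems with operands of a fixed digit length.
    rng = random.Random(seed)
    lo, hi = 10 ** (digit_len - 1), 10 ** digit_len - 1
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n)]

class LookupReasoner:
    """Toy 'model' that only memorizes answers seen during training."""
    def __init__(self, train):
        self.memory = {p: p[0] + p[1] for p in train}

    def answer(self, problem):
        # Succeeds only if the problem matches the training data exactly.
        return self.memory.get(problem)

train = make_problems(digit_len=2, n=500, seed=0)
model = LookupReasoner(train)

in_dist = train[:100]                                 # same distribution as training
shifted = make_problems(digit_len=3, n=100, seed=1)   # longer operands: distribution shift

def accuracy(problems):
    return sum(model.answer(p) == p[0] + p[1] for p in problems) / len(problems)

print(accuracy(in_dist))  # perfect on problems drawn from the training set
print(accuracy(shifted))  # near zero once the distribution shifts
```

The sketch is deliberately extreme: a real LLM generalizes better than a lookup table, but the paper's point is that CoT performance degrades in the same direction as the train/test gap widens.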

Why it matters?

This matters because many people believe that Chain-of-Thought reasoning makes AI smarter and able to think like humans. But if it only works in some situations and fails when facing new kinds of problems, then relying on it could be misleading. Understanding its limits helps researchers develop better ways to improve AI reasoning in a more reliable and general way.

Abstract

CoT reasoning in LLMs is found to be limited by the distribution discrepancy between training and test data, suggesting it is not a robust form of reasoning.