
Demystifying Long Chain-of-Thought Reasoning in LLMs

Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue

2025-02-06


Summary

This paper examines how to make AI models better at working through complex problems step by step, a process called chain-of-thought reasoning. The researchers studied different training approaches that lead these models to produce longer, more detailed reasoning.

What's the problem?

AI models are getting better at solving complex problems, but it's not always clear how they arrive at their answers. Getting them to reason step by step is challenging, especially when the goal is longer, more detailed explanations, and it isn't well understood which training conditions make that ability emerge.

What's the solution?

The researchers trained AI models with a combination of supervised fine-tuning and reinforcement learning. They found that while supervised fine-tuning isn't strictly necessary, it makes training smoother and more efficient. They also found that carefully designed reward systems help, and that solutions scraped from the web, once filtered for quality, can be used to check the models' answers during training. This was especially useful for tough problems in science and math that differ from the training data.
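
To make the reward idea more concrete, here is a minimal sketch of a shaped reward for reinforcement learning on problems with checkable answers. The function, its coefficients, and the truncation heuristic are illustrative assumptions, not the paper's actual reward design.

```python
# Hypothetical sketch of a shaped reward for RL on problems with checkable answers.
# The paper's actual reward design differs; this only illustrates combining answer
# correctness with a term that discourages chains of thought from hitting the
# context limit.

def shaped_reward(predicted_answer: str,
                  gold_answer: str,
                  cot_length: int,
                  max_length: int = 4096,
                  truncation_penalty: float = 0.1) -> float:
    """Return a scalar reward for one sampled chain-of-thought."""
    correct = 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0

    # Penalize responses that run into the length cap, so the policy only grows
    # its chain of thought when the extra tokens actually improve correctness.
    penalty = truncation_penalty if cot_length >= max_length else 0.0

    return correct - penalty


if __name__ == "__main__":
    print(shaped_reward("42", "42", cot_length=1200))  # 1.0
    print(shaped_reward("41", "42", cot_length=4096))  # -0.1
```

A reward shaped this way gives the policy an incentive to lengthen its reasoning only when the extra steps actually pay off in correctness, which is one simple way to keep chain-of-thought length from growing or collapsing uncontrollably.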

Why it matters?

This research matters because it helps make AI more transparent and trustworthy. When AI can explain its reasoning clearly, it's easier for humans to understand and verify its decisions. This is especially important as AI is used more often in complex fields like science, medicine, and engineering. It also helps AI tackle new types of problems it hasn't seen before, making it more versatile and useful in real-world situations.

Abstract

Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
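
Finding (3) refers to filtering noisy, web-extracted solutions so they can serve as verifiable reward signals. The snippet below is a rough sketch of what such a filter could look like; the regular expression, heuristics, and function names are assumptions for illustration rather than the paper's actual pipeline.

```python
import re
from typing import Optional

# Rough sketch of filtering noisy, web-extracted solutions before using them as
# verifiable reward signals. The paper's real filtering mechanism may differ.

ANSWER_PATTERN = re.compile(r"(?:answer|Answer)\s*[:=]\s*(-?\d+(?:\.\d+)?)")


def extract_answer(solution_text: str) -> Optional[str]:
    """Pull a final numeric answer out of a scraped solution, if one exists."""
    match = ANSWER_PATTERN.search(solution_text)
    return match.group(1) if match else None


def keep_example(question: str, solution_text: str) -> bool:
    """Keep only examples whose answers can be checked automatically during RL."""
    answer = extract_answer(solution_text)
    if answer is None:
        return False  # no checkable answer, so the example cannot provide a reward
    if len(question.split()) < 5:
        return False  # very short questions are often scraping artifacts
    return True


if __name__ == "__main__":
    print(keep_example("What is 6 times 7?", "Work it out step by step. Answer: 42"))  # True
    print(keep_example("???", "see the attached image"))                               # False
```

Only the examples that survive such a filter would then be used during RL, where the extracted answer acts as the ground truth against which the model's final answer is checked.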