Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs
Hongyuan Yuan, Xinran He, Run Shao, Bolei He, Xianwei Xue, Mengke Chen, Qiutong Pan, Haiwei Wang, Haifeng Li
2026-04-09
Summary
This paper focuses on making large language models (LLMs) reason more efficiently. When you try to teach these models to 'think step-by-step' (a technique called Chain-of-Thought reasoning), they sometimes get stuck repeating themselves or checking things that don't really matter, making their reasoning process unnecessarily long.
What's the problem?
The issue is that when LLMs are trained with reinforcement learning to reason well, the reward signal is sparse, which encourages 'overthinking'. This overthinking shows up in two main ways: the model performs broad, low-impact checks throughout its reasoning (Indiscriminate Reflection), and it keeps re-confirming conclusions it has already established (Repetitive Reflection). In both cases, the model wastes steps and becomes less efficient.
What's the solution?
The researchers developed a method to streamline the reasoning process. They represent the step-by-step reasoning as a map of connected ideas (a directed acyclic graph), where each idea points to the ideas it depends on. Then they 'prune' this map in two ways: branch-level pruning removes reflection branches that contribute little to the final answer, and depth-level pruning cuts out late-stage re-checking of already established facts. Finally, they trained the model in three stages: first to imitate the pruned reasoning paths, then to prefer correct *and* concise answers, and finally to balance accuracy against length.
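To make the two pruning rules concrete, here is a minimal sketch over a linearized reasoning trace rather than a full dependency graph. All names, thresholds, and the per-step 'score' (a contribution estimate) are illustrative assumptions, not the paper's actual implementation.

```python
def prune_trace(steps, score_threshold=0.2, late_frac=0.8):
    """Dual pruning over a linearized reasoning trace (toy sketch).

    steps: list of dicts with keys
      'id'     - step identifier
      'kind'   - 'derive' (advances the solution) or 'reflect' (a check)
      'target' - for reflect steps, the claim being verified
      'score'  - assumed contribution estimate in [0, 1]
    Returns the ids of the steps kept after pruning.
    """
    n = len(steps)
    verified = set()  # claims that have already been checked once
    kept = []
    for i, s in enumerate(steps):
        if s['kind'] == 'derive':
            kept.append(s['id'])  # derivation steps are always kept
            continue
        # Branch-level pruning: drop low-impact reflection steps.
        if s['score'] < score_threshold:
            continue
        # Depth-level pruning: drop repeated or late-stage re-verification.
        if s['target'] in verified or i >= late_frac * n:
            continue
        verified.add(s['target'])
        kept.append(s['id'])
    return kept
```

For example, a trace containing two checks of the same claim keeps only the first, and a check appearing in the last 20% of the trace is dropped even if its score is high.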
Why it matters?
This work is important because it makes LLMs more practical. By reducing the amount of text the model generates while maintaining or improving its accuracy, it becomes faster and cheaper to use. This is a big step towards deploying these powerful models in real-world applications where efficiency is key.
Abstract
Extending CoT through RL has been widely used to enhance the reasoning capabilities of LLMs. However, due to the sparsity of reward signals, it can also induce undesirable thinking patterns such as overthinking, i.e., generating redundant intermediate reasoning content. In this work, we argue that a major source of such redundancy is inefficient reflection, which often manifests in two problematic patterns: Indiscriminate Reflection, where the model performs broad, low-impact checks throughout reasoning, and Repetitive Reflection, where it repeatedly re-verifies an already established conclusion. To address this, we introduce a graph-based CoT optimization framework. Specifically, we convert each linear CoT into a directed acyclic graph (DAG) with explicit dependency edges, and design a dual pruning strategy: branch-level pruning removes weakly contributing reflection branches, while depth-level pruning eliminates late-stage re-verification. We distill this behavior via a three-stage pipeline: (1) SFT to initialize the policy on pruned concise traces, (2) DPO to prefer correct but less redundant trajectories, and (3) GRPO with length penalty to jointly optimize answer correctness and efficiency. Experiments show that our approach reduces the average reasoning tokens by 42% while maintaining or improving accuracy.
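The third stage combines a correctness reward with a length penalty. A minimal sketch of one such reward is below; the penalty form, coefficient, and token budget are assumptions for illustration, since the abstract does not specify the exact shaping.

```python
def length_penalized_reward(correct, num_tokens, max_tokens=4096, alpha=0.5):
    """Correctness reward minus a penalty that grows with trace length.

    correct:    whether the final answer is right
    num_tokens: tokens in the generated reasoning trace
    max_tokens: assumed token budget at which the penalty saturates
    alpha:      assumed weight on the length penalty
    """
    base = 1.0 if correct else 0.0
    penalty = alpha * min(num_tokens / max_tokens, 1.0)
    return base - penalty
```

Under this shaping, a correct short trace scores close to 1.0, a correct trace at the token budget scores 0.5, and an incorrect long trace is pushed below zero, so the policy is steered toward answers that are both right and concise.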