Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck
Fabio Valerio Massoli, Andrey Kuzmin, Arash Behboodi
2026-03-23
Summary
This paper tackles a core tension in large language models (LLMs): 'thinking out loud' via Chain-of-Thought prompting makes them more accurate but also more expensive. It proposes a principled way to make LLM reasoning more efficient without sacrificing accuracy.
What's the problem?
When LLMs use Chain-of-Thought, they generate detailed reasoning steps before arriving at an answer. While this improves accuracy, it also increases compute cost and response time because of the extra tokens produced. Previous attempts to reduce this cost, such as penalizing long outputs, tend to cut essential reasoning steps along with unnecessary filler, hurting performance. The deeper issue is that standard compression frameworks assume each step depends only on the one before it (a Markov chain), but a transformer's attention lets every generated token look back at the entire prompt, breaking that assumption.
What's the solution?
The researchers treat efficient reasoning as a compression problem: squeeze the reasoning trace down to its essential parts. They use the Conditional Information Bottleneck, which requires the reasoning steps to carry only the information *needed* for the final answer that is not already present in the original question. They train the LLM with a reinforcement learning objective that rewards both accuracy and conciseness. Instead of simply counting tokens, their cost measures how 'surprising' each token is to a prior language model, so predictable filler is cheap to keep but cheap to cut, while genuinely informative tokens are preserved.
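The surprisal-based cost described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the scalar weight `beta`, and the example log-probabilities are all ours, and we assume access to per-token log-probabilities from some prior language model.

```python
import math

def surprisal_cost(token_logprobs):
    """Total surprisal (in nats) of a reasoning trace under a prior LM.

    token_logprobs: per-token log-probabilities assigned by the prior.
    A low-probability (surprising) token costs more than predictable
    filler, unlike a plain token count, which charges every token equally.
    """
    return -sum(token_logprobs)

def cib_style_reward(task_reward, token_logprobs, beta=0.01):
    """Hypothetical RL reward: task success minus a compression penalty.

    Under a uniform prior (every token equally likely over a vocabulary
    of size V), the penalty collapses to beta * log(V) * num_tokens,
    i.e., an ordinary length penalty.
    """
    return task_reward - beta * surprisal_cost(token_logprobs)

# Illustrative values: a correct answer (reward 1.0) with a 3-token trace
# whose tokens the prior assigns probabilities 0.9, 0.5, and 0.1.
logprobs = [math.log(0.9), math.log(0.5), math.log(0.1)]
reward = cib_style_reward(task_reward=1.0, token_logprobs=logprobs)
```

Note how the third token (probability 0.1 under the prior) dominates the cost: a purely length-based penalty would charge all three tokens the same, while the surprisal cost concentrates pressure on tokens the prior finds unlikely.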
Why it matters?
This work is important because it offers a way to make powerful LLMs more practical and affordable. By improving efficiency without significantly impacting accuracy, it opens the door to using these models in more real-world applications where cost and speed are critical. It provides a more intelligent approach to reducing LLM costs than simply limiting the length of their responses, and could lead to faster, cheaper, and more accessible AI.
Abstract
Chain-of-Thought (CoT) prompting improves LLM accuracy on complex tasks but often increases token usage and inference cost. Existing "Budget Forcing" methods, which reduce cost via fine-tuning with heuristic length penalties, suppress both essential reasoning and redundant filler. We recast efficient reasoning as a lossy compression problem under the Information Bottleneck (IB) principle, and identify a key theoretical gap when applying naive IB to transformers: attention violates the Markov property between prompt, reasoning trace, and response. To resolve this issue, we model CoT generation under the Conditional Information Bottleneck (CIB) principle, where the reasoning trace Z acts as a computational bridge that contains only the information about the response Y that is not directly accessible from the prompt X. This yields a general Reinforcement Learning objective: maximize task reward while compressing completions under a prior over reasoning traces, subsuming common heuristics (e.g., length penalties) as special cases (e.g., uniform priors). In contrast to naive token-counting-based approaches, we introduce a semantic prior that measures token cost by surprisal under a language model prior. Empirically, our CIB objective prunes cognitive bloat while preserving fluency and logic, improving accuracy at moderate compression and enabling aggressive compression with minimal accuracy drop.
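The objective stated in the abstract can be sketched as follows. The notation is our illustration, not the paper's: $\pi_\theta$ is the reasoning policy, $R$ the task reward, $\beta$ a compression weight, and $p_{\text{prior}}$ the prior over reasoning traces.

```latex
\max_{\theta} \;
\mathbb{E}_{(z, y) \sim \pi_\theta(\cdot \mid x)}
\Big[ R(x, y) + \beta \, \log p_{\text{prior}}(z \mid x) \Big]
```

Under a uniform prior over a vocabulary of size $V$, the second term becomes $-\beta \, |z| \log V$, so a plain length penalty falls out as the special case the abstract mentions; a language-model prior instead charges each token by its surprisal.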