
The Markovian Thinker

Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy

2025-10-09


Summary

This paper explores a new way to train large language models (LLMs) to think through complex problems step-by-step, a process called reasoning. It focuses on making this reasoning process more efficient, especially when dealing with very long chains of thought.

What's the problem?

Current methods for training LLMs to reason, like Reinforcement Learning with LongCoT, struggle with long reasoning chains. These setups treat the prompt plus every prior reasoning token as the state, so the model must attend to everything it has already thought, and the cost of attention grows quadratically as the chain of thought gets longer. That makes it prohibitively expensive to scale to really long, complex problems.
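As a rough back-of-the-envelope comparison (illustrative only, using a generic attention-cost model rather than the paper's own accounting): generating token t with the full history in context touches roughly t prior tokens, while a fixed chunk of size C caps that number.

\[
\text{LongCoT cost} \;\approx\; \sum_{t=1}^{n} t \;=\; \frac{n(n+1)}{2} \;=\; O(n^2),
\qquad
\text{chunked cost} \;\approx\; \sum_{t=1}^{n} \min(t, C) \;\le\; nC \;=\; O(n)\ \text{for fixed } C.
\]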

What's the solution?

The researchers propose a new approach called 'Markovian Thinking'. Instead of letting the context grow with the entire history of the model's reasoning, the thinking process is broken into fixed-size chunks. At the end of each chunk, the context is reset, and the model carries forward only a short textual 'state' summarizing its progress, which is used to start the next chunk. This is like taking notes after each section of a long book – you don't need to reread the whole book to understand the next section. They built an RL environment called 'Delethink' to implement this, and used Reinforcement Learning to train the model to write carryover states that are good enough for the reasoning to continue seamlessly after each reset; a simplified sketch of the resulting inference loop is shown below.
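As a rough illustration of the idea (not the authors' actual code), the inference-side loop might look like the sketch below; names such as generate, CHUNK_SIZE, CARRYOVER_CHARS, and the "FINAL ANSWER" stop convention are assumptions made for the example.

# Illustrative sketch of Markovian (chunked) reasoning at inference time.
# `generate` stands in for any LLM call that continues a prompt by up to
# `max_tokens` tokens; the constants are assumed, not the paper's exact setup.

from typing import Callable

CHUNK_SIZE = 8192        # reasoning tokens per chunk (the paper trains with 8K chunks)
CARRYOVER_CHARS = 2000   # crude character-level stand-in for a short token carryover (assumed)
MAX_CHUNKS = 3           # 3 x 8K = 24K total thinking budget, as in the paper's main setup


def markovian_think(question: str,
                    generate: Callable[[str, int], str]) -> str:
    """Reason in fixed-size chunks, keeping only a short carryover across resets."""
    carryover = ""  # the textual state the model writes near the end of each chunk
    for _ in range(MAX_CHUNKS):
        # The context is always bounded: the question plus a short carryover,
        # never the full reasoning history, so per-chunk cost stays constant.
        prompt = question if not carryover else f"{question}\n\n[carryover state]\n{carryover}"
        chunk = generate(prompt, CHUNK_SIZE)
        if "FINAL ANSWER" in chunk:  # assumed stop convention for this example
            return chunk
        # Keep only the tail of the chunk as the next carryover; through RL the
        # model learns to make this tail sufficient to continue the reasoning.
        carryover = chunk[-CARRYOVER_CHARS:]
    return carryover

Because the prompt fed to generate is always the question plus a bounded carryover, total compute grows linearly with the number of chunks rather than quadratically with total thinking length.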

Why it matters?

This work is important because it significantly reduces the computational cost of long-form reasoning in LLMs. By making the process more efficient, it allows models to tackle much more complex problems and think through longer chains of thought without requiring massive amounts of computing power. The researchers show that their method can match or surpass existing methods at a fraction of the cost (they estimate that at an average thinking length of 96K tokens, standard LongCoT-RL would cost about 27 H100-months versus roughly 7 for Delethink), paving the way for more scalable and practical reasoning AI.

Abstract

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.