Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
Rujiao Long, Yang Li, Xingyao Zhang, Weixun Wang, Tianqianjin Lin, Xi Zhao, Yuchi Xu, Wenbo Su, Junchi Yan, Bo Zheng
2025-12-23
Summary
This paper introduces a new technique called Reasoning Palette to improve how large language models, especially those that can process images and text, think through problems and learn from feedback.
What's the problem?
Large language models sometimes struggle with complex reasoning because they get stuck repeating similar thought processes, lacking diversity in how they approach a question. When trying to *learn* how to reason better through a process called reinforcement learning, they can waste time exploring unproductive strategies, making training slow and inefficient.
What's the solution?
Reasoning Palette adds a hidden 'strategy selector' to the model. Before generating an answer, the model samples a latent variable, where each sample represents a different reasoning approach. This variable is produced by a variational autoencoder (VAE), which learns to encode different reasoning styles from question-answer examples. The sampled strategy then influences how the model generates its response, essentially guiding its thought process. A brief extra training phase helps the model learn to use these strategies effectively, and during reinforcement learning, the latent lets the model deliberately try out different reasoning modes.
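The inference-time mechanism can be illustrated with a minimal numpy sketch. Everything here is a hypothetical stand-in: the dimensions, the fixed linear "decoder" `W_dec`, and the function names are illustrative assumptions, not the paper's actual implementation (which uses a learned decoder inside a trained model). The point is only the flow: sample a latent from the prior, decode it into soft prefix tokens, and prepend those to the prompt embeddings so different samples yield differently modulated inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen for illustration only.
LATENT_DIM = 16        # size of the sampled reasoning latent z
EMBED_DIM = 32         # model embedding width
NUM_PREFIX_TOKENS = 4  # soft-prefix tokens the latent is decoded into

# Toy "decoder": a fixed random linear map from latent to prefix embeddings.
# In the actual method this would be a learned module.
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM, NUM_PREFIX_TOKENS * EMBED_DIM))

def sample_latent(n=1):
    """Sample n reasoning latents from a standard-normal prior."""
    return rng.normal(size=(n, LATENT_DIM))

def latent_to_prefix(z):
    """Decode latents into soft prefix token embeddings."""
    return (z @ W_dec).reshape(-1, NUM_PREFIX_TOKENS, EMBED_DIM)

def prepend_prefix(prefixes, prompt_embeddings):
    """Prepend each decoded prefix to a copy of the prompt's embeddings."""
    repeated = np.repeat(prompt_embeddings[None, :, :], len(prefixes), axis=0)
    return np.concatenate([prefixes, repeated], axis=1)

# A toy 10-token prompt; sample 3 latents -> 3 distinct "reasoning modes".
prompt = rng.normal(size=(10, EMBED_DIM))
z = sample_latent(3)
inputs = prepend_prefix(latent_to_prefix(z), prompt)
print(inputs.shape)  # (3, 14, 32): three modulated copies of the same prompt
```

Because each of the three inputs differs only in its four prefix tokens, the same question can be answered under three internally sampled strategies, which is what gives RL training its structured exploration.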
Why it matters?
This method gives us more control over *how* the model reasons, making its thinking more understandable and reliable. It also makes the learning process much faster and more effective, leading to better performance on challenging reasoning tasks. Ultimately, it helps these powerful models become better problem-solvers.
Abstract
Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection of diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable modulation of the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.
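The abstract's training-side pipeline (mean-pool a question-answer pair's embeddings, infer a Gaussian posterior, sample via the reparameterization trick) can be sketched as follows. This is a toy illustration under stated assumptions: the linear heads `W_mu` and `W_logvar`, the dimensions, and the function names are placeholders for learned components, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

EMBED_DIM = 32   # hypothetical hidden size of the QA-pair embeddings
LATENT_DIM = 16  # hypothetical size of the reasoning latent

# Toy encoder heads mapping a pooled embedding to Gaussian parameters.
# In the method these would be learned; here they are random placeholders.
W_mu = rng.normal(scale=0.1, size=(EMBED_DIM, LATENT_DIM))
W_logvar = rng.normal(scale=0.1, size=(EMBED_DIM, LATENT_DIM))

def encode(qa_token_embeddings):
    """Mean-pool the QA pair's token embeddings, then map to (mu, log-variance)."""
    pooled = qa_token_embeddings.mean(axis=0)
    return pooled @ W_mu, pooled @ W_logvar

def reparameterize(mu, logvar):
    """Standard VAE reparameterization: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

qa = rng.normal(size=(20, EMBED_DIM))  # a toy 20-token question-answer pair
mu, logvar = encode(qa)
z = reparameterize(mu, logvar)
print(z.shape)  # (16,)
```

Sampling `z` through the reparameterization trick keeps the pathway differentiable, so the encoder can be trained end-to-end with the usual VAE objective; at inference time one would instead draw `z` from the prior, since no answer is available yet.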