Latent Chain-of-Thought for Visual Reasoning
Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
2025-10-29
Summary
This paper focuses on improving how well large AI models that handle both images and language can 'think through' problems, a process called chain-of-thought reasoning.
What's the problem?
Current methods for teaching these models to reason, such as fine-tuning or reinforcement learning, often struggle when faced with new types of reasoning tasks. They also tend to rely too heavily on a single, potentially flawed scoring system (a reward model) that tells the model whether its reasoning is good or bad, leading to predictable or 'hacked' solutions instead of genuine understanding.
What's the solution?
The researchers treated reasoning as inferring hidden (latent) chains of thought, using a technique called variational inference to make that inference tractable. They developed a training method that encourages the model to explore *different* plausible reasoning paths, rather than collapsing onto the single path the reward system favors. This is done with a sparse reward that values diversity at the token level, and with an efficient way of ranking candidate rationales without trying out every possibility. The result is a model that is more exploratory and less reliant on a single 'right answer' supplied by a potentially biased reward system.
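The idea of rewarding diverse, high-likelihood reasoning paths can be illustrated with a toy sketch. Everything below (the candidate rationales, the overlap measure, the weighting `alpha`) is hypothetical and not the paper's actual implementation; it only shows how a reward might combine a likelihood term with a diversity term so that a duplicated rationale scores lower than a fresh one.

```python
def diversity_seeking_reward(candidate, pool, log_likelihood, alpha=0.5):
    """Toy reward: favor high-likelihood rationales that differ from
    rationales already in the sampled pool.

    - log_likelihood: the model's log p(rationale | input), assumed given.
    - diversity: 1 minus the maximum token-level Jaccard overlap with
      any rationale already in the pool (a stand-in similarity measure).
    """
    tokens = set(candidate.split())
    if pool:
        overlap = max(
            len(tokens & set(p.split())) / len(tokens | set(p.split()))
            for p in pool
        )
    else:
        overlap = 0.0
    diversity = 1.0 - overlap
    return log_likelihood + alpha * diversity

# Hypothetical candidates with made-up log-likelihoods: the duplicate
# of a pooled rationale gets no diversity bonus, the novel one does.
pool = ["count the red shapes then compare"]
r_dup = diversity_seeking_reward("count the red shapes then compare", pool, -1.0)
r_new = diversity_seeking_reward("estimate areas and match colors", pool, -1.2)
print(r_dup, r_new)  # → -1.0 -0.7 (the novel rationale outscores the duplicate)
```

The design choice to add (rather than multiply in) the diversity term keeps the reward sparse and simple; the paper's actual token-level signal is more involved.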
Why it matters?
This work is important because it makes these AI models more reliable and adaptable. By improving their reasoning abilities and reducing their dependence on biased rewards, we can trust them to solve a wider range of problems and understand *why* they arrived at a particular answer, which is crucial for real-world applications.
Abstract
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
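The inference-scaling idea from the abstract, ranking answers by a marginal likelihood rather than keeping only a single best chain, can be sketched in a few lines. The sampled rationales and log-probabilities below are invented for illustration; the sketch only shows the marginalization step: aggregate log p(rationale, answer) over all sampled rationales that lead to each answer, then rank answers by that total.

```python
import math
from collections import defaultdict

def rank_by_marginal_likelihood(samples):
    """samples: list of (rationale, answer, logp), where logp is the
    model's joint log-probability of the rationale and answer.

    Returns answers ranked by log sum_r p(rationale_r, answer),
    i.e., marginalizing over the latent chains of thought.
    """
    per_answer = defaultdict(list)
    for _rationale, answer, logp in samples:
        per_answer[answer].append(logp)
    # log-sum-exp combines the chains supporting each answer stably
    scores = {
        answer: max(lps) + math.log(sum(math.exp(lp - max(lps)) for lp in lps))
        for answer, lps in per_answer.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sampled chains: two mediocre chains agree on "4", one
# stronger chain says "5"; marginalization favors the consensus answer.
samples = [
    ("chain A", "4", -2.0),
    ("chain B", "4", -2.1),
    ("chain C", "5", -1.6),
]
print(rank_by_marginal_likelihood(samples))  # → ['4', '5']
```

Unlike Best-of-N, which would pick "5" (the single highest-scoring chain), marginalizing pools the evidence from every chain that reaches the same answer.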