
Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

Yunheng Li, Hangyi Kuang, Hengrui Zhang, Jiangxia Cao, Zhaojie Liu, Qibin Hou, Ming-Ming Cheng

2026-03-25

Summary

This paper focuses on improving how artificial intelligence systems 'think' when dealing with both images and text, specifically when they need to explain their reasoning step-by-step.

What's the problem?

Current AI methods that try to mimic human-like reasoning with images and text often treat each step of the reasoning process as equally important. However, some steps rely heavily on understanding the image itself (perceptual grounding), while others involve more general problem-solving (inference). Existing methods don't effectively distinguish between these different types of steps, leading to less effective reasoning.

What's the solution?

The researchers analyzed successful reasoning traces and found that they have a clear structure: steps grounded in the image are followed by steps that explore different possibilities. Building on this, they developed a new technique called Perception-Exploration Policy Optimization (PEPO). PEPO gives the AI extra credit when it is focusing on the image and again when it is exploring different ideas, guiding it to balance these two aspects of reasoning. It measures how similar the AI's internal state is to its previous states to estimate how much each step draws on the image, and uses the unpredictability (entropy) of its choices to encourage exploration.
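The two signals described above can be sketched in code. This is a minimal illustration, not the paper's implementation: all function names, the hidden-state similarity heuristic, the entropy normalization, and the sigmoid gate's exact form are assumptions for the sake of a runnable example.

```python
import math


def softmax_entropy(logits):
    """Shannon entropy of softmax(logits); a proxy for how 'exploratory' a token is."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)


def token_advantages(hidden_states, logits_per_token, base_advantage, gate_temp=1.0):
    """Hypothetical PEPO-style token weighting (a sketch, not the paper's code).

    hidden_states: list of per-token hidden-state vectors from the model.
    logits_per_token: list of per-token output logits.
    base_advantage: a sequence-level advantage (e.g. from GRPO).
    """
    T = len(hidden_states)
    # Perception prior: 1 - similarity to the previous hidden state.
    # A large shift in the hidden state is read as new (visual) information.
    perception = [0.0] + [
        1.0 - cosine(hidden_states[t], hidden_states[t - 1]) for t in range(1, T)
    ]
    # Exploration signal: per-token policy entropy, normalized to [0, 1].
    ent = [softmax_entropy(l) for l in logits_per_token]
    max_ent = max(max(ent), 1e-8)
    ent = [e / max_ent for e in ent]

    mean_p = sum(perception) / T
    advantages = []
    for t in range(T):
        # Smooth sigmoid gate blends the perception and exploration signals.
        gate = 1.0 / (1.0 + math.exp(-(perception[t] - mean_p) / gate_temp))
        weight = gate * perception[t] + (1.0 - gate) * ent[t]
        advantages.append(base_advantage * (1.0 + weight))
    return advantages
```

The key design point this sketch tries to capture is that both signals come for free from a normal forward pass (hidden states and logits), so no extra supervision or auxiliary model is needed.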

Why it matters?

This work is important because it makes AI systems that combine images and text much better at complex reasoning tasks like solving visual puzzles, understanding geometric relationships, and classifying images with limited examples. By improving the reasoning process, these AI systems can become more reliable and capable in real-world applications.

Abstract

Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT tokens uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks demonstrate consistent and robust improvements over strong RL baselines, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO
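The abstract notes that PEPO plugs into RLVR frameworks such as GRPO, which supply the sequence-level advantage that PEPO then modulates per token. For context, the standard GRPO group-normalized advantage can be sketched as follows (a minimal illustration; the function name and epsilon are assumptions):

```python
def grpo_advantages(rewards):
    """Group-normalized advantages as in GRPO: each sampled response's verifiable
    reward is standardized against the group of responses to the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # Epsilon guards against a zero-variance group (all rewards equal).
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Under PEPO, each response's scalar advantage from this step would then be reweighted token by token using the perception and entropy signals, rather than being broadcast uniformly across the whole chain of thought.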