Spotlight on Token Perception for Multimodal Reinforcement Learning
Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng
2025-10-14
Summary
This paper investigates how to improve the reasoning abilities of large vision-language models (LVLMs) using a technique called Reinforcement Learning with Verifiable Rewards (RLVR), but focuses on the often-overlooked aspect of how well the model 'sees' and understands images during this learning process.
What's the problem?
Current methods for improving LVLMs with RLVR don't pay enough attention to the visual side of the process: they treat every generated token and every reasoning trajectory the same, regardless of how much each actually relies on the image. The paper finds that when these models reason about images, only a small fraction of the tokens they generate depend strongly on the visual content. It also finds that different reasoning trajectories for the same problem vary widely in how much they draw on visual information, meaning some trajectories are far more grounded in the image than others.
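One natural way to make "how much a token relies on the image" concrete is to compare each token's log-probability with and without the image in the context: visually grounded tokens lose a lot of probability when the image is removed, while generic connective words barely change. The sketch below illustrates this idea and the sparsity the paper describes; the function names, threshold, and all numbers are illustrative assumptions, not the paper's exact measurement.

```python
# Hypothetical sketch: per-token "visual dependency" measured as the drop in a
# token's log-probability when the image is removed from the context. Tokens
# whose probability barely changes were not visually grounded. All numbers are
# illustrative, not taken from the paper.

def token_visual_dependency(logp_with_image, logp_without_image):
    """Per-token dependency score: how much the image raised each token's log-prob."""
    return [with_img - without_img
            for with_img, without_img in zip(logp_with_image, logp_without_image)]

def pivotal_tokens(scores, threshold=1.0):
    """Indices of tokens whose visual dependency exceeds a threshold (illustrative)."""
    return [i for i, s in enumerate(scores) if s > threshold]

# Illustrative rollout of 6 tokens: most are barely affected by removing the image.
logp_with = [-0.2, -1.1, -0.3, -4.0, -0.5, -0.9]
logp_without = [-0.3, -1.2, -0.4, -7.5, -0.6, -1.0]

scores = token_visual_dependency(logp_with, logp_without)
print(pivotal_tokens(scores))  # → [3]: only one token is strongly visually grounded
```

This matches the paper's first observation: token perception in a rollout is sparsely distributed, with only a handful of tokens carrying high visual dependency.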
What's the solution?
To address this, the researchers developed a new algorithm called Visually-Perceptive Policy Optimization (VPPO). This algorithm works by giving more weight to reasoning paths that strongly use visual information and by focusing the learning process on the specific words that are most connected to the image. Essentially, it helps the model learn to pay closer attention to the relevant parts of the image when making decisions.
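The dual mechanism described above can be sketched in a few lines: scale a trajectory's shared advantage by its overall visual dependency, and mask the policy update so only perceptually pivotal tokens contribute. This is a minimal illustration under assumed inputs (precomputed per-token dependency scores, a group-mean dependency for normalization, a fixed threshold); it is not the paper's exact formulation.

```python
# A minimal sketch of VPPO's dual mechanism, assuming per-token visual
# dependency scores are already computed. The normalization by the group mean
# and the threshold are illustrative assumptions, not the paper's formulas.

def vppo_token_weights(token_scores, group_mean_dep, threshold=1.0):
    """Combine (1) trajectory-level reweighting by overall visual dependency
    and (2) token-level selection of perceptually pivotal tokens."""
    traj_dep = sum(token_scores) / len(token_scores)
    # (1) Trajectories with above-average visual dependency get a larger weight.
    traj_weight = traj_dep / group_mean_dep if group_mean_dep > 0 else 1.0
    # (2) Only pivotal tokens receive a policy-gradient update; the rest are masked.
    mask = [1.0 if s > threshold else 0.0 for s in token_scores]
    return [traj_weight * m for m in mask]

def vppo_advantages(base_advantage, token_scores, group_mean_dep):
    """Per-token advantages: the shared trajectory advantage, reweighted and masked."""
    return [base_advantage * w
            for w in vppo_token_weights(token_scores, group_mean_dep)]

# Illustrative: a trajectory with advantage +1.0 and sparse dependency scores.
scores = [0.1, 0.1, 3.5, 0.1, 2.0]
advs = vppo_advantages(1.0, scores, group_mean_dep=1.0)
print(advs)  # non-pivotal tokens contribute nothing to the update
```

The effect is that the learning signal concentrates on visually grounded trajectories and, within them, on the few tokens that actually depend on the image.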
Why it matters?
This work is important because it provides a new, token-level way to analyze and improve how LVLMs reason about images. By explicitly accounting for visual perception during the learning process, the researchers significantly boosted performance on a suite of eight perception and reasoning benchmarks, with the gains holding at both 7B and 32B model scales.
Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.