ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian, Xin Wang, Yanwei Fu
2025-12-08
Summary
This paper introduces a new method, ReVSeg, for identifying and tracking objects in videos based on instructions given in natural language. It focuses on making the model's 'thinking' process clearer and more effective.
What's the problem?
Current methods for video object segmentation struggle when the instructions require understanding *why* something is happening in the video, like understanding cause and effect or how things change over time. They often collapse these complex ideas into opaque internal representations (latent embeddings), making it hard to see *how* the model arrived at its answer and limiting its ability to handle nuanced instructions.
What's the solution?
ReVSeg breaks down the reasoning process into three distinct steps: first, understanding what the instruction means; second, finding the most relevant parts of the video over time; and third, pinpointing the object's location in each frame. It uses existing, powerful vision-language models but guides them through these steps sequentially, like a series of decisions. To improve performance, the system learns from its mistakes using a technique called reinforcement learning, essentially teaching itself to make better choices.
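The three-step decomposition described above can be sketched as a minimal pipeline. This is an illustrative toy, not the paper's implementation: the function names (`interpret_semantics`, `select_temporal_evidence`, `ground_spatially`) and the string-matching stubs are assumptions standing in for calls to a pretrained vision-language model.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the real inputs (an instruction and video frames).
@dataclass
class Query:
    text: str

@dataclass
class Video:
    frames: list  # toy: one caption string per frame; real systems use images

def interpret_semantics(query):
    # Step 1: interpret the instruction into a target description.
    # Stubbed: take the last word as the target noun; a real system
    # would prompt a VLM to reason about the query.
    return query.text.lower().split()[-1]

def select_temporal_evidence(video, target):
    # Step 2: select the frames most relevant to the target over time.
    # Stubbed: keep frames whose (string) content mentions the target.
    return [i for i, frame in enumerate(video.frames) if target in frame]

def ground_spatially(video, frame_ids):
    # Step 3: localize the object in each selected frame.
    # Stubbed: return a placeholder bounding box per selected frame.
    return {i: (0, 0, 1, 1) for i in frame_ids}

def revseg_style_pipeline(query, video):
    # Execute the three operations sequentially, each conditioning
    # on the output of the previous step (a chain of decisions).
    target = interpret_semantics(query)
    frame_ids = select_temporal_evidence(video, target)
    return ground_spatially(video, frame_ids)
```

For example, `revseg_style_pipeline(Query("find the cat"), Video(["a cat sits", "the cat jumps", "empty room"]))` grounds only the two frames that contain the target. The key design point is that each stage's output is an explicit, inspectable decision rather than a hidden embedding, which is what makes the reasoning chain interpretable.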
Why it matters?
This work is important because it improves the accuracy of video object segmentation, especially when the task requires understanding complex reasoning. More importantly, it makes the reasoning process transparent, allowing us to see *how* the computer is interpreting the instructions and making its decisions, which is crucial for building trustworthy and reliable AI systems.
Abstract
Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning with pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performance on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. The project page is available at https://clementine24.github.io/ReVSeg/ .
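The "outcome-driven signals" mentioned in the abstract can be illustrated with a standard mask-overlap reward: region similarity (intersection over union), the J metric used by video object segmentation benchmarks. The sketch below is an assumption about the general shape of such a reward, not the paper's exact formulation; masks are represented as sets of pixel coordinates for simplicity.

```python
def iou_reward(pred_mask, gold_mask):
    """Outcome-driven scalar reward: intersection-over-union between a
    predicted and a reference segmentation mask (1.0 = perfect overlap).

    Masks are sets of (row, col) pixel coordinates in this toy version;
    real systems would operate on binary mask arrays.
    """
    pred, gold = set(pred_mask), set(gold_mask)
    if not pred and not gold:
        # Both empty: the prediction trivially matches the reference.
        return 1.0
    return len(pred & gold) / len(pred | gold)
```

Because the reward is computed only from the final mask, it scores the whole multi-step reasoning chain at once; reinforcement learning then propagates this single outcome signal back to improve each intermediate decision, with no per-step supervision required.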