Deforming Videos to Masks: Flow Matching for Referring Video Segmentation
Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li, Sizhe Dang, Chengzu Li, Harry Yang, Guang Dai, Mengmeng Wang, Jingdong Wang
2025-10-08
Summary
This paper introduces a new method, called FlowRVS, for accurately identifying and outlining specific objects in videos based on a text description you give it. It's about teaching computers to 'understand' what you want segmented in a video just from your words.
What's the problem?
Existing methods for Referring Video Object Segmentation often break down the task into two steps: first finding the object generally, then precisely outlining it. This two-step process loses important information and struggles to keep the object consistently identified throughout the video, especially when the object is moving or changing appearance. The initial 'finding' step simplifies the description too much, leading to inaccuracies.
What's the solution?
FlowRVS takes a different approach. Instead of finding then segmenting, it treats the whole process as a continuous change or 'deformation' of the entire video. It uses powerful pre-trained models that already understand relationships between text and video, and then learns to directly transform the video into a precise outline of the object you described. It's a single-step process that keeps the language description connected to the segmentation throughout the video.
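The "continuous deformation" idea can be sketched as a toy flow-matching problem. The snippet below is a hypothetical illustration, not the paper's actual model: it stands in a "video" latent `x0` and a "mask" latent `x1` for what would really be learned video/mask representations, fits a linear velocity field in place of the paper's conditioned network, and omits the text conditioning entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical): a batch of "video" latents x0 and
# corresponding "mask" latents x1 produced by a fixed nonlinear map.
x0 = rng.normal(size=(256, 8))
x1 = np.tanh(x0 @ rng.normal(size=(8, 8)))

# Flow matching on the straight path x_t = (1 - t) x0 + t x1:
# along this path the target velocity is constant, v* = x1 - x0.
t = rng.uniform(size=(256, 1))
xt = (1 - t) * x0 + t * x1
v_target = x1 - x0

# Fit a linear velocity model v(x_t, t) ~ W [x_t, t, 1] by least squares
# (a real system would use a conditioned transformer or U-Net here).
feats = np.hstack([xt, t, np.ones((256, 1))])
W, *_ = np.linalg.lstsq(feats, v_target, rcond=None)

# Inference: Euler-integrate dx/dt = v(x, t) from the video latent (t = 0)
# toward the mask latent (t = 1) -- the "deformation" in miniature.
x = x0.copy()
steps = 20
for k in range(steps):
    tk = np.full((256, 1), k / steps)
    f = np.hstack([x, tk, np.ones((256, 1))])
    x = x + (1.0 / steps) * (f @ W)

print("mean |x - x1| after integration:", np.abs(x - x1).mean())
```

The key design point this mirrors from the paper: the flow starts from the video's representation rather than from pure noise, so the trajectory is a deformation of the video itself into the mask.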
Why it matters?
This new method achieves better results than previous approaches on standard tests, significantly improving the accuracy of identifying objects in videos based on text instructions. This is a big step forward for applications like video editing, robotics, and creating more intuitive ways to interact with video content, because it shows computers are getting better at understanding what we *mean* when we describe things in videos.
Abstract
Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and continuously segment them through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic 'locate-then-segment' pipeline. However, this cascaded design creates an information bottleneck by simplifying semantics into coarse geometric prompts (e.g., points), and struggles to maintain temporal consistency because the segmentation process is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained T2V models: fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of conventionally generating a mask from noise or directly predicting the mask, we reformulate the task by learning a direct, language-guided deformation from a video's holistic representation to its target mask. Our one-stage, generative approach achieves new state-of-the-art results across all major RVOS benchmarks. Specifically, it achieves a J&F of 51.1 on MeViS (+1.6 over the prior SOTA) and 73.3 on zero-shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.
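Given the abstract's description, one plausible way to write the training objective is the standard conditional flow-matching loss with the video representation as the source distribution. The symbols below are illustrative notation, not taken from the paper:

```latex
x_t = (1 - t)\, x_{\text{video}} + t\, x_{\text{mask}}, \qquad t \sim \mathcal{U}[0, 1]

\mathcal{L} = \mathbb{E}_{t,\, (x_{\text{video}},\, x_{\text{mask}},\, c_{\text{text}})}
\Big[ \big\| v_\theta(x_t, t, c_{\text{text}}) - (x_{\text{mask}} - x_{\text{video}}) \big\|^2 \Big]
```

At inference, integrating $dx/dt = v_\theta(x, t, c_{\text{text}})$ from $t = 0$ (the video's holistic representation) to $t = 1$ yields the mask, which is what makes the approach a single-stage, language-conditioned deformation rather than a locate-then-segment cascade.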