Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

Ci-Siang Lin, Min-Hung Chen, I-Jieh Liu, Chien-Yi Wang, Sifei Liu, Yu-Chiang Frank Wang

2025-10-13

Summary

This paper focuses on a problem in computer vision called Referring Video Object Segmentation, which is about identifying and outlining a specific object in a video based on a text description. The research introduces a new approach called Tenet to make this process more efficient and accurate.

What's the problem?

Current methods for solving this problem usually require a lot of computational power and labeled data – specifically, detailed outlines of objects in many video frames. This makes it difficult to scale up these methods to handle larger videos or new objects. The core issue is how to effectively connect the text description to the correct object throughout the video without needing massive amounts of training data and processing.

What's the solution?

The researchers broke the problem down into three parts: understanding the text description (referring), tracking the object through the video (video), and actually outlining the object (segmentation). They focused on the first two parts by creating 'temporal prompts' – cues generated by off-the-shelf object detectors and trackers that tell a pre-trained image segmentation model *where* to look in each frame. Because these prompts aren't always reliable, and their raw confidence scores don't indicate which ones are good, they also developed 'Prompt Preference Learning' to estimate the quality of each prompt and select the best one. This lets them reuse powerful, already-trained image segmentation models without retraining them extensively for video.
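To make the pipeline concrete, here is a toy sketch of the generate-then-select idea described above. Every component is a stand-in: a real system would use actual object detectors/trackers and an image-based foundation segmentation model, and the function names, the box format, and the preference scorer below are all hypothetical simplifications, not the paper's implementation.

```python
# Toy sketch of a Tenet-style "temporal prompt generation and selection" loop.
# All components are stand-ins for real detectors, trackers, a learned
# preference scorer, and a promptable foundation segmenter.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TemporalPrompt:
    """A per-frame box prompt (x1, y1, x2, y2) for the referred object."""
    frame: int
    box: Tuple[int, int, int, int]
    detector_score: float  # raw confidence; per the paper, not a reliable quality signal


def generate_temporal_prompts(num_frames: int) -> List[List[TemporalPrompt]]:
    """Stand-in for detector + tracker: candidate prompts for each frame."""
    candidates = []
    for t in range(num_frames):
        candidates.append([
            # A temporally consistent track (correct object, modest confidence).
            TemporalPrompt(t, (10 + t, 10, 50 + t, 50), detector_score=0.80),
            # A distractor with *higher* raw confidence but the wrong object.
            TemporalPrompt(t, (100, 100, 140, 140), detector_score=0.95),
        ])
    return candidates


def preference_score(prompt: TemporalPrompt) -> float:
    """Stand-in for Prompt Preference Learning: a learned quality estimate
    that can disagree with the raw detector confidence."""
    # Pretend the learned scorer has identified the moving track as correct.
    return 1.0 if prompt.box[0] < 100 else 0.0


def select_prompts(candidates: List[List[TemporalPrompt]]) -> List[TemporalPrompt]:
    """Keep the highest-preference prompt per frame (not the highest-confidence one)."""
    return [max(frame_cands, key=preference_score) for frame_cands in candidates]


def segment_with_prompt(prompt: TemporalPrompt) -> str:
    """Stand-in for an image-based foundation segmenter that turns a box
    prompt into a mask; here it just returns a label string."""
    return f"mask(frame={prompt.frame}, box={prompt.box})"


if __name__ == "__main__":
    chosen = select_prompts(generate_temporal_prompts(num_frames=3))
    for mask in (segment_with_prompt(p) for p in chosen):
        print(mask)
```

The point of the toy scorer is the paper's key observation: the prompt with the highest detector confidence is not necessarily the right one, so a separately learned preference signal decides which prompt actually instructs the segmenter.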

Why it matters?

This work is important because it offers a more practical and efficient way to identify objects in videos based on text descriptions. By leveraging existing, powerful image segmentation models and focusing on how to adapt them to video, it reduces the need for huge datasets and extensive training, making the technology more accessible and scalable for real-world applications like video editing, robotics, and automated video analysis.

Abstract

Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations, which can be computationally expensive and less scalable. In this work, we rethink the RVOS problem and aim to investigate the key to this task. Based on existing foundation segmentation models, we decompose the RVOS task into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image-based foundation segmentation models to referring video object segmentation, we leverage off-the-shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. While high-quality temporal prompts can be produced, they cannot be easily identified from confidence scores. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By taking such prompts to instruct image-based foundation segmentation models, we can produce high-quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.