Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training

Kaixuan Lu, Mehmet Onurcan Kaya, Dim P. Papadopoulos

2025-12-10

Summary

This paper introduces AutoQ-VIS, a new method for automatically identifying and tracking individual objects in videos without requiring humans to manually label the data. This task is known in computer vision as video instance segmentation.

What's the problem?

Normally, teaching a computer to do video instance segmentation requires a lot of painstaking manual work: someone has to draw an outline around each object in every frame of a video and keep those outlines consistent as the objects move. Existing methods that avoid this manual labeling train on synthetic, computer-generated videos, but models trained this way often transfer poorly to real-world videos because of differences in appearance and motion.

What's the solution?

The researchers created a system where the computer learns from its own predictions. It first trains on the synthetic videos, then uses that knowledge to make predictions on real videos. Crucially, it also automatically assesses the *quality* of its own predictions and retrains only on the good ones, creating a cycle of learning and refinement. This 'quality-guided self-training' lets the model progressively adapt from synthetic data to real-world videos.
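The cycle described above can be sketched in a few lines of Python. This is a minimal, illustrative toy, not the paper's implementation: the models, the quality scorer, and the threshold are all hypothetical stand-ins that just show the shape of the loop (predict, score, filter, retrain).

```python
# Toy sketch of quality-guided self-training (illustrative only; the paper's
# actual segmentation model, quality assessor, and thresholds are not shown).

def generate_pseudo_labels(model_skill, videos):
    """Hypothetical stand-in: pseudo-label quality grows with model skill."""
    return [(v, min(1.0, 0.4 + 0.1 * model_skill + 0.05 * v)) for v in videos]

def self_train(rounds=3, quality_threshold=0.6):
    skill = 0          # model starts from training on synthetic videos only
    videos = range(5)  # stand-ins for unlabeled real-world videos
    for _ in range(rounds):
        pseudo = generate_pseudo_labels(skill, videos)
        # Automatic quality assessment: keep only confident pseudo-labels.
        kept = [p for p in pseudo if p[1] >= quality_threshold]
        # Retraining on the kept pseudo-labels improves the model,
        # which in turn yields better pseudo-labels next round.
        skill += len(kept)
    return skill

print(self_train())  # → 9: each round keeps more pseudo-labels than the last
```

The key design point the sketch captures is the closed loop: filtering by an automatic quality score means early rounds train on few, reliable pseudo-labels, and as the model improves, more real-video predictions pass the threshold.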

Why it matters?

This work is important because it significantly improves the accuracy of unsupervised video instance segmentation, achieving better results than previous methods without any human labeling. This means we can potentially build systems that understand and analyze videos more effectively, without the huge cost and effort of manual annotation, opening doors for applications like self-driving cars and video surveillance.

Abstract

Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 AP50 on the YouTubeVIS-2019 val set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.