S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation
Leon Sick, Lukas Hoyer, Dominik Engel, Pedro Hermosilla, Timo Ropinski
2025-12-17
Summary
This paper introduces a new method for identifying different objects within videos, a process called video instance segmentation, without needing someone to manually label what each object is.
What's the problem?
Current methods for this task often rely on synthetic videos created by shifting and scaling still images, which doesn't accurately represent how things move in real-world videos. Real videos have complex movements, such as objects changing shape, perspectives shifting, and the camera itself moving, all of which are hard to simulate with simple image manipulation.
What's the solution?
The researchers developed a model that learns directly from real videos. They started with initial, somewhat noisy object segmentations for each frame and then used an understanding of how things *should* move – a 'motion prior' – to identify the most reliable segmentations, called 'keymasks'. These sparse keymasks were then used to 'teach' the model to fill in the gaps and produce consistent segmentations across the entire video, via a technique called 'Sparse-To-Dense Distillation' aided by a loss function called 'Temporal DropLoss', which only applies supervision on frames with reliable keymasks.
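The idea can be illustrated with a toy sketch. This is not the paper's implementation: the names `warp`, `keymask_indices`, and `temporal_droploss` are hypothetical, and the motion prior is stood in for by a simple integer pixel shift rather than the deep motion priors used in the paper. The sketch shows the two core steps: scoring each frame's mask by temporal consistency to pick keymasks, and computing a loss only on keymask frames so that noisy pseudo-labels are dropped from supervision.

```python
import numpy as np

def warp(mask, flow):
    # Toy motion prior: shift the binary mask by an integer (dy, dx) vector.
    # The paper instead uses learned (deep) motion priors.
    dy, dx = flow
    return np.roll(np.roll(mask, dy, axis=0), dx, axis=1)

def keymask_indices(masks, flows, thresh=0.7):
    """Score frame t by the IoU between its mask warped towards t+1 and
    the mask predicted at t+1; keep frames whose score clears `thresh`.
    Temporally consistent masks survive; noisy ones are filtered out."""
    keep = []
    for t in range(len(masks) - 1):
        warped = warp(masks[t], flows[t])
        inter = np.logical_and(warped, masks[t + 1]).sum()
        union = np.logical_or(warped, masks[t + 1]).sum()
        iou = inter / union if union else 1.0
        if iou >= thresh:
            keep.append(t)
    return keep

def temporal_droploss(pred_probs, pseudo_masks, key_idx):
    """Sparse supervision: average per-pixel binary cross-entropy only
    over frames that have a reliable keymask; the loss on all other
    frames is dropped, so noisy pseudo-labels never contribute."""
    eps = 1e-7
    losses = []
    for t in key_idx:
        p = np.clip(pred_probs[t], eps, 1 - eps)
        y = pseudo_masks[t].astype(float)
        losses.append(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())
    return float(np.mean(losses)) if losses else 0.0
```

A model trained this way is supervised only where the pseudo-labels are trustworthy, yet learns to predict dense, temporally consistent masks for every frame.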
Why does it matter?
This work is important because it allows for more accurate object identification in videos without the need for expensive and time-consuming manual labeling. By learning from real-world motion, the model performs better than existing methods and opens the door for more advanced video analysis applications.
Abstract
In recent years, the state-of-the-art in unsupervised video instance segmentation has heavily relied on synthetic video data, generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement by parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise and their quality varies through the video. Therefore, we establish temporal coherence by identifying high-quality keymasks in the video by leveraging deep motion priors. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense labelset, our approach outperforms the current state-of-the-art across various benchmarks.