Autoregressive Universal Video Segmentation Model
Miran Heo, Sukjun Hwang, Min-Hung Chen, Yu-Chiang Frank Wang, Albert Gu, Seon Joo Kim, Ryo Hachiuma
2025-08-27
Summary
This paper introduces a new model, called AUSM, for identifying and tracking objects in videos. It aims to be a single system that handles both prompted segmentation, where you tell it what to look for, and unprompted segmentation, where it must find every object on its own.
What's the problem?
Currently, video object segmentation is split into different approaches depending on whether you give the system hints (prompts) about which objects to follow, or ask it to find all objects without any help. In practice, the two settings are handled by separate, specialized models and pipelines, and systems that attempt both rarely do both well. Detecting and tracking every object in a streaming video, without being told what to look for, is particularly challenging.
What's the solution?
The researchers treat tracking objects through a video as predicting the object outlines (masks) frame by frame, similar to how language models predict the next word in a sentence. They built AUSM on recent state-space models, which let it keep a fixed-size memory of the video seen so far, so the memory it needs does not grow as the video gets longer. Importantly, every part of AUSM is designed so that training can process many frames in parallel, making it much faster to train than methods that must step through frames one at a time.
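Below is a minimal, hypothetical sketch of the streaming idea described above: each frame updates a fixed-size spatial state through a simple diagonal linear recurrence, and a mask is read out from that state. This is only a conceptual illustration, not AUSM's actual architecture; all names, shapes, and the encoder are made up, and the prompted case is modeled by seeding the state with a prompt mask.

```python
import numpy as np

# Toy illustration (not the paper's code): streaming mask prediction with a
# fixed-size recurrent state, in the spirit of a state-space model.
H, W, D = 32, 32, 8              # spatial grid and state/channel width (hypothetical)
rng = np.random.default_rng(0)

# Per-location recurrence parameters (a diagonal linear state-space update).
A = 0.9 * np.ones((H, W, D))     # state decay (|A| < 1 keeps the state bounded)
B = rng.normal(size=(H, W, D))   # how the current frame writes into the state
C = rng.normal(size=(H, W, D))   # how the state is read out into mask logits

def encode_frame(frame):
    """Stand-in for a frame encoder producing per-pixel features."""
    return np.tanh(frame)

def segment_stream(frames, prompt_mask=None):
    """Predict one mask per frame while carrying a fixed-size spatial state.

    If `prompt_mask` is given, it seeds the state (prompted segmentation);
    otherwise the model must discover objects on its own (unprompted).
    """
    state = np.zeros((H, W, D))           # fixed size: does not grow with video length
    if prompt_mask is not None:
        state += prompt_mask[..., None]   # inject the prompt as the initial state
    masks = []
    for frame in frames:                  # autoregressive over frames
        x = encode_frame(frame)
        state = A * state + B * x         # update the memory with the new frame
        logits = (C * state).sum(-1)      # read out mask logits for this frame
        masks.append(logits > 0)
    return masks

frames = rng.normal(size=(5, H, W, D))       # a 5-frame toy "video"
print(len(segment_stream(frames)), "masks")  # one mask per frame, constant memory
```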
Why it matters?
This work is important because it provides a single, unified model that can perform both prompted and unprompted video segmentation effectively. It's also faster to train than previous methods, making it more practical for real-world applications like self-driving cars or video editing where you need to automatically understand what's happening in a video.
Abstract
Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today's landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.
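The abstract's claim that all components allow parallel training across frames rests on the structure of linear recurrences like those used in state-space models. The sketch below is a generic illustration, not the paper's code: a recurrence s_t = a * s_{t-1} + b_t unrolls into a sum that can be evaluated for every frame at once, so training does not have to loop frame by frame.

```python
import numpy as np

# Minimal sketch (not AUSM's implementation): why a linear recurrence can be
# trained in parallel across frames. Unrolling s_t = a * s_{t-1} + b_t gives
# s_t = sum_{k <= t} a^(t-k) * b_k, computable for all t at once.
rng = np.random.default_rng(0)
T, D = 16, 4                      # frames, state width (hypothetical sizes)
a = 0.9
b = rng.normal(size=(T, D))       # per-frame inputs to the recurrence

# Iterative (streaming) form: one step per frame.
s = np.zeros(D)
iterative = []
for t in range(T):
    s = a * s + b[t]
    iterative.append(s.copy())
iterative = np.stack(iterative)

# Parallel form: all frames at once via the unrolled sum.
powers = a ** (np.arange(T)[:, None] - np.arange(T)[None, :])  # a^(t-k)
mask = np.tril(np.ones((T, T)))                                 # keep terms with k <= t
parallel = (powers * mask) @ b

assert np.allclose(iterative, parallel)
```

In practice such recurrences are typically evaluated with efficient parallel scans rather than the dense matrix used here, but the equivalence between the per-frame loop and the all-at-once computation is the same.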