Building on VideoMaMa's strong generalization, the framework introduces a scalable pseudo-labeling pipeline that automatically generates high-quality matting annotations from readily available segmentation cues. This pipeline enables construction of the Matting Anything in Video (MA-V) dataset, comprising over 50,000 real-world videos annotated with pixel-accurate alpha mattes and covering a broad spectrum of everyday scenes, dynamic motion, and environmental conditions. By making large-scale matting training data broadly accessible, VideoMaMa paves the way for video editing tools, compositing workflows, and augmented-reality applications that demand clean foreground-background separation.
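As a rough illustration of how such a pseudo-labeling pipeline could be wired together, the sketch below runs a matting model over clips paired with off-the-shelf segmentation masks and writes per-frame alpha mattes to disk. The `pseudo_label_clip` function, the file layout, and the `matting_model` callable are assumptions for illustration only, not the released pipeline or API.

```python
# Hedged sketch of a pseudo-labeling pipeline: segmentation cues in, alpha mattes out.
# The directory layout and model interface here are illustrative assumptions.
from pathlib import Path

import numpy as np
import imageio.v3 as iio


def pseudo_label_clip(matting_model, frame_dir: Path, mask_dir: Path, out_dir: Path) -> None:
    """Generate per-frame alpha mattes from a clip and its coarse segmentation cues."""
    frames = np.stack([iio.imread(p) for p in sorted(frame_dir.glob("*.png"))])  # (T, H, W, 3)
    masks = np.stack([iio.imread(p) for p in sorted(mask_dir.glob("*.png"))])    # (T, H, W)

    # The segmentation mask indicates which object is foreground; the matting model
    # refines it into a soft alpha matte with fine edge and hair detail.
    alphas = matting_model(frames, masks)  # expected shape (T, H, W), values in [0, 1]

    out_dir.mkdir(parents=True, exist_ok=True)
    for t, alpha in enumerate(alphas):
        iio.imwrite(out_dir / f"{t:05d}.png", (alpha * 255).astype(np.uint8))


if __name__ == "__main__":
    # Placeholder model that just rescales the binary mask; a real run would load
    # the pre-trained VideoMaMa generator here instead.
    dummy_model = lambda frames, masks: masks.astype(np.float32) / 255.0
    pseudo_label_clip(dummy_model, Path("clips/0001"), Path("masks/0001"), Path("ma_v/0001/alpha"))
```

Keeping the model behind a plain callable makes it easy to swap the placeholder for the actual pseudo-labeler while leaving the I/O loop unchanged.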
To demonstrate practical impact, the Segment Anything Model 2 (SAM2) is fine-tuned on the MA-V dataset, yielding SAM2-Matte, which is more robust and accurate on unseen in-the-wild videos than models trained on prior matting datasets. The architecture combines mask-guided processing with diffusion-based refinement, preserving temporal consistency and fine-grained detail across video frames. All models, code, and the complete MA-V dataset are slated for public release, enabling researchers and developers to advance generative video processing and scalable annotation strategies.
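The fine-tuning step can be pictured as supervised alpha regression over MA-V-style (frames, mask prompt, ground-truth alpha) triplets. The sketch below is a minimal, hedged version of such a loop in PyTorch: the model wrapper, dataset, tensor shapes, and loss weighting are assumptions, not the released SAM2-Matte training code.

```python
# Hedged sketch of fine-tuning a promptable segmentation backbone (e.g. SAM2)
# into an alpha-matte predictor on (frames, mask prompt, alpha) triplets.
# The model wrapper and dataset are hypothetical stand-ins.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader


def matting_loss(pred_alpha: torch.Tensor, gt_alpha: torch.Tensor) -> torch.Tensor:
    """L1 on alpha values plus L1 on spatial gradients, to keep fine edges sharp."""
    l1 = F.l1_loss(pred_alpha, gt_alpha)
    dx_p, dy_p = torch.gradient(pred_alpha, dim=(-1, -2))
    dx_g, dy_g = torch.gradient(gt_alpha, dim=(-1, -2))
    grad = F.l1_loss(dx_p, dx_g) + F.l1_loss(dy_p, dy_g)
    return l1 + grad


def finetune(model: torch.nn.Module, dataset, epochs: int = 5, lr: float = 1e-5) -> None:
    loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=4)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for frames, mask_prompt, gt_alpha in loader:
            # frames: (B, T, 3, H, W); mask_prompt / gt_alpha: (B, T, 1, H, W)
            pred_alpha = model(frames, mask_prompt)  # wrapper returns alpha in [0, 1]
            loss = matting_loss(pred_alpha, gt_alpha)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The gradient term is one common way to reward crisp matte boundaries; the actual SAM2-Matte objective and prompting scheme may differ.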


