SeC builds a comprehensive semantic representation of the target from the frames it has already processed and uses that representation to segment subsequent frames robustly. It adaptively balances semantic reasoning based on a large vision-language model (LVLM) with enhanced feature matching, adjusting its computational effort to the complexity of the scene. This lets SeC produce high-quality segmentation while remaining computationally efficient. SeC has been evaluated on several benchmarks, including the newly introduced Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS), and shows substantial improvements over state-of-the-art approaches.
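The adaptive balance between the two paths can be illustrated with a minimal sketch. It is not SeC's actual implementation: the names `cosine_distance`, `choose_path`, and `route_video`, the per-frame feature vectors, the cosine-distance measure of scene change, and the threshold of 0.2 are all assumptions made for illustration. The idea shown is only that a cheap matching path handles frames that resemble their predecessor, while the costly reasoning path is reserved for frames where the scene changes sharply.

```python
from math import sqrt
from typing import List, Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (0.0 means the vectors point the same way)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    # Treat a zero vector as maximally different rather than dividing by zero.
    return 1.0 if norm == 0 else 1.0 - dot / norm


def choose_path(prev: Optional[Sequence[float]],
                cur: Sequence[float],
                threshold: float = 0.2) -> str:
    """Pick a processing path for one frame (hypothetical routing rule).

    Assumption: the first frame has no predecessor, so it takes the cheap path.
    """
    if prev is not None and cosine_distance(prev, cur) > threshold:
        return "concept_reasoning"   # stands in for the costly LVLM-based path
    return "feature_matching"        # stands in for the cheap matching path


def route_video(features: Sequence[Sequence[float]],
                threshold: float = 0.2) -> List[str]:
    """Route every frame of a video, comparing each frame with the one before it."""
    paths: List[str] = []
    prev: Optional[Sequence[float]] = None
    for cur in features:
        paths.append(choose_path(prev, cur, threshold))
        prev = cur
    return paths


# Toy global features for four frames: a small drift, then an abrupt scene change.
toy_features = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.0, 1.0]]
toy_paths = route_video(toy_features)
print(toy_paths)
```

In this toy run only the third frame, where the feature vector changes direction abruptly, is sent to the costly path. A real system would use learned features and a more careful scene-change criterion.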
SeCVOS is a benchmark designed to challenge models with substantial appearance variations and dynamic scene transformations. It comprises 160 manually annotated multi-scenario videos and is used to rigorously assess VOS methods in scenarios that demand high-level conceptual reasoning and robust semantic understanding. On SeCVOS, SeC achieves an 11.8-point improvement over SAM 2.1, establishing a new state of the art in concept-aware video object segmentation and demonstrating its effectiveness on complex video object segmentation tasks.