Taming Modality Entanglement in Continual Audio-Visual Segmentation
Yuyang Hong, Qi Yang, Tao Zhang, Zili Wang, Zhaojin Fu, Kun Ding, Bin Fan, Shiming Xiang
2025-10-27
Summary
This paper tackles the problem of teaching a computer to continuously learn to segment (outline) objects in videos using both what it sees and what it hears, without forgetting what it learned before.
What's the problem?
Current methods for continual learning, where a computer learns tasks one after another, struggle when dealing with detailed audio-visual information. Specifically, the system can get confused about what sounds mean as tasks change: for example, a sounding object learned in an earlier task may later be labeled as background. It also has trouble distinguishing between objects that often appear together, leading to misidentification.
What's the solution?
The researchers developed a new learning framework called Collision-based Multi-modal Rehearsal (CMR). This framework addresses the confusion by carefully selecting past examples to 'rehearse' during learning. It prioritizes examples where the audio and visual information strongly match, reducing multi-modal semantic drift. It also increases how often the system rehearses examples of classes that are frequently confused with each other, helping it learn to tell them apart.
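The two ideas above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the sample format, the dot-product consistency score, and the "1 + collision count" weighting are all assumptions made for clarity.

```python
import random
from collections import defaultdict

def select_rehearsal_samples(samples, buffer_size):
    """Toy sketch of Multi-modal Sample Selection (MSS): keep the
    samples whose audio and visual embeddings agree most strongly.
    Assumes each sample carries unit-norm 'audio' and 'visual'
    feature vectors, so the dot product acts as cosine similarity."""
    def consistency(s):
        return sum(a * v for a, v in zip(s["audio"], s["visual"]))
    return sorted(samples, key=consistency, reverse=True)[:buffer_size]

def rehearsal_weights(buffer, collision_counts):
    """Toy sketch of Collision-based Sample Rehearsal (CSR): classes
    that collide (are confused) more often get a higher replay weight.
    The '1 + count' scheme is a placeholder, not the paper's formula."""
    return [1.0 + collision_counts[s["label"]] for s in buffer]

def draw_rehearsal_batch(buffer, collision_counts, k, rng=random):
    """Sample a rehearsal batch, biased toward confusable classes."""
    weights = rehearsal_weights(buffer, collision_counts)
    return rng.choices(buffer, weights=weights, k=k)
```

In this sketch, a sample with mismatched audio and visual features (e.g., off-screen noise) scores low on consistency and is dropped from the buffer, while a class that keeps being misclassified accumulates collisions and gets replayed more often.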
Why it matters?
This work is important because it improves the ability of computers to learn continuously from both audio and video, which is crucial for applications like robotics, video surveillance, and creating more intelligent assistants. By overcoming the challenges of modality entanglement and co-occurrence confusion, the system becomes more reliable and accurate in real-world scenarios.
Abstract
Recently, significant progress has been made in multi-modal continual learning, which aims to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks and have limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, which aims to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding object is labeled as background in sequential tasks; and 2) co-occurrence confusion, where frequently co-occurring classes tend to be confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed to increase the rehearsal frequency of confusable classes during training. Moreover, we construct three audio-visual incremental scenarios to verify the effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.
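To make the first challenge concrete, here is a minimal sketch of how multi-modal semantic drift arises in incremental segmentation. The function name and label convention (0 = background) are assumptions for illustration, not the paper's code: at step t, only the current task's classes keep their labels, so pixels of previously learned sounding objects collapse to background.

```python
def relabel_for_task(mask, current_classes, background=0):
    """Illustrative sketch: in incremental step t, ground truth only
    annotates the classes of the current task. Pixels belonging to
    classes learned in earlier tasks (e.g., a still-sounding object)
    are relabeled as background -- the multi-modal semantic drift
    that rehearsal is meant to counteract."""
    return [[px if px in current_classes else background for px in row]
            for row in mask]
```

For example, if class 1 (say, "dog") was learned in task 1 and task 2 only annotates class 2, every "dog" pixel in task 2's masks becomes background even while the dog is audibly barking.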