Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models
Xiwen Wei, Mustafa Munir, Radu Marculescu
2025-12-05
Summary
This research focuses on improving how AI systems that can both understand and create images learn new things over time without forgetting what they already know. These systems, called Unified Multimodal Generative Models, are powerful but struggle when asked to continually learn new tasks.
What's the problem?
When these AI models learn a new skill, they tend to forget previously learned skills. This happens both within a single type of information, like images, and *between* different types of information, like images and text. The forgetting between different types of information, for example forgetting how to understand images while learning to generate them, is a particularly big problem that has not been well studied. The core issue is that updating the model for one type of information interferes with the knowledge it has about other types of information, because the two objectives can push the shared parameters in conflicting directions.
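The "conflicting updates" idea can be made concrete with a toy example. Below, two hypothetical quadratic losses stand in for the understanding and generation objectives acting on one shared parameter (the loss functions are purely illustrative, not the paper's actual objectives): their gradients point in opposite directions, so a step that improves one objective worsens the other.

```python
import numpy as np

# Hypothetical toy losses standing in for the understanding and
# generation objectives of a shared (modality-coupled) parameter w.
def loss_understanding(w):
    return (w[0] - 2.0) ** 2       # pulls w[0] toward +2

def loss_generation(w):
    return (w[0] + 2.0) ** 2       # pulls w[0] toward -2

def grad(loss, w, eps=1e-5):
    # central-difference numerical gradient; exact here for quadratics
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

w = np.array([0.5])
g_und = grad(loss_understanding, w)
g_gen = grad(loss_generation, w)

# Gradient conflict: the two gradients have negative inner product.
print(np.dot(g_und, g_gen) < 0)    # True

# A descent step on the generation loss *increases* the understanding loss.
lr = 0.1
before = loss_understanding(w)
after = loss_understanding(w - lr * g_gen)
print(after > before)              # True
```

This negative inner product between modality gradients is the kind of interference the paper attributes inter-modal forgetting to; MoDE's decoupling is designed so these conflicting updates never land on the same weights.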
What's the solution?
The researchers developed a new system called Modality-Decoupled Experts, or MoDE. This system works by keeping the parts of the AI that handle different types of information – like images and text – separate during the learning process. This prevents the updates for one type of information from messing up the knowledge of the others. They also use a technique called knowledge distillation, which helps the model retain its existing skills while learning new ones. Essentially, MoDE isolates learning so that one skill doesn't overwrite another.
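A minimal sketch of the decoupling idea (names and routing logic here are illustrative assumptions, not the authors' implementation): each modality gets its own expert weights, and a token's modality tag decides both which expert processes it and which expert's weights its gradient touches, so updates for one modality cannot disturb the other.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy feature dimension

# One expert weight matrix per modality (illustrative stand-in for MoDE experts).
experts = {
    "text":  rng.normal(size=(d, d)),
    "image": rng.normal(size=(d, d)),
}

def forward(token, modality):
    # Hard routing by modality tag: no weight sharing across modalities.
    return experts[modality] @ token

def update(token, modality, grad_out, lr=0.01):
    # For output = W @ token, the gradient of (grad_out . output) w.r.t. W
    # is outer(grad_out, token). Only the routed expert is updated, so a
    # text-side update can never interfere with the image expert.
    experts[modality] -= lr * np.outer(grad_out, token)

image_before = experts["image"].copy()
text_before = experts["text"].copy()
update(rng.normal(size=d), "text", rng.normal(size=d))
print(np.allclose(experts["image"], image_before))  # True: image expert untouched
print(np.allclose(experts["text"], text_before))    # False: text expert was updated
```

On top of this isolation, the paper adds a knowledge-distillation loss against the pre-trained model so that each expert also retains its original capabilities while learning new tasks; that part is omitted from the sketch.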
Why it matters?
This work is important because it allows AI systems that work with multiple types of information, like images and text, to continually learn and improve without constantly forgetting what they’ve already learned. This is a crucial step towards building more robust and adaptable AI that can handle real-world tasks which often involve understanding and generating different kinds of data.
Abstract
Unified Multimodal Generative Models (UMGMs) unify visual understanding and image generation within a single autoregressive framework. However, their ability to continually learn new tasks is severely hindered by catastrophic forgetting, both within a modality (intra-modal) and across modalities (inter-modal). While intra-modal forgetting has been studied in prior continual learning (CL) work, inter-modal forgetting remains largely unexplored. In this paper, we identify and empirically validate this phenomenon in UMGMs and provide a theoretical explanation rooted in gradient conflict between modalities. To address both intra- and inter-modal forgetting, we propose Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture that isolates modality-specific updates to mitigate the gradient conflict and leverages knowledge distillation to prevent catastrophic forgetting and preserve pre-trained capabilities. Unlike previous CL methods that remain modality-coupled and suffer from modality gradient conflict, MoDE explicitly decouples modalities to prevent interference. Experiments across diverse benchmarks demonstrate that MoDE significantly mitigates both inter- and intra-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings. Code will be publicly available: https://github.com/Christina200/MoDE-official.git