On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong
2026-03-31
Summary
This paper focuses on improving how large AI models that understand both images and text, called Large Vision Language Models (LVLMs), learn new things over time without forgetting what they already know. It's about 'continual learning' in the world of AI vision and language.
What's the problem?
When you try to teach these models new skills continuously, they tend to 'forget' older skills. One way to help them learn continuously is to add new 'experts' to the model (a 'Mixture of Experts' design), but even with these experts working somewhat independently, the model still gets confused. Specifically, the way the model decides which expert should handle a piece of information ('routing') drifts over time, causing the model to incorrectly send information meant for old tasks to the new experts, leading to forgetting. The problem is that certain types of information, especially ambiguous or older data, don't help the new experts learn much but *do* cause this harmful routing drift.
What's the solution?
The researchers developed a new system called LLaVA-DyMoE that addresses this 'routing drift'. It carefully manages how new experts are added and how information is sent to them. It analyzes how confidently the model routes different types of information and then gently guides ambiguous or older information *away* from the new experts. This prevents the new experts from getting confused and messing up the model's ability to handle older tasks. They also encourage the new experts to specialize and stay separate from the existing ones.
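To make the idea concrete, here is a minimal NumPy sketch of what "drift-aware token assignment" could look like. It assumes a standard softmax router, that new experts are appended after the old ones, and that a token counts as "ambiguous" when its highest routing score is low (a flat distribution); these are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def routing_scores(tokens, expert_keys, temperature=1.0):
    """Softmax routing scores of each token over all experts.

    tokens: (n_tokens, d), expert_keys: (n_experts, d).
    """
    logits = tokens @ expert_keys.T / temperature
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def drift_aware_penalty(scores, n_old_experts, ambiguity_thresh=0.5):
    """Penalize the routing mass that ambiguous tokens place on new experts.

    Ambiguity criterion (hypothetical): a token whose max routing score is
    below `ambiguity_thresh` has no clear expert preference; steering its
    mass away from the new experts preserves old routing patterns.
    """
    max_score = scores.max(axis=1)
    ambiguous = max_score < ambiguity_thresh
    new_mass = scores[:, n_old_experts:].sum(axis=1)  # mass on new experts
    return float((ambiguous * new_mass).sum())

# Two tokens over 3 experts (experts 0-1 old, expert 2 new):
# the first token routes confidently, the second is ambiguous.
scores = np.array([[0.90, 0.05, 0.05],
                   [0.34, 0.33, 0.33]])
penalty = drift_aware_penalty(scores, n_old_experts=2)
```

Only the ambiguous second token contributes to the penalty (its 0.33 mass on the new expert), so minimizing this term during training would push such tokens back toward the frozen old experts without touching confident tokens.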
Why it matters?
This research is important because it makes continual learning more effective for these powerful vision and language models. By reducing forgetting, we can build AI systems that constantly improve and adapt to new information without losing their previous knowledge, making them more reliable and useful in real-world applications.
Abstract
Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.
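The abstract also mentions "complementary routing score regularizations" that enforce expert-group separation and promote new-expert specialization. A hedged sketch of what such terms could look like, again assuming new experts are appended after old ones (the exact losses in the paper may differ):

```python
import numpy as np

def group_separation_reg(scores, n_old):
    """Encourage each token to commit to one expert group.

    Hypothetical form: penalize the product of the routing mass a token
    places on the old-expert group and on the new-expert group, which is
    zero exactly when the token routes entirely within one group.
    """
    old_mass = scores[:, :n_old].sum(axis=1)
    new_mass = scores[:, n_old:].sum(axis=1)
    return float((old_mass * new_mass).mean())

def specialization_reg(scores, n_old):
    """Promote peaked (specialized) routing within the new-expert group.

    Hypothetical form: entropy of the renormalized new-expert scores;
    low entropy means each token prefers one specific new expert.
    """
    new = scores[:, n_old:]
    p = new / (new.sum(axis=1, keepdims=True) + 1e-12)
    ent = -(p * np.log(p + 1e-12)).sum(axis=1)
    return float(ent.mean())
```

A token routed entirely to old experts incurs zero separation penalty, while one splitting its mass evenly across both groups is penalized most; the entropy term is minimized when new-task tokens each settle on a single new expert.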