Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu
2025-01-28
Summary
This paper introduces a new AI model called Mixture-of-Mamba, which improves on existing State Space Models (SSMs) by making them better at handling different types of data (like text, images, and speech) at the same time. It does this by introducing 'modality-aware sparsity,' which gives each type of data its own dedicated parameters so the model can process each modality more efficiently.
What's the problem?
State Space Models are good at processing sequences of data efficiently, but they struggle when dealing with multiple types of data at once (like combining text and images). This limits their usefulness in tasks that require understanding different kinds of information together.
What's the solution?
The researchers created Mixture-of-Mamba, which adapts the SSM architecture to handle multiple types of data more effectively. They tested it on three different setups that combine text with images or speech. The new model matched the quality of previous models while using much less computational power, in some cases less than half the resources.
Why it matters?
This research matters because it shows a way to make AI models that can understand multiple types of information (like text, images, and speech) more efficiently. This could lead to more powerful and versatile AI systems that use less energy and computing resources. It's particularly important for developing AI that can interact with the world in more human-like ways, understanding both what it sees and hears. The efficiency gains could also make it easier and cheaper to run these complex AI models, potentially making advanced AI more accessible for various applications.
Abstract
State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba (MoM), a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al., arXiv:2411.04996, 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, MoM matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code can be accessed at https://github.com/Weixin-Liang/Mixture-of-Mamba.
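The core idea of modality-aware sparsity, as the abstract describes it, is that each modality gets its own copy of certain projection parameters while the rest of the block is shared, so each token is routed through the weights for its modality. The following is a minimal, hypothetical NumPy sketch of that routing step, not the paper's actual Mamba-block implementation; the function name, the use of plain dense matrices, and the integer modality IDs are all illustrative assumptions.

```python
import numpy as np

def modality_aware_projection(x, modality_ids, weights):
    """Route each token through the projection matrix of its modality.

    x            : (num_tokens, d_in) token embeddings
    modality_ids : (num_tokens,) integer modality index per token
    weights      : list of (d_in, d_out) matrices, one per modality

    Illustrative sketch only: the real model applies this decoupling
    to specific projections inside the Mamba block, not one dense layer.
    """
    out = np.empty((x.shape[0], weights[0].shape[1]))
    for m, W in enumerate(weights):
        mask = modality_ids == m      # select tokens of modality m
        out[mask] = x[mask] @ W       # apply that modality's weights
    return out

# Toy usage: 6 tokens of dimension 4, two modalities (e.g. text=0, image=1)
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))
ids = np.array([0, 1, 0, 1, 0, 1])
per_modality_W = [rng.standard_normal((4, 3)) for _ in range(2)]
projected = modality_aware_projection(tokens, ids, per_modality_W)
```

Because each token touches only its own modality's parameters, the per-token compute matches a single dense projection even though total parameters grow with the number of modalities, which is consistent with the paper's framing of sparsity that preserves computational efficiency.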