
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, Xinggang Wang

2025-02-19


Summary

This paper introduces mmMamba, a new way to build AI models that understand both text and images (multimodal models) more efficiently. It uses state space models, an architecture that processes information faster and with less memory than the Transformer-based methods used today.

What's the problem?

Current multimodal AI models, which can understand both text and images, perform well but have some big drawbacks. Their computation grows quadratically with input length, their memory use (the key-value cache) keeps growing as they process longer inputs, and they usually rely on a separate vision encoder to handle images. This makes them hard to deploy in real-world applications where speed and efficiency matter.
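To make the scaling problem concrete, here is a small illustrative sketch (not from the paper; all model dimensions below are made-up defaults) of how attention compute grows quadratically with sequence length while the key-value cache grows linearly:

```python
# Illustrative sketch of Transformer scaling costs. The model dimensions
# (d_model, n_layers, n_heads, head_dim) are hypothetical defaults chosen
# only to show the growth rates, not the paper's actual configuration.

def attention_flops(seq_len: int, d_model: int = 4096) -> int:
    """Attention score computation scales as O(n^2 * d):
    every token attends to every other token."""
    return seq_len * seq_len * d_model

def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_heads: int = 32,
                   head_dim: int = 128, bytes_per_val: int = 2) -> int:
    """The KV cache stores a key and a value vector per token, per head,
    per layer, so it grows linearly with sequence length."""
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_val

for n in (1_000, 10_000, 103_000):
    print(f"{n:>7} tokens: "
          f"{attention_flops(n):.2e} attention FLOPs, "
          f"{kv_cache_bytes(n) / 2**30:.1f} GiB KV cache")
```

A state space layer like Mamba instead keeps a fixed-size recurrent state, so its per-token compute and memory stay constant no matter how long the input gets.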

What's the solution?

The researchers created mmMamba, which transforms existing multimodal models into more efficient ones through distillation: a simpler student model is trained to mimic a more complex teacher. mmMamba comes in two variants: a fully linear model (mmMamba-linear) that is highly efficient, and a hybrid (mmMamba-hybrid) that mixes Transformer and Mamba layers to balance efficiency and performance. The method also works without needing a pre-trained RNN-based language model or a separate vision encoder, making it easier for other researchers to use and build on.
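The core distillation idea can be sketched with a standard KL-divergence objective: the student is trained so its output distribution matches the teacher's. This is a generic minimal sketch, not the paper's actual three-stage recipe, and the function names are hypothetical:

```python
# Hedged sketch of knowledge distillation: a student model (here, one that
# would use Mamba layers) is trained to match a teacher Transformer's output
# distribution. This shows only the loss, not the paper's training stages.
import numpy as np

def softmax(logits, temp=1.0):
    z = logits / temp
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_kl(teacher_logits, student_logits, temp=2.0):
    """KL(teacher || student) over the vocabulary, averaged over tokens.
    Minimizing this pushes the student's predictions toward the teacher's."""
    p = softmax(teacher_logits, temp)  # teacher's "soft targets"
    q = softmax(student_logits, temp)  # student's current predictions
    kl = np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9)), axis=-1)
    return float(np.mean(kl))

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 100))          # 4 tokens, 100-word toy vocabulary
print(distill_kl(t, t))                # identical outputs: loss is ~0
print(distill_kl(t, np.zeros_like(t))) # mismatched outputs: positive loss
```

In practice this loss would be computed with a deep-learning framework and backpropagated through the student only; the teacher's weights stay frozen.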

Why it matters?

This matters because it could make powerful AI that understands both text and images much more practical for everyday applications. At sequences of 103,000 tokens, mmMamba-linear runs 20.6 times faster than the original model while using 75.8% less GPU memory. This could lead to more responsive AI assistants, better image understanding across many fields, and new applications that were previously out of reach due to computational limits.

Abstract

Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba from the trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates a 20.6x speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5x speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba