
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models

Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang

2025-03-12

Summary

This paper introduces OmniMamba, an AI model that can both understand and generate text and images efficiently by using a faster, leaner model design instead of bulky Transformer-based systems.

What's the problem?

Current AI models that handle text and images together are slow because their computation grows quadratically with sequence length, demand heavy GPU resources, and need millions of training examples to learn properly.

What's the solution?

OmniMamba uses a streamlined Mamba-2 backbone to process text and images with linear-time efficiency, gives each task its own vocabulary and lightweight adapters (LoRA) so the two tasks don't interfere, and trains in two stages to balance learning without needing tons of data.
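To make the "separate vocabularies plus task-specific adapters" idea concrete, here is a minimal toy sketch. Everything in it is a hypothetical stand-in: the tiny vocab sizes, the `shared_backbone` and `task_adapter` functions, and the modular arithmetic are placeholders for the real Mamba-2 layers and learned LoRA weights; only the routing structure mirrors the paper's design.

```python
# Toy sketch of decoupled vocabularies with task-specific adapters.
# All sizes and functions below are illustrative stand-ins, not the
# actual OmniMamba components.

TEXT_VOCAB_SIZE = 8    # hypothetical; real text vocabs have tens of thousands of tokens
IMAGE_VOCAB_SIZE = 8   # hypothetical codebook size for discrete image tokens

# Image token ids live in a separate range so the two modalities never collide.
IMAGE_OFFSET = TEXT_VOCAB_SIZE

def shared_backbone(tokens):
    """Stand-in for the shared Mamba-2 backbone: maps tokens to a toy hidden state."""
    return sum(tokens) % 13

def task_adapter(hidden, task):
    """Stand-in for task-specific LoRA: a cheap per-task adjustment of the hidden state."""
    return hidden + (1 if task == "generation" else 0)

def next_token(tokens, task):
    """Predict the next token, drawing only from the active task's vocabulary."""
    h = task_adapter(shared_backbone(tokens), task)
    if task == "understanding":               # decode from the text vocabulary
        return h % TEXT_VOCAB_SIZE
    else:                                     # decode from the offset image vocabulary
        return IMAGE_OFFSET + h % IMAGE_VOCAB_SIZE

# Text decoding stays in [0, TEXT_VOCAB_SIZE); image decoding stays in the offset range.
text_tok = next_token([1, 2, 3], "understanding")
image_tok = next_token([1, 2, 3], "generation")
```

The point of the decoupling is visible in `next_token`: both tasks share one backbone, but each gets its own adapter and its own output range, which is what lets one model serve both jobs without the tasks overwriting each other.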

Why does it matter?

This makes AI tools for art, design, or content creation faster and cheaper to run, helping creators work with text and visuals without needing supercomputers.

Abstract

Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between the two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2 times speedup and 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at https://github.com/hustvl/OmniMamba.
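The abstract's efficiency claim rests on attention's quadratic cost versus a state-space model's linear cost per sequence. A back-of-envelope sketch, with purely illustrative operation counts (the speedup numbers in the paper come from real benchmarks, not this arithmetic), shows why the gap widens for long sequences:

```python
# Illustrative operation counts only; constants are not measurements from the paper.

def attention_ops(seq_len):
    """Self-attention decoding: token t attends to all t previous positions."""
    return sum(t for t in range(1, seq_len + 1))   # O(L^2) total work

def ssm_ops(seq_len):
    """State-space recurrence: constant work per generated token."""
    return seq_len                                  # O(L) total work

# The advantage grows with sequence length, which is where the paper
# reports its largest speedups over Transformer-based counterparts.
ratio_short = attention_ops(100) / ssm_ops(100)
ratio_long = attention_ops(10_000) / ssm_ops(10_000)
```

Because the ratio itself scales roughly with sequence length, long-sequence generation is exactly the regime where a fixed-size recurrent state beats a growing attention cache, in both compute and GPU memory.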