
MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Ali Hatamizadeh, Jan Kautz

2024-07-13


Summary

This paper introduces MambaVision, a hybrid backbone that combines the Mamba and Transformer architectures to improve how computers understand images. It is designed specifically for computer vision tasks, such as recognizing objects and classifying images.

What's the problem?

Many existing models struggle to understand images fully because they either focus too much on local features (small parts of an image) or fail to capture relationships between distant parts of an image. This makes it hard for them to perform well on complex visual tasks.

What's the solution?

MambaVision addresses these issues by redesigning the Mamba block to handle visual data better and by adding Transformer self-attention blocks in the final layers of the network. This combination lets the model capture both local details and long-range relationships in images more effectively (see the sketch below). The researchers tested MambaVision on various tasks, including image classification and object detection, and found that it outperformed comparably sized models, achieving state-of-the-art results on several benchmarks.
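To make that layout concrete, here is a minimal PyTorch sketch of a hybrid stage in this style: the first half of the blocks use a simplified Mamba-style mixer and the second half use standard self-attention. The block counts, dimensions, and the simplified mixer (which omits the selective scan of a real Mamba block) are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a hybrid Mamba/attention stage, assuming a pre-norm
# residual block design. The SimpleMambaMixer below is a stand-in for the
# paper's redesigned mixer, not its actual implementation.
import torch
import torch.nn as nn

class SimpleMambaMixer(nn.Module):
    """Stand-in mixer: a depthwise-conv branch plus a gating branch
    (a real Mamba block would add a selective scan here)."""
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.act = nn.SiLU()
        self.out_proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                                # x: (B, N, C)
        x, gate = self.in_proj(x).chunk(2, dim=-1)
        x = self.act(self.conv(x.transpose(1, 2)).transpose(1, 2))
        return self.out_proj(torch.cat([x, self.act(gate)], dim=-1))

class Block(nn.Module):
    """Pre-norm residual block whose token mixer is either the Mamba-style
    mixer or multi-head self-attention, followed by an MLP."""
    def __init__(self, dim, use_attention):
        super().__init__()
        self.use_attention = use_attention
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = (nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
                      if use_attention else SimpleMambaMixer(dim))
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        h = (self.mixer(h, h, h, need_weights=False)[0]
             if self.use_attention else self.mixer(h))
        x = x + h
        return x + self.mlp(self.norm2(x))

class HybridStage(nn.Module):
    """Stage of depth D: the first D//2 blocks are Mamba-style mixers, the
    last D//2 are self-attention, mirroring the paper's finding that
    attention at the end helps capture long-range spatial dependencies."""
    def __init__(self, dim, depth):
        super().__init__()
        self.blocks = nn.ModuleList(
            Block(dim, use_attention=(i >= depth // 2)) for i in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

tokens = torch.randn(2, 196, 256)   # (batch, 14x14 patch tokens, channels)
print(HybridStage(dim=256, depth=8)(tokens).shape)  # torch.Size([2, 196, 256])
```

The key design choice this sketch illustrates is the ordering: cheap sequence mixing early in the stage, with the more expensive global self-attention reserved for the final blocks.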

Why it matters?

This research is important because it represents a significant advancement in how AI models process visual information. By improving the ability of models to understand images more accurately and efficiently, MambaVision can enhance applications in areas like robotics, autonomous vehicles, and medical imaging, making these technologies more effective.

Abstract

We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For image classification on the ImageNet-1K dataset, MambaVision model variants achieve a new State-of-the-Art (SOTA) performance in terms of Top-1 accuracy and image throughput. In downstream tasks such as object detection, instance segmentation and semantic segmentation on MS COCO and ADE20K datasets, MambaVision outperforms comparably-sized backbones and demonstrates more favorable performance. Code: https://github.com/NVlabs/MambaVision.
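For reference, the released checkpoints can reportedly be loaded through Hugging Face transformers with remote code enabled; the model ID below is an assumption based on the repository's naming and is not stated in the abstract, so check the GitHub link above for the actual checkpoint names.

```python
# Hedged sketch: load a pretrained MambaVision backbone via transformers.
# The "nvidia/MambaVision-T-1K" checkpoint name is an assumption about the
# released models, not confirmed by this abstract.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("nvidia/MambaVision-T-1K",
                                  trust_remote_code=True)
model.eval()

with torch.no_grad():
    pixels = torch.randn(1, 3, 224, 224)  # dummy ImageNet-sized input
    outputs = model(pixels)               # output structure depends on the repo's code
```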