EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim

2024-11-27

Summary

This paper introduces EfficientViM, a lightweight vision backbone built on state space models (Mamba) that is designed to run efficiently in resource-constrained environments while maintaining high accuracy on image recognition tasks.

What's the problem?

Many existing vision models require a lot of computational power and memory, making them hard to deploy where resources are limited, such as on mobile devices or in real-time applications. Trimming these models down to fit such constraints usually means slower inference or less accurate image analysis.

What's the solution?

The authors developed EfficientViM, a vision backbone built around a hidden state mixer-based state space duality (HSM-SSD) layer. Rather than mixing channels across the full set of image tokens, this layer performs channel mixing within a much smaller set of hidden states, so the network captures global dependencies at reduced computational cost. They also introduce multi-stage hidden state fusion to strengthen the hidden-state representations and a design that reduces memory-bound operations, resulting in a better speed-accuracy trade-off than comparable models.
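
To make the core idea concrete, below is a minimal PyTorch sketch of channel mixing performed in hidden-state space instead of token space. This is not the authors' implementation (the official code is in the linked repository); the module name, the aggregation scheme, and the projection shapes are illustrative assumptions, and only the general pattern (many tokens -> few hidden states -> channel mixing -> back to tokens) reflects the paper's description.

```python
# Minimal conceptual sketch (NOT the official EfficientViM code).
# Idea: aggregate L tokens into N hidden states (N << L), mix channels there,
# then scatter the mixed states back to the tokens.
import torch
import torch.nn as nn


class HiddenStateMixerSketch(nn.Module):
    """Toy hidden-state channel mixer (illustrative, not the paper's HSM-SSD)."""

    def __init__(self, dim: int, num_states: int = 16):
        super().__init__()
        self.to_states = nn.Linear(dim, num_states)  # per-token weight for each state
        self.mixer = nn.Sequential(                  # channel mixing, applied to states only
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, dim) -- a sequence of image tokens
        attn = self.to_states(x).softmax(dim=1)           # (B, L, N), normalized over tokens
        hidden = torch.einsum("bln,bld->bnd", attn, x)    # (B, N, dim): N hidden states
        hidden = self.mixer(hidden)                       # channel mixing on N << L states
        out = torch.einsum("bln,bnd->bld", attn, hidden)  # broadcast states back to tokens
        return self.norm(x + out)                         # residual + norm


if __name__ == "__main__":
    layer = HiddenStateMixerSketch(dim=64, num_states=16)
    tokens = torch.randn(2, 196, 64)   # e.g. a 14x14 grid of image patches
    print(layer(tokens).shape)         # torch.Size([2, 196, 64])
```

Because the mixer operates on N hidden states rather than L tokens, its cost no longer grows with the number of image patches, which is the kind of saving the paper targets.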

Why it matters?

This research matters because it makes high-quality image recognition more practical on devices with limited resources. By improving the speed-accuracy trade-off of efficient vision backbones, EfficientViM can benefit applications such as mobile photography, augmented reality, and other settings where fast, accurate image analysis is essential.

Abstract

For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model (SSM) has emerged as an effective mechanism for global token interaction thanks to its favorable linear computational cost in the number of tokens. Yet, efficient vision backbones built with SSMs have been explored less. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and provide a design that alleviates the bottleneck caused by memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% accuracy improvement over the second-best model, SHViT, while running faster. Further, we observe significant improvements in throughput and accuracy over prior works when scaling image resolution or employing distillation training. Code is available at https://github.com/mlvlab/EfficientViM.
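
As a rough illustration of why mixing channels inside the hidden states is cheaper, the back-of-the-envelope script below counts multiply-accumulates for a single dense d x d channel-mixing step applied to all L tokens versus only N hidden states. The token count, channel width, and number of hidden states are assumed example values, not figures from the paper.

```python
# Back-of-the-envelope cost of one dense channel-mixing step (d x d matmul),
# applied either to every token or only to the hidden states.
# All numbers below are illustrative assumptions, not values from the paper.
L = 196   # tokens, e.g. a 14x14 grid of image patches
d = 192   # channel width
N = 16    # hidden states (typically N << L for state space layers)

macs_on_tokens = L * d * d   # mixing channels for every token
macs_on_states = N * d * d   # mixing channels only inside the hidden states

print(f"on tokens : {macs_on_tokens:,} MACs")                  # 7,225,344
print(f"on states : {macs_on_states:,} MACs")                  # 589,824
print(f"reduction : {macs_on_tokens / macs_on_states:.1f}x")   # 12.2x (= L / N)
```

The saving scales roughly as L / N, which also suggests why the gap versus token-space mixing widens as image resolution (and hence the token count L) grows.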