
VSSD: Vision Mamba with Non-Causal State Space Duality

Yuheng Shi, Minjing Dong, Mingjia Li, Chang Xu

2024-07-29


Summary

This paper introduces VSSD (Vision Mamba with Non-Causal State Space Duality), a model that improves how computers understand images by letting them process visual information more flexibly and efficiently.

What's the problem?

Vision transformers have made great strides in image processing, but their attention mechanism demands a lot of computing power, and the cost grows quickly with long sequences. State space models (SSMs) are a cheaper alternative with linear complexity, but they are inherently causal: each token can only use information from tokens that came before it in the scan order. That ordering makes sense for text, yet it is a poor fit for images, where relevant context surrounds every location in all directions.

What's the solution?

To address this, the authors developed VSSD, which reformulates state space duality (SSD, the mechanism behind Mamba2) in a non-causal way. Instead of letting a token's contribution to the hidden state depend on everything that came before it, VSSD discards the magnitude of the interactions between the hidden state and the tokens while keeping their relative weights, so each token contributes independently of sequence order. Combined with multi-scan strategies that traverse the image in several directions and merge the results, the model can gather context from the whole image at once. Extensive experiments showed that VSSD outperforms existing SSM-based models on tasks like image classification and object detection while also being more efficient. A minimal sketch of the non-causal update appears below.
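To make the idea concrete, here is a minimal, hedged sketch of a non-causal SSD-style update. The function name, shapes, and per-token weights `a` and projections `B` and `C` are illustrative assumptions, not the authors' exact implementation; the point is only that each token writes into a shared hidden state without any cumulative dependence on preceding tokens.

```python
import torch

def nc_ssd(x, a, B, C):
    """Sketch of a non-causal SSD-style update (illustrative, simplified).

    x: (batch, length, d)  token features
    a: (batch, length)     per-token scalar weights (relative, not cumulative)
    B: (batch, length, n)  per-token input projections
    C: (batch, length, n)  per-token output projections
    """
    # Every token writes into one shared hidden state, weighted by a_t.
    # There is no cumulative product over preceding tokens, so the update
    # does not depend on sequence order (non-causal).
    H = torch.einsum('bl,bln,bld->bnd', a, B, x)   # (batch, n, d)
    # Every position reads from the same global hidden state.
    return torch.einsum('bln,bnd->bld', C, H)      # (batch, length, d)
```

Because the hidden state is shared rather than accumulated step by step, the whole update reduces to two batched contractions, which is also where the efficiency gain in this formulation comes from.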

Why it matters?

This research is important because it enhances the capabilities of computer vision models, making them better at understanding images and scenes. By improving how these models work, VSSD can lead to advancements in areas like self-driving cars, robotics, and any technology that relies on visual understanding, ultimately making these systems smarter and more reliable.

Abstract

Vision transformers have significantly advanced the field of computer vision, offering robust modeling capabilities and a global receptive field. However, their high computational demands limit their applicability in processing long sequences. To tackle this issue, State Space Models (SSMs) have gained prominence in vision tasks as they offer linear computational complexity. Recently, State Space Duality (SSD), an improved variant of SSMs, was introduced in Mamba2 to enhance model performance and efficiency. However, the inherent causal nature of SSD/SSMs restricts their applications in non-causal vision tasks. To address this limitation, we introduce the Visual State Space Duality (VSSD) model, which has a non-causal format of SSD. Specifically, we propose to discard the magnitude of interactions between the hidden state and tokens while preserving their relative weights, which relieves the dependence of token contributions on previous tokens. Together with the involvement of multi-scan strategies, we show that the scanning results can be integrated to achieve non-causality, which not only improves the performance of SSD in vision tasks but also enhances its efficiency. We conduct extensive experiments on various benchmarks including image classification, detection, and segmentation, where VSSD surpasses existing state-of-the-art SSM-based models. Code and weights are available at https://github.com/YuHengsss/VSSD.
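As a complement to the abstract's point that multiple scanning results can be integrated to achieve non-causality, here is a hedged sketch of how bidirectional scans might be merged. The `causal_scan` argument is an assumed stand-in for any causal SSD-style scan, not an API from the released code, and averaging is just one simple way to combine directions.

```python
import torch

def multi_scan_merge(x, causal_scan):
    """Illustrative merge of two scan directions (assumptions, not the
    authors' exact scheme).

    x:           (batch, length, d) tokens from a flattened feature map
    causal_scan: function mapping (batch, length, d) -> (batch, length, d)
    """
    fwd = causal_scan(x)                   # left-to-right scan
    bwd = causal_scan(x.flip(1)).flip(1)   # right-to-left scan, realigned
    # Averaging the two directions gives every token access to context on
    # both sides, approximating a non-causal receptive field.
    return 0.5 * (fwd + bwd)
```

For 2D feature maps, the same pattern extends to additional scan orders (for example, a column-major traversal) whose results are merged the same way.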