EVLM: An Efficient Vision-Language Model for Visual Understanding

Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang

2024-07-22

Summary

This paper presents EVLM, a new vision-language model designed to improve how machines understand images and text together. It aims to cut the computation needed to process visual inputs, especially long sequences of visual information such as videos, while still capturing the visual signal as fully as possible.

What's the problem?

Most current models that combine visual and textual information follow a LLaVA-style design: they take features from a single layer of the vision encoder and feed them directly into the language model alongside the text tokens. With long visual inputs such as videos, the language model's self-attention over all of these visual tokens becomes very expensive, and relying on a single layer of visual features limits how fully the model can perceive the visual signal. The result is longer processing time, which is a problem for applications that need quick responses or real-time analysis.

What's the solution?

The authors propose a more efficient model called EVLM that combines three ideas. First, instead of feeding visual tokens directly into the language model, it uses Flamingo-style cross-attention, so text tokens attend to the visual features; this keeps the language model's input sequence short. Second, it uses hierarchical visual features taken from multiple layers of the vision encoder (ViT), giving the model access to both low-level detail and high-level semantics. Third, it introduces a Mixture of Experts (MoE) mechanism, which increases model capacity by routing each token to a small set of specialized experts while keeping the computation per token roughly constant. Together, these changes let EVLM perform well on tasks like image and video captioning while keeping computational costs low.
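To make these ideas concrete, below is a minimal sketch of a Flamingo-style gated cross-attention block whose feed-forward network is a simple top-1 Mixture of Experts. This is an illustrative reconstruction, not the authors' code: the module names, the top-1 routing choice, the gating scheme, and all dimensions (`GatedCrossAttentionBlock`, `MoEFeedForward`, `dim=512`) are assumptions.

```python
# Illustrative sketch (not the paper's implementation): text tokens cross-attend
# to visual tokens, and the feed-forward stage is a tiny top-1 Mixture of Experts.
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Routes each token to one of `num_experts` small MLPs (top-1 routing, assumed)."""

    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, seq, dim)
        scores = self.gate(x).softmax(dim=-1)   # (batch, seq, num_experts)
        top_w, top_idx = scores.max(dim=-1)     # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = expert(x[mask]) * top_w[mask].unsqueeze(-1)
        return out


class GatedCrossAttentionBlock(nn.Module):
    """Text tokens attend to visual tokens; learned gates start near zero (Flamingo-style)."""

    def __init__(self, dim, num_heads=8, num_experts=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = MoEFeedForward(dim, hidden=4 * dim, num_experts=num_experts)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        q = self.norm_q(text_tokens)
        kv = self.norm_kv(visual_tokens)
        attended, _ = self.attn(q, kv, kv)
        x = text_tokens + torch.tanh(self.attn_gate) * attended
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x


# Usage: 32 text tokens cross-attending to 256 visual tokens.
text = torch.randn(2, 32, 512)
vision = torch.randn(2, 256, 512)
block = GatedCrossAttentionBlock(dim=512)
print(block(text, vision).shape)  # torch.Size([2, 32, 512])
```

Because the visual tokens enter only through cross-attention, the language model's own self-attention never has to run over hundreds of image or video patches, which is the main source of the savings described above.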

Why it matters?

This research is important because it addresses key challenges in combining visual and textual data, making it easier for AI systems to understand complex inputs. By improving efficiency and effectiveness in processing images and text together, EVLM has the potential to enhance applications in fields such as automated content creation, video analysis, and interactive AI systems.

Abstract

In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, directly feeding it into the language model alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of language models can lead to significant computational overhead. Additionally, using single-layer ViT features makes it challenging for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational costs while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing the Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
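The abstract's second ingredient, hierarchical ViT features, can also be illustrated with a short sketch. The fusion scheme below (a learned softmax weighting over a chosen set of ViT layers, followed by a projection) and all names (`HierarchicalVisualFeatures`, the layer choice, the dimensions) are assumptions for illustration; the paper may combine layers differently.

```python
# Minimal sketch (assumed, not the paper's method): fuse patch features taken
# from several ViT layers before handing them to the language model.
import torch
import torch.nn as nn


class HierarchicalVisualFeatures(nn.Module):
    """Fuses hidden states from a selected set of ViT layers into one visual sequence."""

    def __init__(self, vit_dim=1024, lm_dim=512, num_selected_layers=4):
        super().__init__()
        # One learned weight per selected layer, plus a projection to the LM width.
        self.layer_weights = nn.Parameter(torch.zeros(num_selected_layers))
        self.proj = nn.Linear(vit_dim, lm_dim)

    def forward(self, hidden_states):
        # hidden_states: list of (batch, num_patches, vit_dim), one per selected layer.
        stacked = torch.stack(hidden_states, dim=0)           # (L, B, P, D)
        weights = self.layer_weights.softmax(dim=0)           # convex combination over layers
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(0)  # (B, P, D)
        return self.proj(fused)                               # (B, P, lm_dim)


# Usage with dummy features standing in for, e.g., ViT layers {6, 12, 18, 24}.
dummy = [torch.randn(2, 256, 1024) for _ in range(4)]
fusion = HierarchicalVisualFeatures()
print(fusion(dummy).shape)  # torch.Size([2, 256, 512])
```

The point of such a fusion is that shallow layers retain fine-grained spatial detail while deep layers carry more semantic content, so mixing them gives the language model a richer visual prompt than any single layer alone.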