
Unveiling Encoder-Free Vision-Language Models

Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang

2024-07-08


Summary

This paper introduces a new type of vision-language model called EVE that does not use a traditional vision encoder to process images. Instead, it feeds visual and language inputs directly into a single decoder, which makes the model more flexible and efficient.

What's the problem?

The main problem is that existing vision-language models typically rely on vision encoders to extract features from images before a large language model handles the text. This setup can limit flexibility and efficiency because the encoders bake in assumptions about how images should be processed, such as a fixed resolution, aspect ratio, and semantic priors. At the same time, training models without these encoders has proven difficult, often leading to slow convergence and noticeably worse performance.

What's the solution?

To solve this issue, the authors developed EVE, an encoder-free vision-language model. They introduced a training recipe that lets a single decoder handle visual and language inputs together, and they strengthened the model's ability to recognize visual information by adding extra supervision during training. With these improvements, EVE can be trained efficiently on about 35 million publicly available samples and can compete with encoder-based models of similar size across a range of vision-language benchmarks.
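To make the "single decoder for both modalities" idea concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the patch size, hidden dimension, and the use of torch.nn.TransformerEncoder with a causal mask as a stand-in for the LLM decoder are all assumptions for illustration. The key point it shows is that raw image patches are turned into tokens by a simple linear projection (no pretrained vision encoder) and then processed in the same sequence as the text tokens.

```python
import torch
import torch.nn as nn


class MinimalEncoderFreeVLM(nn.Module):
    """Toy encoder-free VLM sketch: raw image patches are projected by one
    linear layer (no pretrained vision encoder) and share a single causal
    decoder with the text tokens. Sizes are illustrative, not EVE's."""

    def __init__(self, vocab_size=32000, dim=512, patch=16, layers=4, heads=8):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Linear(3 * patch * patch, dim)  # pixels -> visual tokens
        self.text_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, layers)   # causal mask below makes it decoder-only
        self.lm_head = nn.Linear(dim, vocab_size)

    def patchify(self, images):
        # (B, 3, H, W) -> (B, num_patches, 3 * patch * patch)
        B, C, H, W = images.shape
        p = self.patch
        x = images.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    def forward(self, images, text_ids):
        vis = self.patch_embed(self.patchify(images))          # visual tokens from raw pixels
        txt = self.text_embed(text_ids)                        # ordinary text tokens
        seq = torch.cat([vis, txt], dim=1)                     # one unified token sequence
        L = seq.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.decoder(seq, mask=causal)
        return self.lm_head(hidden)                            # next-token logits


if __name__ == "__main__":
    model = MinimalEncoderFreeVLM()
    logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 16)))
    print(logits.shape)  # torch.Size([2, 212, 32000]): 196 patch tokens + 16 text tokens
```

The design choice this highlights is that no frozen vision backbone sits in front of the language model, so resolution and aspect ratio are not fixed by a pretrained encoder; the projection layer and decoder learn to handle the pixels directly.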

Why it matters?

This research is important because it opens up new possibilities for building more efficient and adaptable AI models that can understand both images and text without being limited by traditional methods. By demonstrating that an encoder-free approach can work effectively, EVE paves the way for future developments in vision-language technology, which could improve applications in areas like robotics, content creation, and human-computer interaction.

Abstract

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e.g., resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept the seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. It significantly outperforms the counterpart Fuyu-8B with mysterious training procedures and undisclosed training data. We believe that EVE provides a transparent and efficient route for developing a pure decoder-only architecture across modalities. Our code and models are publicly available at: https://github.com/baaivision/EVE.
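The "extra supervision" mentioned in the abstract can be pictured as an auxiliary term added to the usual next-token objective. The sketch below aligns the decoder's visual-token hidden states with features from a frozen vision model; this is only an illustration of the general idea, and the choice of alignment target, the cosine form, and the loss weighting are assumptions rather than EVE's exact recipe.

```python
import torch
import torch.nn.functional as F


def training_loss(lm_logits, text_targets, vis_hidden, teacher_feats, align_weight=1.0):
    """Illustrative combined objective: standard language-modeling loss on text
    positions plus an auxiliary alignment loss that nudges the decoder's
    visual-token states toward features from a hypothetical frozen vision
    teacher. The weighting is arbitrary and for demonstration only."""
    # Standard autoregressive next-token loss over the text positions.
    lm_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        text_targets.reshape(-1),
        ignore_index=-100,
    )
    # Extra visual supervision: cosine alignment between the decoder's
    # visual-token hidden states and the teacher's patch features.
    align_loss = 1.0 - F.cosine_similarity(vis_hidden, teacher_feats, dim=-1).mean()
    return lm_loss + align_weight * align_loss


# Shapes only, to show how the pieces fit together (196 visual tokens, 16 text tokens).
loss = training_loss(
    lm_logits=torch.randn(2, 16, 32000),
    text_targets=torch.randint(0, 32000, (2, 16)),
    vis_hidden=torch.randn(2, 196, 512),
    teacher_feats=torch.randn(2, 196, 512),
)
print(loss.item())
```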