EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang
2025-02-11
Summary
This paper introduces EVEv2, an improved version of a system that helps computers understand both images and text together without using complex encoders. It's like teaching a computer to see and read at the same time, but in a simpler way than before.
What's the problem?
Current systems that help computers understand images and text together (called vision-language models, or VLMs) usually rely on separate, complicated parts called vision encoders. These encoders add complexity and computational overhead, making the systems harder to deploy and less efficient. Some newer systems try to work without these encoders, but so far they haven't matched the performance of the ones that use them.
What's the solution?
The researchers created EVEv2, which improves on earlier attempts to build VLMs without encoders. They found ways to make a single model handle images and text together more effectively: it processes visual and language information with separate, modality-specific components at first, then combines them carefully so the two kinds of information don't interfere with each other. They also designed a better training strategy that helps the system learn more efficiently from its data.
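To make the "separate first, combine carefully" idea concrete, here is a minimal sketch of modality-specific routing inside one shared layer. This is an illustrative assumption, not the paper's actual implementation: the weight matrices, function name, and token layout below are all made up for the example. The key point it shows is that vision tokens and text tokens get their own projection weights, while remaining in one sequence so a later shared step (like self-attention) can still mix the two modalities.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size (illustrative only)

# Hypothetical modality-specific weights: one projection per modality,
# echoing the idea of decomposing vision and language parameters
# inside a single unified model (these names are assumptions).
W_vision = rng.standard_normal((D, D)) * 0.02
W_text = rng.standard_normal((D, D)) * 0.02

def modality_routed_layer(tokens, is_vision):
    """Apply the vision projection to vision tokens and the text
    projection to text tokens. All tokens stay in one sequence, so a
    downstream shared component can still combine the modalities."""
    out = np.empty_like(tokens)
    out[is_vision] = tokens[is_vision] @ W_vision
    out[~is_vision] = tokens[~is_vision] @ W_text
    return out

# A toy sequence of 6 tokens: first 4 from image patches, last 2 from text.
tokens = rng.standard_normal((6, D))
is_vision = np.array([True, True, True, True, False, False])
mixed = modality_routed_layer(tokens, is_vision)
print(mixed.shape)  # (6, 8)
```

The design choice this illustrates: instead of a separate encoder network in front of the language model, each modality simply gets its own parameters within the shared stack, which is one way to reduce interference between vision and language during training.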
Why it matters?
This matters because it could make AI systems that work with both images and text much simpler and more efficient to use. This could lead to better and faster AI tools for things like describing images, answering questions about pictures, or even helping visually impaired people understand their surroundings. By making these systems simpler, they could also be used in more places, like on smartphones or other devices with limited computing power.
Abstract
Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities. (ii) A well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: https://github.com/baaivision/EVE.