From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu

2025-10-17

Summary

This paper investigates a new type of vision-language model, called a 'native VLM', which combines vision and language processing into a single unified system instead of bolting a separate vision encoder onto a language model. The researchers wanted to understand the challenges of building these native models and make it easier for others to work on them.

What's the problem?

Currently, building these native VLMs is difficult because it's unclear what the core building blocks should be and how best to combine vision and language information. Existing models often struggle to truly integrate these two types of data, and research in this area isn't easily accessible to everyone, which slows progress. Essentially, it's hard to know how to design these models from scratch and to share improvements efficiently.

What's the solution?

The researchers propose a set of principles for building native VLMs, focusing on creating a system where visual and textual information are understood in the same way, the strengths of both vision and language are combined seamlessly, and the model naturally understands the relationship between images and text. They then built a new family of models called NEO, using these principles, and showed that it performs as well as more complex, traditional models, even with less training data.
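The core idea of projecting both modalities into one shared semantic space and processing them as a single token sequence can be sketched in a few lines. This is a simplified illustration, not the paper's actual NEO architecture: the dimensions, projection matrices (`W_patch`, `E_word`), and the plain concatenation step are assumptions chosen for clarity; a real model would feed the joint sequence through a transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                   # shared embedding width (illustrative)
vocab = 1000             # toy vocabulary size
patch_dim = 3 * 16 * 16  # a flattened 16x16 RGB image patch

# Shared semantic space: both modalities are projected to width d.
W_patch = rng.standard_normal((patch_dim, d)) * 0.02  # pixels -> embeddings
E_word = rng.standard_normal((vocab, d)) * 0.02       # words  -> embeddings

def embed_image(patches):
    """Project flattened image patches (n_patches, patch_dim) into the shared space."""
    return patches @ W_patch                          # (n_patches, d)

def embed_text(token_ids):
    """Look up word embeddings for token ids (n_tokens,) in the same space."""
    return E_word[token_ids]                          # (n_tokens, d)

# One monolithic sequence: vision and language tokens live side by side,
# so a single dense model can encode, align, and reason over both.
patches = rng.standard_normal((4, patch_dim))
tokens = np.array([5, 42, 7])
sequence = np.concatenate([embed_image(patches), embed_text(tokens)])
print(sequence.shape)  # (7, 64)
```

Because both modalities share one embedding width, a single stack of layers can attend across image patches and words without a separate adapter module, which is the property the paper's principles aim for.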

Why it matters?

This work is important because it provides a clear path forward for developing more powerful and efficient vision-language models. By offering a simplified design and making their code publicly available, the researchers aim to encourage more people to contribute to this field, ultimately leading to faster advancements in how computers understand both images and language.

Abstract

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.