
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Changyao Tian, Hao Li, Gen Luo, Xizhou Zhu, Weijie Su, Hanming Deng, Jinguo Zhu, Jie Shao, Ziran Zhu, Yunpeng Liu, Lewei Lu, Wenhai Wang, Hongsheng Li, Jifeng Dai

2025-10-10


Summary

This paper investigates a new way to build Multimodal Large Language Models (MLLMs), AI models that can understand both images and text. It focuses on training these models from scratch, end to end, rather than connecting pre-existing image and text components.

What's the problem?

Currently, most MLLMs are built by taking a model that is good at understanding images and connecting it to a model that is good at understanding text. While this works, the image and text parts are trained separately, which makes it hard to study how the combined model behaves as it gets bigger. In particular, it is difficult to figure out how to scale up both parts together, especially when the amount of training data is limited, to get the biggest performance boost.

What's the solution?

The researchers instead trained an MLLM entirely from scratch, with the image and text understanding parts learning together end to end. They explored different design choices to find the architecture that best balances performance and training cost under a limited data budget. They found that the image and text components should be scaled up together, since their optimal sizes are positively correlated, and they built a new model called NaViL based on these findings. They then evaluated NaViL on 14 multimodal benchmarks to show it performs competitively with existing MLLMs.
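To make the difference between the two training recipes concrete, here is a minimal PyTorch sketch of "native" end-to-end training, where the vision encoder, the connector, and the language model are all optimized together. This is not NaViL's actual code: the module names, sizes, and the dummy loss are illustrative placeholders.

```python
# Minimal sketch of native (end-to-end) multimodal training, NOT NaViL's real code.
# All module sizes, names, and the placeholder loss are illustrative assumptions.
import torch
import torch.nn as nn


class TinyVisionEncoder(nn.Module):
    """Stand-in for a ViT-style image encoder trained from scratch."""
    def __init__(self, dim=256, patch_dim=3 * 16 * 16):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches):                  # patches: (B, num_patches, patch_dim)
        return self.blocks(self.patch_embed(patches))


class TinyLLM(nn.Module):
    """Stand-in for a decoder-only language model (causal masking omitted for brevity)."""
    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, vision_tokens, text_ids):
        # Vision tokens are prepended to the text embeddings and processed jointly.
        x = torch.cat([vision_tokens, self.tok_embed(text_ids)], dim=1)
        return self.lm_head(self.blocks(x))


vision = TinyVisionEncoder()
connector = nn.Linear(256, 256)                  # projects vision features into the LLM space
llm = TinyLLM()

# Native training: every parameter is trainable and optimized jointly, so both
# modalities can be scaled together under a fixed data budget.
# (A compositional recipe would instead start from a pre-trained, often frozen,
# vision encoder and tune only the connector and/or the LLM.)
params = list(vision.parameters()) + list(connector.parameters()) + list(llm.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

patches = torch.randn(2, 64, 3 * 16 * 16)        # dummy image patches
text_ids = torch.randint(0, 32000, (2, 32))      # dummy text tokens
logits = llm(connector(vision(patches)), text_ids)
loss = logits.mean()                             # placeholder for the real next-token loss
loss.backward()
optimizer.step()
```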

Why it matters?

This work is important because it provides a better understanding of how to build and scale MLLMs effectively. By training from scratch, the researchers gained insights into how the image and text parts interact and how to optimize them for maximum performance. This knowledge will help future researchers create even more powerful and capable AI models that can seamlessly process both visual and textual information.

Abstract

Compositional training has been the de-facto paradigm for existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected to pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling properties of this paradigm remain difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study their design space and scaling properties under a practical setting, i.e., data constraints. Through a careful study of various design choices in MLLMs, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of native MLLMs and reveal a positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, together with a simple and cost-effective training recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Beyond that, our findings and results provide in-depth insights for the future study of native MLLMs.
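One way to picture the "positively correlated scaling relationship" above: under a fixed data and compute budget, the best-performing vision-encoder size grows as the LLM grows, rather than staying constant. A hypothetical way to write this (the symbols and the power-law form are illustrative, not the paper's fitted law) is

$$N_V^{*} \propto N_L^{\alpha}, \qquad \alpha > 0,$$

where $N_V^{*}$ is the optimal number of vision-encoder parameters and $N_L$ is the LLM parameter count.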