Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai
2025-07-21
Summary
This paper introduces Mono-InternVL, a monolithic multimodal large language model that combines visual understanding and language processing in a single unified model to improve both performance and efficiency.
What's the problem?
The problem is that many existing multimodal models rely on a separate vision encoder bolted onto a language model, which adds complexity and cost, while training a single unified model from scratch tends to cause unstable optimization and forgetting of the language knowledge the model already has.
What's the solution?
The authors designed Mono-InternVL to embed visual experts directly into a pre-trained language model through a multimodal mixture-of-experts structure, and introduced a training method called Endogenous Visual Pre-training that teaches visual knowledge progressively and stably, which lowers training cost and speeds up inference.
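To make the idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of a modality-routed mixture-of-experts feed-forward block: text tokens pass through the original, frozen FFN of the pre-trained LLM, while image tokens pass through a newly added, trainable visual expert initialized from the same weights. All names such as MultimodalMoEFFN, d_model, and is_visual are hypothetical and chosen only for this example.

```python
import torch
import torch.nn as nn

class MultimodalMoEFFN(nn.Module):
    """Toy feed-forward block with separate text and visual experts.

    Tokens are routed by modality: text tokens use the original (frozen)
    text FFN, visual tokens use a trainable visual expert that starts as
    a copy of the text FFN.
    """

    def __init__(self, d_model: int = 64, d_ff: int = 256):
        super().__init__()
        # Original FFN from the pre-trained LLM; kept frozen to preserve
        # its language knowledge.
        self.text_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        for p in self.text_expert.parameters():
            p.requires_grad = False
        # Visual expert: same shape, initialized from the text FFN weights,
        # and left trainable so it can absorb visual knowledge.
        self.visual_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.visual_expert.load_state_dict(self.text_expert.state_dict())

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_visual: (batch, seq) boolean mask.
        out = torch.empty_like(x)
        out[~is_visual] = self.text_expert(x[~is_visual])
        out[is_visual] = self.visual_expert(x[is_visual])
        return out

if __name__ == "__main__":
    block = MultimodalMoEFFN()
    tokens = torch.randn(2, 10, 64)            # mixed sequence of 10 tokens
    visual_mask = torch.zeros(2, 10, dtype=torch.bool)
    visual_mask[:, :4] = True                  # first 4 tokens are image patches
    print(block(tokens, visual_mask).shape)    # torch.Size([2, 10, 64])
```

Freezing the text expert while training only the visual expert is what keeps visual pre-training from destabilizing or overwriting the language model, which is the core intuition behind the paper's stable, lower-cost training.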
Why it matters?
This matters because it leads to faster, cheaper, and more capable AI systems that understand both images and text, which is useful for applications such as image recognition, captioning, and natural language understanding.
Abstract
Mono-InternVL is an advanced monolithic multimodal large language model that integrates visual experts into a pre-trained LLM and employs Endogenous Visual Pre-training to enhance visual learning while reducing computational cost.