
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, Dengnian Chen, Songze Li, Yinan He, Tan Jiang

2025-04-15

Summary

This paper introduces InternVL3, a new open-source AI model that can understand and work with both images and text at the same time. Instead of starting from a text-only model and adding vision afterward, InternVL3 learns from text and image data together from the start, which helps it pick up both skills more efficiently and do better on tasks where understanding pictures and words together is important.
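To give a rough sense of what using such a model looks like, here is a hedged sketch. It assumes the weights are published on Hugging Face (earlier InternVL releases live under the OpenGVLab organization and expose a chat() method via trust_remote_code); the exact checkpoint name, image preprocessing, and call signature here are assumptions to verify against the actual model card.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name, following the naming pattern of earlier
# InternVL releases; check the OpenGVLab organization for real names.
path = "OpenGVLab/InternVL3-8B"

model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# In real use, pixel_values comes from the repo's image-loading helper
# (which handles resizing and tiling); a placeholder tensor stands in
# here just to show the call shape.
pixel_values = torch.randn(1, 3, 448, 448, dtype=torch.bfloat16).cuda()

question = "<image>\nDescribe this image."
generation_config = dict(max_new_tokens=256)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```

In earlier InternVL releases the same chat() call also accepts pixel_values=None for text-only questions, which mirrors the model's joint training on both kinds of data.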

What's the problem?

The problem is that most AI models are good at either text or images but struggle when they need to handle both together. Multimodal models are often built by taking a text-only language model and grafting on vision abilities in separate, later training stages, which makes it hard to keep the two aligned and to learn well from large, varied datasets. As a result, current models often fall short on complex tasks that require understanding information from multiple sources at once.

What's the solution?

The researchers developed InternVL3 around a "native multimodal pre-training" recipe: rather than training on text first and adding images later, the model learns from both kinds of data at the same time, in a single pre-training stage with one shared objective (sketched below). On top of that, they refined the post-training steps, supervised fine-tuning followed by preference optimization, and added test-time strategies so the model can handle bigger and more challenging tasks. Together, these changes helped InternVL3 reach new top scores among open-source models on a wide range of multimodal benchmarks.
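To make the joint-training idea concrete, here is a minimal sketch in Python/PyTorch. Everything in it, the model's forward signature, the data loaders, and the masking detail, is an illustrative assumption rather than the paper's actual code; the point is only that text-only and image-text batches flow through one model with one next-token prediction loss.

```python
import torch.nn.functional as F

def pretrain_step(model, batch, optimizer):
    # Image-text samples carry pixel_values; pure-text samples do not.
    logits = model(
        input_ids=batch["input_ids"],
        pixel_values=batch.get("pixel_values"),  # None for text-only data
    )
    # One shared causal-LM loss: predict token t+1 from tokens up to t.
    # A real implementation would also mask padding and image-placeholder
    # positions out of the labels.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["input_ids"][:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def mixed_batches(text_loader, image_text_loader):
    # Interleave the two data sources so language and multimodal skills
    # are learned in the same pre-training stage, not one after the other.
    for text_batch, mm_batch in zip(text_loader, image_text_loader):
        yield text_batch
        yield mm_batch
```

Because the two data streams share every parameter and the same objective, the model never has to be "re-aligned" after the fact, which is the core advantage the paper claims for native multimodal pre-training over retrofit-style pipelines.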

Why it matters?

This work matters because it pushes the boundaries of what open-source AI can do, making it easier for anyone to use powerful models that understand both images and text. With InternVL3, more people can build smarter apps, search engines, and creative tools that work with all kinds of information, not just one type.

Abstract

InternVL3 is an open-source multimodal model that jointly learns from multimodal data and pure-text corpora during pre-training; advanced training and test-time recipes improve its performance and scalability, setting a new state of the art among open-source multimodal models.