
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian

2025-08-26


Summary

This paper introduces InternVL 3.5, an improved version of an open-source multimodal AI model, meaning it can understand both images and text. It is designed to be more versatile, better at reasoning, and faster at inference than previous versions in the series.

What's the problem?

Existing multimodal AI models often struggle with complex reasoning tasks that require understanding the relationships between visual and textual information. They can also be slow and inefficient, making them difficult to use in real-time applications. Furthermore, open-source models often lag behind the performance of those developed by large companies.

What's the solution?

The researchers tackled these problems with two main innovations. First, they developed a 'Cascade Reinforcement Learning' (Cascade RL) method, which trains the model in two stages: an offline RL stage for stable convergence, followed by an online RL stage that refines its reasoning. Second, they created a 'Visual Resolution Router' (ViR) that dynamically adjusts how many visual tokens the model spends on each part of an image, speeding up inference without losing accuracy. They also split the vision encoder and the language model across different GPUs to balance the computational load and further improve efficiency.
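To make the routing idea concrete, here is a toy sketch of dynamic resolution routing. This is not the paper's implementation: InternVL 3.5's router is a learned network, whereas here pixel variance stands in for the learned detail score, and the function and token budgets are hypothetical.

```python
import numpy as np

def route_resolution(patches, threshold=0.5, high_tokens=256, low_tokens=64):
    """Toy resolution router: give detailed patches a full visual-token
    budget and flat patches a compressed one. Pixel variance is only a
    stand-in for the learned routing score used in the actual model."""
    budgets = []
    for patch in patches:
        score = float(np.var(patch))  # proxy for a learned "detail" score
        budgets.append(high_tokens if score > threshold else low_tokens)
    return budgets

# A uniform patch gets the compressed budget; a textured patch the full one.
flat = np.zeros((16, 16))
busy = np.random.default_rng(0).normal(size=(16, 16))
print(route_resolution([flat, busy]))  # → [64, 256]
```

The point of the sketch is the efficiency trade-off: regions that carry little visual information are represented with fewer tokens, so the language model processes a shorter sequence without discarding detail where it matters.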

Why it matters?

InternVL 3.5 represents a significant step forward in open-source multimodal AI. It delivers up to a 16% gain in overall reasoning performance and roughly a 4× inference speedup over its predecessor, and starts to close the gap with powerful commercial models like GPT-5. The model's new capabilities, such as interacting with computer interfaces (GUIs) and acting as an embodied agent, open up possibilities for more advanced and practical AI applications, and because the models and code are open-source, other researchers and developers can build upon this work.

Abstract

We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
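The DvD idea of placing the vision encoder and the language model on separate devices can be sketched as a two-stage producer/consumer pipeline: while the language stage decodes one sample, the vision stage is already encoding the next. Everything below (worker names, the string placeholders for features) is illustrative, not the released code.

```python
import queue
import threading

def vision_worker(images, out_q):
    """Stage 1 (e.g. GPU 0): encode each image into visual tokens."""
    for img in images:
        out_q.put(f"tokens({img})")  # stand-in for vision-encoder features
    out_q.put(None)  # sentinel: no more work

def language_worker(in_q, results):
    """Stage 2 (e.g. GPU 1): consume visual tokens and decode answers."""
    while (tokens := in_q.get()) is not None:
        results.append(f"answer_from({tokens})")

# A bounded queue models the transfer buffer between the two devices.
q, results = queue.Queue(maxsize=2), []
t1 = threading.Thread(target=vision_worker, args=(["img0", "img1"], q))
t2 = threading.Thread(target=language_worker, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # → ['answer_from(tokens(img0))', 'answer_from(tokens(img1))']
```

Because the two stages run concurrently rather than sequentially on one device, the heavy vision encoding no longer blocks language decoding, which is the load-balancing effect the DvD strategy is after.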