
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu

2026-03-11


Summary

This paper introduces InternVL-U, a new type of artificial intelligence model that can both understand information from different sources, like text and images, and then create new content based on that understanding.

What's the problem?

Typically, AI models that are good at understanding content aren't also great at *creating* it, and vice versa. Building a single model that excels at both is difficult because of an inherent trade-off: improving one capability often hurts the other. Existing models that manage both are also often very large and require substantial computing power, putting them out of reach for many researchers.

What's the solution?

The researchers created InternVL-U, a relatively small model (4 billion parameters) designed to excel at both understanding and generation. They achieved this by modeling information from different sources in one shared context while building dedicated components for visual generation, keeping the visual representations used for understanding separate from those used for generation. They also developed a new way to train the model on reasoning-heavy tasks, such as understanding scientific concepts and rendering text into images, which helps it follow instructions more faithfully and produce detailed visuals. This training uses a 'Chain-of-Thought' approach, meaning the model breaks a complex task down into smaller, more manageable steps before generating.
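The pipeline described above can be sketched very roughly: an understanding module turns the user's request into a step-by-step plan, and a separate generation module is conditioned on that plan rather than on the raw prompt. This is only an illustrative toy, assuming a two-module split; the class and function names below are hypothetical and not the paper's actual API.

```python
# Hypothetical sketch of a unified multimodal pipeline in the spirit of
# InternVL-U: an understanding module (standing in for the MLLM) produces
# a Chain-of-Thought plan, which then conditions a separate visual
# generation head. All names here are illustrative, not the real API.

from dataclasses import dataclass


@dataclass
class CoTPlan:
    steps: list[str]  # intermediate reasoning steps


def understand(prompt: str) -> CoTPlan:
    # Stand-in for the understanding module: decompose the user's
    # request into smaller reasoning steps (Chain-of-Thought).
    return CoTPlan([
        f"Parse intent: {prompt}",
        "Identify visual elements and layout",
        "Resolve fine-grained details (text to render, style)",
    ])


def generate_image(plan: CoTPlan) -> str:
    # Stand-in for the generation head: it consumes the plan, so
    # abstract user intent has already been aligned with concrete
    # visual details before generation begins.
    return "image conditioned on: " + " -> ".join(plan.steps)


plan = understand("A poster that says 'Open Science' in bold letters")
print(generate_image(plan))
```

The point of the split is that the generation head never has to infer intent on its own; the reasoning steps arrive as explicit conditioning.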

Why it matters?

InternVL-U is important because it shows you can build a powerful, all-in-one AI model without needing a massive amount of computing resources. It actually performs *better* than much larger models on many tasks, making advanced AI capabilities more accessible to a wider range of people and potentially speeding up progress in the field.

Abstract

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales, such as BAGEL (14B), on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.