Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu
2026-01-28
Summary
This paper introduces a new way to train Vision-Language Models (VLMs), which are AI systems that can understand both images and text. The researchers observe that current VLMs are trained in a way that favors text over images, and their new method aims to correct that imbalance.
What's the problem?
Existing VLMs, while generally good at understanding images and text together, struggle to truly *see* fine-grained detail in images. During training, only the text output is supervised, so the image is treated as passive context for generating text rather than as content that is equally important to model. The result is a coarser, less complete understanding of what is actually happening in an image.
What's the solution?
The researchers developed a framework called Youtu-VL that changes the training objective itself. Instead of using the image only as an input that helps the model predict text, Youtu-VL also makes the visual tokens prediction targets, so visual content and language are supervised with one unified autoregressive objective. This 'vision-as-target' approach forces the model to pay closer attention to visual details rather than skim past them. The same paradigm also lets a standard VLM handle vision-centric tasks, ones that focus *only* on images, without bolting on task-specific components; a minimal sketch of the objective is given below.
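The paper's code is not shown here, but the core idea can be illustrated as a change in which positions of the token stream receive loss. The snippet below is a minimal sketch, not the authors' implementation: the function name unified_ar_loss, the masking scheme, and the assumption that images are discretized into visual token ids sharing a prediction vocabulary with text are ours.

```python
# Sketch of "vision-as-target" supervision, assuming the image has already been
# mapped to discrete visual token ids interleaved with text token ids. The
# paper's exact tokenization, vocabulary layout, and loss weighting are not
# specified here; this only contrasts text-only vs. unified supervision.
import torch
import torch.nn.functional as F

IGNORE = -100  # positions excluded from the cross-entropy loss


def unified_ar_loss(logits, token_ids, is_visual, supervise_vision=True):
    """Next-token cross-entropy over an interleaved text/visual sequence.

    logits:    (B, T, V) model outputs over a shared text+visual vocabulary
    token_ids: (B, T)    ground-truth ids (text and visual tokens interleaved)
    is_visual: (B, T)    True where the token is a visual token

    supervise_vision=False reproduces the usual text-only objective
    ("vision-as-input"); True also treats visual tokens as prediction
    targets ("vision-as-target").
    """
    targets = token_ids[:, 1:].clone()      # predict token t+1 from the prefix
    pred = logits[:, :-1, :]
    if not supervise_vision:
        targets[is_visual[:, 1:]] = IGNORE  # drop visual positions from the loss
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)), targets.reshape(-1), ignore_index=IGNORE
    )


# Toy usage: one sequence of 4 text tokens followed by 4 visual tokens.
B, T, V = 1, 8, 1000
logits = torch.randn(B, T, V)
ids = torch.randint(0, V, (B, T))
vis = torch.tensor([[False] * 4 + [True] * 4])
print(unified_ar_loss(logits, ids, vis, supervise_vision=False))  # text-only
print(unified_ar_loss(logits, ids, vis, supervise_vision=True))   # unified
```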
Why it matters?
This work is important because it improves the ability of AI to truly understand images, not just recognize objects in them. By creating a more balanced system that values both visual and textual information, it paves the way for more capable and versatile AI agents that can handle a wider range of tasks involving both images and language.
Abstract
Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
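Read as a training objective, the shift the abstract describes can be written roughly as follows; the notation is ours, not the paper's, and it assumes an interleaved sequence of text and visual tokens z_1, ..., z_T.

```latex
% Text-only supervision ("vision-as-input"):
\mathcal{L}_{\text{text}} = -\sum_{t \in \mathcal{T}_{\text{text}}} \log p_\theta(z_t \mid z_{<t})
% Unified supervision ("vision-as-target", VLUAS-style):
\mathcal{L}_{\text{unified}} = -\sum_{t \in \mathcal{T}_{\text{text}} \cup \mathcal{T}_{\text{vis}}} \log p_\theta(z_t \mid z_{<t})
```

Here \mathcal{T}_{\text{text}} and \mathcal{T}_{\text{vis}} index the text and visual positions: the conventional objective sums the loss only over text positions, while the unified objective also supervises the visual positions.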