Visual Representation Alignment for Multimodal Large Language Models

Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim

2025-09-10

Summary

This paper focuses on improving how well large AI models that can both 'see' images and 'understand' language perform on tasks that truly require visual reasoning, such as counting objects or understanding how things are positioned in a picture.

What's the problem?

Current AI models that combine vision and language are quite good overall, but they struggle with tasks that depend heavily on detailed visual understanding. The researchers argue this is because these models are trained mostly with text instructions, which never directly teaches them to pay attention to and retain important visual details. Essentially, the visual side of the model receives only indirect guidance during training, so fine-grained visual information tends to get discarded.

What's the solution?

The researchers developed a technique called VIRAL, short for VIsual Representation ALignment. It works by making the way the AI 'sees' things internally match how well-established, pre-trained vision models (vision foundation models, or VFMs) 'see' the same image. Think of it as giving the AI a reference point for what matters in an image. This helps the model hold onto crucial visual details and also pick up extra visual knowledge from the existing vision models, making it better at complex visual reasoning.
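
At a high level, this kind of alignment can be implemented as an extra regularization term added to the usual instruction-tuning objective. The sketch below is a minimal, hypothetical illustration in PyTorch: it assumes the MLLM's hidden states at the visual-token positions are projected into the feature space of a frozen vision foundation model and pulled toward that model's patch features with a cosine-similarity loss. The projector, the choice of layer, the loss form, and the weighting coefficient are all assumptions for illustration; the paper's exact design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAlignmentLoss(nn.Module):
    """Hypothetical VIRAL-style regularizer: pull the MLLM's visual-token
    features toward frozen vision-foundation-model (VFM) patch features."""

    def __init__(self, llm_dim: int, vfm_dim: int):
        super().__init__()
        # Lightweight projector from the LLM hidden size to the VFM feature size.
        self.proj = nn.Linear(llm_dim, vfm_dim)

    def forward(self, llm_hidden: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (B, N, llm_dim) hidden states at the visual-token positions
        #             of an intermediate MLLM layer.
        # vfm_feats:  (B, N, vfm_dim) patch features from the frozen VFM, with the
        #             token count matched to N (e.g. by pooling or interpolation).
        pred = F.normalize(self.proj(llm_hidden), dim=-1)
        target = F.normalize(vfm_feats.detach(), dim=-1)  # no gradient into the VFM
        # Per-token cosine distance, averaged over tokens and batch.
        return (1.0 - (pred * target).sum(dim=-1)).mean()

# Usage sketch: combine with the usual language-modeling loss during visual
# instruction tuning (lambda_align is a hypothetical weighting hyperparameter):
#   total_loss = lm_loss + lambda_align * align_loss(visual_hidden, vfm_patch_feats)
```

The rest of the training pipeline stays unchanged; the alignment term simply gives the visual pathway a direct training signal alongside the text-only supervision.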

Why it matters?

This research is important because it shows a simple way to significantly improve the visual abilities of these powerful AI models. It suggests that directly focusing on aligning visual understanding, rather than just relying on text instructions, is key to building AI that can truly 'see' and reason about the world around it, opening doors for more advanced applications.

Abstract

Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.