
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu, Yi Xin, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Xiaoxing Hu, Xinlei Yu, Ziyue Qiao, Hao Tang, Xue Yang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan

2026-02-10


Summary

This paper tackles a problem in how computers understand images and text together: how to represent the meaning of both so that matching content ends up with matching representations. Current methods struggle with a persistent 'gap' between how images and text are represented, even when they describe the same thing.

What's the problem?

When computers process images and text, they create numerical representations of their meaning. Ideally, if an image and a text description have the same meaning, their representations should be close together. In practice, there is a consistent offset, known as the 'modality gap': the image and text representations sit in systematically shifted regions of the shared space, even when the content is the same. Previous attempts to close this gap usually assume the offset is a simple, uniform shift that is the same in every direction, which doesn't hold up at large scale or in complex scenarios.
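
As a rough illustration, the gap can be summarized by comparing the centroids of image and text embeddings from any CLIP-style encoder. The snippet below is a minimal sketch, not code from the paper; the file names and variable names are placeholders, and the embeddings are assumed to be precomputed and L2-normalized.

```python
import numpy as np

# Assumed inputs: precomputed, L2-normalized embeddings from a CLIP-style
# encoder, shaped (num_samples, dim). The file names are placeholders.
image_embeddings = np.load("image_embeddings.npy")
text_embeddings = np.load("text_embeddings.npy")

# Centroid of each modality's embedding cloud.
image_centroid = image_embeddings.mean(axis=0)
text_centroid = text_embeddings.mean(axis=0)

# The modality gap is often summarized as the vector between the two
# centroids: a large, consistent offset means the modalities occupy
# systematically shifted regions of the shared space.
gap_vector = image_centroid - text_centroid
print("gap magnitude:", np.linalg.norm(gap_vector))

# Cosine similarity between centroids; values well below 1.0 indicate the
# clouds are offset even when the samples describe the same content.
cosine = image_centroid @ text_centroid / (
    np.linalg.norm(image_centroid) * np.linalg.norm(text_centroid)
)
print("centroid cosine similarity:", cosine)
```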

What's the solution?

The researchers show that the 'modality gap' isn't random; it has a specific, measurable shape. They developed ReAlign, a method that requires no training: it uses statistics gathered from large collections of unpaired images and text to shift the text representations into the distribution of the image representations. Building on this, they created ReVision, a training recipe for multimodal large language models. During pretraining, ReAlign-adjusted text stands in for visual input, so the model learns what visual representations look like from unpaired text before the usual visual instruction tuning, without needing huge sets of carefully matched image-text pairs.
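
To make the idea of a training-free, statistics-based shift concrete, here is a minimal sketch of one generic approach: move text embeddings toward the image distribution by matching their mean and covariance. This is not the paper's ReAlign procedure (which uses an Anchor, Trace, and Centroid Alignment pipeline, per the abstract); it only illustrates the general recipe of aligning two embedding distributions using statistics computed from unpaired data.

```python
import numpy as np

def match_statistics(text_emb: np.ndarray, image_emb: np.ndarray,
                     eps: float = 1e-5) -> np.ndarray:
    """Shift text embeddings toward the image embedding distribution by
    matching mean and covariance (whiten with text statistics, then
    recolor with image statistics).

    Generic illustration of statistics-based alignment on unpaired data,
    not the ReAlign algorithm from the paper.
    """
    mu_t, mu_i = text_emb.mean(0), image_emb.mean(0)

    # Covariances of each modality's (unpaired) embedding cloud.
    dim = text_emb.shape[1]
    cov_t = np.cov(text_emb, rowvar=False) + eps * np.eye(dim)
    cov_i = np.cov(image_emb, rowvar=False) + eps * np.eye(dim)

    # Matrix square roots via eigendecomposition (covariances are symmetric PSD).
    def sqrt_and_inv_sqrt(cov):
        vals, vecs = np.linalg.eigh(cov)
        vals = np.clip(vals, eps, None)
        sqrt = (vecs * np.sqrt(vals)) @ vecs.T
        inv_sqrt = (vecs / np.sqrt(vals)) @ vecs.T
        return sqrt, inv_sqrt

    sqrt_i, _ = sqrt_and_inv_sqrt(cov_i)
    _, inv_sqrt_t = sqrt_and_inv_sqrt(cov_t)

    # Whiten the text embeddings, recolor with image statistics, and
    # recenter on the image centroid.
    return (text_emb - mu_t) @ inv_sqrt_t @ sqrt_i + mu_i
```

A plain centroid shift (`text_emb - mu_t + mu_i`) is the simplest, direction-independent version of this; also matching covariance is one generic way to account for direction-dependent structure in the gap, which is the kind of structure the paper argues isotropic corrections miss.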

Why it matters?

This work is important because it offers a more efficient way to build powerful AI systems that can understand both images and text. By cleverly using existing, unpaired data, it reduces the need for expensive and time-consuming collections of perfectly matched image-text pairs, making it easier to scale up these models and improve their performance.

Abstract

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
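
The abstract's claim that the gap splits into a stable bias plus anisotropic residuals can be probed with a simple diagnostic, sketched below. This is only a back-of-the-envelope check, not the paper's Fixed-frame Modality Gap Theory, and it assumes paired image-text embeddings are available for analysis; the file names are placeholders.

```python
import numpy as np

# Assumed inputs: embeddings for N matched image-text pairs from a
# CLIP-style encoder, shape (N, dim). Placeholder file names.
image_emb = np.load("paired_image_embeddings.npy")
text_emb = np.load("paired_text_embeddings.npy")

# Per-pair gap vectors, decomposed into a shared bias (the mean gap)
# plus per-pair residuals.
gaps = image_emb - text_emb
bias = gaps.mean(axis=0)        # stable, modality-level offset
residuals = gaps - bias         # what remains after removing the bias

# Anisotropy check: if the residual covariance spectrum is far from flat,
# the leftover gap is direction-dependent rather than isotropic noise,
# which is what motivates going beyond uniform-shift corrections.
eigvals = np.linalg.eigvalsh(np.cov(residuals, rowvar=False))
eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negatives

print("bias norm:", np.linalg.norm(bias))
print("top-5 / bottom-5 eigenvalue ratio:",
      eigvals[-5:].sum() / max(eigvals[:5].sum(), 1e-12))
```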