
Scaling Language-Centric Omnimodal Representation Learning

Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong

2025-10-15


Summary

This research investigates why newer methods that combine images and text using powerful AI models, specifically multimodal large language models fine-tuned with a technique called contrastive learning, work so well. Much of the success turns out to come from how these models are initially trained to *generate* text from images and text together, which naturally aligns how they represent different types of information.

What's the problem?

While these combined image and text AI models are performing better, it wasn't clear *why* they were better. Previous work just showed *that* they worked, but didn't explain the underlying reason for their improvement. Researchers wanted to understand what made these models so effective at connecting information from different sources, like images and text.

What's the solution?

The researchers found that the initial training process, where the model learns to generate text based on both images and text, creates a natural connection between how the model represents these different types of data. They then built a new framework called LCO-Emb that takes advantage of this pre-existing connection, using contrastive learning as a final polishing step rather than the main alignment mechanism. They also discovered a pattern: the better the model is at generating text, the better its internal representations of images and text become after refinement, and they provide a mathematical explanation for why this holds.
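To make the "polishing step" concrete, here is a minimal sketch of a symmetric InfoNCE contrastive objective over paired image and text embeddings. This is a generic, hypothetical simplification for illustration, not the authors' exact training recipe; the batch size, embedding dimension, and temperature are made up.

```python
import numpy as np

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Matching image/text pairs sit on the diagonal of the similarity
    matrix; the loss pushes them together and all other pairs apart.
    """
    # L2-normalize so dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(logits))           # true pairs on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
loss_random = info_nce(img, rng.normal(size=(4, 8)))  # unrelated pairs
loss_aligned = info_nce(img, img)                     # perfectly aligned pairs
```

Already-aligned pairs yield a much lower loss than random pairs, which is the point of the paper's framing: when generative pretraining has pre-aligned the modalities, this objective has little work left to do.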

Why it matters?

This work is important because it provides a fundamental understanding of why these new AI models are so good at combining information. By understanding this, we can build even better models in the future. The discovery of the 'Generation-Representation Scaling Law' suggests that focusing on improving a model’s ability to *create* content is a powerful way to also improve its ability to *understand* and compare different types of data, which has broad implications for many AI applications.

Abstract

Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities is an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
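The abstract mentions an anisotropy analysis of the representation space. A common proxy for anisotropy, sketched below under that assumption (this is a standard diagnostic, not necessarily the paper's exact measurement), is the mean cosine similarity between all distinct embedding pairs: values near 1 mean the embeddings crowd into a narrow cone, while values near 0 mean they spread across the space.

```python
import numpy as np

def mean_pairwise_cosine(emb):
    """Average cosine similarity over all distinct pairs of rows.

    High values indicate an anisotropic space (embeddings share a
    dominant direction); values near zero indicate isotropy.
    """
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    off_diagonal = sims[~np.eye(n, dtype=bool)]  # drop self-similarities
    return off_diagonal.mean()

rng = np.random.default_rng(1)
isotropic = rng.normal(size=(100, 64))   # directions spread out uniformly
anisotropic = isotropic + 5.0            # shared offset squeezes a narrow cone
low = mean_pairwise_cosine(isotropic)
high = mean_pairwise_cosine(anisotropic)
```

Adding a constant offset to every vector gives them a shared dominant direction, so the anisotropic batch scores far higher on this diagnostic than the isotropic one.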