MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings

Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou

2025-07-02

Summary

This paper introduces MoCa, a two-stage training method that improves vision-language models so they produce better embeddings, which are numerical representations that let computers understand and connect images and text in a balanced, two-way fashion.

What's the problem?

The problem is that current vision-language models often use one-way (causal) attention, which limits how well they can capture the relationship between images and text. They also rely too heavily on limited labeled image-text data and lack diversity in their training objectives, which hurts their performance and generalization.

What's the solution?

The researchers developed MoCa, which first uses modality-aware continual pre-training, teaching the model to reconstruct and denoise interleaved images and text so it learns two-way (bidirectional) attention across the two modalities. Then they apply heterogeneous contrastive fine-tuning on diverse multimodal data, as sketched below, to make the embeddings stronger and more versatile.
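To make the second stage more concrete, here is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss of the kind typically used for contrastive fine-tuning of embedding models. The function name, temperature value, and PyTorch usage are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, target_emb, temperature=0.05):
    # Normalize embeddings so the dot product is a cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    # (batch, batch) similarity matrix: each query's positive target sits on
    # the diagonal, and the other targets in the batch act as negatives.
    logits = q @ t.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage: 8 query/target pairs with 512-dimensional embeddings,
# e.g. a text query paired with an image (or image+text) document.
loss = in_batch_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

In this setup, queries and targets can mix modalities (text, images, or both), which is what makes the fine-tuning data "heterogeneous" rather than plain image-caption pairs.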

Why it matters?

This matters because better bidirectional multimodal embeddings help AI systems understand and connect images and text more effectively, improving performance in a wide range of tasks like search, recommendation, and content understanding.

Abstract

MoCa is a two-stage framework that turns pre-trained VLMs into effective bidirectional multimodal embedding models, addressing the limitations of existing methods through modality-aware continual pre-training and heterogeneous contrastive fine-tuning.