PairUni: Pairwise Training for Unified Multimodal Language Models

Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang

2025-10-30

Summary

This paper introduces a new method, called PairUni, for training advanced AI models that can both understand images and generate text based on them. These models, known as unified vision-language models, are tricky to train because the skills needed for understanding and generating are quite different.

What's the problem?

Unified vision-language models need to be good at two very different things: figuring out what’s in an image (understanding) and creating descriptions or answering questions about it (generation). Training these models using reinforcement learning is hard because it’s difficult to balance learning both skills at the same time. The data used for each skill is also different, making it hard for the model to connect the two.

What's the solution?

The researchers tackled this by creating pairs of related data. They used a powerful language model to write captions for understanding samples and question-answer pairs for generation samples, so each instance carries both kinds of supervision. They also retrieved semantically similar understanding examples for generation samples, linking data points that weren't originally paired. This pairing helps the model see how understanding and generation relate to each other. They then developed a new training method, Pair-GPRO, which scores how well each pair is aligned and uses that score to weight the training signal, so the model learns more from well-matched pairs and suffers less interference between the two tasks. They also curated a dataset of 16,000 of these paired examples, called PairUG, for training.
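The core idea of Pair-GPRO can be sketched as follows: compute the usual group-relative advantages (as in GRPO) and then scale them by the pair's similarity score, so well-aligned pairs contribute more to the policy update. This is a minimal illustration; the function names, the exact normalization, and the multiplicative weighting rule are assumptions for clarity, not the paper's actual implementation.

```python
# Hypothetical sketch of pair-aware advantage weighting.
# Assumptions: GRPO-style per-group z-score normalization, and a
# similarity score in [0, 1] that multiplicatively scales the advantage.

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each reward within its group
    of sampled responses (subtract the group mean, divide by std)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0:
        # All responses scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def pair_gpro_advantages(rewards, pair_similarity):
    """Scale group-relative advantages by the pair's similarity score,
    so well-aligned understanding-generation pairs drive the update
    more strongly and poorly aligned ones are down-weighted."""
    return [a * pair_similarity for a in group_relative_advantages(rewards)]
```

For example, a group of rollout rewards `[1.0, 0.0, 1.0, 0.0]` from a pair with similarity 0.5 would yield half-strength advantages compared with a perfectly aligned pair.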

Why it matters?

This work is important because it improves the performance of these unified vision-language models, making them better at both understanding and generating content. This means AI systems could become more capable of tasks like describing images accurately, answering complex questions about visuals, and generally interacting with the world in a more human-like way. It provides a more balanced and effective way to train these complex AI systems.

Abstract

Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: https://github.com/Haochen-Wang409/PairUni