MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
Xin Jin, Siyuan Li, Siyong Jian, Kai Yu, Huan Wang
2025-10-28
Summary
This paper introduces a new method called MergeMix that improves how well vision and language are aligned in large AI models that process both, known as multi-modal large language models (MLLMs).
What's the problem?
Currently, getting these models to accurately connect what they 'see' in images with what they 'say' in language is tricky. The two main approaches, supervised fine-tuning and reinforcement learning, both have drawbacks. Supervised fine-tuning needs a lot of labeled data created by people, and it struggles with understanding subtle preferences. Reinforcement learning is unstable and computationally expensive. Essentially, there's a trade-off between how well the model performs, how easily it can be scaled up, and how reliable it is.
What's the solution?
MergeMix tries to get the best of both worlds. It works by combining pairs of images during training, using 'attention' to decide which parts of each image matter most. It then builds preference pairs, each containing one original image and one mixed image, and trains the model to prefer the mixed images using a training objective called SimPO loss. Because the mixing follows the model's own attention, the model learns to focus on consistent, important features, making it both more efficient and more accurate.
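The attention-guided mixing step can be illustrated with a minimal sketch. This is not the paper's actual implementation (which merges attention tokens with cluster and spatial context); it is a simplified per-pixel version, where an attention map decides how strongly each region of one image shows through in the blend. The function name and parameters are illustrative assumptions.

```python
import numpy as np

def attention_mixup(img_a, img_b, attn_a, alpha=0.5):
    """Blend two images, weighting img_a's salient regions more heavily.

    Simplified sketch (not the paper's token-merge procedure):
    img_a, img_b : (H, W, C) float arrays in [0, 1]
    attn_a       : (H, W) non-negative attention/saliency map for img_a
    alpha        : base mixing ratio before attention re-weighting
    """
    # Normalize the attention map to [0, 1] so it acts as a per-pixel weight.
    attn = attn_a / (attn_a.max() + 1e-8)
    # Per-pixel mixing mask: salient pixels lean toward img_a,
    # background pixels lean toward img_b.
    mask = alpha * attn[..., None]
    return mask * img_a + (1.0 - mask) * img_b
```

Each output pixel is a convex combination of the two inputs, so the mixed image stays in the valid value range while emphasizing the regions the attention map marks as important.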
Why it matters?
This research is important because it offers a more practical way to align images and language in these powerful AI models. MergeMix is more scalable and efficient than existing methods, meaning it can be used to train larger and more capable models without requiring massive amounts of human-labeled data or facing instability issues. This advancement helps improve the accuracy and reliability of MLLMs in tasks like image classification and understanding complex visual scenes.
Abstract
Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL brings in a reward signal for training, but suffers from overhead and instability. These limitations highlight a trade-off between scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies an attention-aware image mixing via token merge with more cluster representation and spatial context, and then presents a preference-driven training paradigm for MLLMs by building preference pairs with mixed images and raw images, and optimizing via SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
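The SimPO objective mentioned in the abstract can be sketched as follows. SimPO (Simple Preference Optimization) scores the preferred and dispreferred responses by their length-normalized log-likelihoods and penalizes the model when the preferred response does not beat the dispreferred one by a target margin. The variable names and default hyperparameters below are illustrative, not taken from the paper.

```python
import math

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO: length-normalized log-likelihood margin loss (sketch).

    logp_w / logp_l : summed token log-probabilities of the preferred
                      (winner) and dispreferred (loser) responses
    len_w / len_l   : response lengths in tokens, used for normalization
    beta            : reward scaling factor
    gamma           : target reward margin
    """
    reward_w = beta * logp_w / len_w   # length-normalized reward, winner
    reward_l = beta * logp_l / len_l   # length-normalized reward, loser
    margin = reward_w - reward_l - gamma
    # -log sigmoid(margin): near zero when the winner clearly beats
    # the loser by more than gamma, large when it does not.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In MergeMix's setting, the "responses" being compared would correspond to the model's outputs under the mixed versus raw images; the loss drives the model toward the preferred member of each pair without needing a separate reward model.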