Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
2024-10-10

Summary
This paper introduces the Modality Integration Rate (MIR), a new metric designed to evaluate the quality of pre-training in Large Vision-Language Models (LVLMs), which combine visual and textual information.
What's the problem?
Evaluating how well LVLMs are pre-trained is challenging because the metrics carried over from language-model pre-training, such as loss and perplexity, do not reliably reflect how well a model aligns its language backbone with a new visual modality. Without a dependable signal, it is hard for researchers to compare pre-training runs or to choose the right training data and strategies.
What's the solution?
The authors propose MIR as a better measure of pre-training quality. Rather than tracking loss or perplexity, MIR looks at the distribution distance between the model's internal representations of visual tokens and text tokens: the closer the visual features sit to the language model's text distribution, the better the modalities are integrated. MIR is effective, robust to the choice of training and evaluation data, and generalizes across different training setups. The authors' experiments show that MIR can guide the selection of training data, training strategies, and architecture designs, leading to improved benchmark performance after supervised fine-tuning. A rough sketch of the underlying idea appears below.
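The following is a minimal, hypothetical sketch of the "inter-modal distribution distance" idea, not the authors' exact MIR formulation (see their repository for that). It assumes we have collected per-layer hidden states for vision tokens and text tokens from an LVLM and measures a Fréchet-style distance between the two token populations at each layer.

```python
# Hypothetical sketch of an inter-modal distribution distance, illustrating
# the idea behind MIR. This is NOT the authors' exact metric; the real
# implementation is at https://github.com/shikiw/Modality-Integration-Rate.
import numpy as np
from scipy import linalg


def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Frechet distance between two feature populations of shape [N, D] and [M, D]."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_x @ cov_y, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))


def modality_gap(vision_feats_per_layer, text_feats_per_layer) -> float:
    """Average per-layer distance between vision-token and text-token features.

    Both arguments are lists of [num_tokens, hidden_dim] arrays, one entry per
    transformer layer, e.g. gathered from hidden states while running
    image-text pairs through the LVLM.
    """
    distances = []
    for v, t in zip(vision_feats_per_layer, text_feats_per_layer):
        # Normalize with text-token statistics so the distance reflects the
        # shape of the distributions rather than raw activation magnitude.
        mu, sigma = t.mean(axis=0), t.std(axis=0) + 1e-6
        v_n = (v - mu) / sigma
        t_n = (t - mu) / sigma
        distances.append(frechet_distance(v_n, t_n))
    return float(np.mean(distances))
```

Under this reading, a lower value would mean the visual tokens have drifted closer to the language model's text distribution, which is the direction the paper associates with better-aligned pre-training.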
Why it matters?
This research matters because it gives practitioners a practical tool for evaluating and improving LVLMs without running the costly supervised fine-tuning stage after every pre-training experiment. By using MIR, researchers can make better-informed decisions about training data, schedules, and architectures, ultimately leading to more capable systems for tasks that require understanding both images and text.
Abstract
We present the Modality Integration Rate (MIR), an effective, robust, and generalizable metric for indicating the multi-modal pre-training quality of Large Vision-Language Models (LVLMs). Large-scale pre-training plays a critical role in building capable LVLMs, yet evaluating its quality without the costly supervised fine-tuning stage remains under-explored. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), but we observe that these metrics are less indicative when aligning a well-trained LLM with a new modality. Due to the lack of proper metrics, research on LVLMs at the critical pre-training stage is greatly hindered, including the choice of training data, efficient module design, etc. In this paper, we propose evaluating pre-training quality from the inter-modal distribution distance perspective and present MIR, the Modality Integration Rate, which is 1) effective at representing pre-training quality, showing a positive correlation with benchmark performance after supervised fine-tuning; 2) robust toward different training/evaluation data; and 3) generalizable across training configurations and architecture choices. We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe that MIR is indicative of training data selection, training strategy scheduling, and model architecture design for better pre-training results. We hope MIR can be a helpful metric for building capable LVLMs and inspire subsequent research on modality alignment in different areas. Our code is at: https://github.com/shikiw/Modality-Integration-Rate.