Law of Vision Representation in MLLMs
Shijia Yang, Bohan Zhai, Quanzeng You, Jianbo Yuan, Hongxia Yang, Chenfeng Xu
2024-08-30

Summary
This paper introduces the 'Law of Vision Representation' in multimodal large language models (MLLMs), showing that the degree of cross-modal alignment and correspondence in a model's vision representation strongly predicts its performance.
What's the problem?
In multimodal large language models, the visual and language components must work well together, and this depends heavily on the choice of vision representation. Poor cross-modal alignment or weak correspondence leads to lower performance on tasks that require understanding both text and images, but finding the best representation is expensive: each candidate normally requires finetuning the language model again.
What's the solution?
The authors introduce the AC score, a metric that quantifies two properties of a vision representation: its cross-modal alignment with language and its correspondence. Across thirteen vision representation settings evaluated on eight benchmarks, they find that the AC score is linearly correlated with model performance. This relationship lets them predict which representation will perform best and train only that one, instead of finetuning the language model for every candidate, reducing computational cost by 99.7%.
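To make the idea concrete, here is a minimal, hypothetical sketch of such a score in Python. It assumes alignment can be proxied by the mean cosine similarity between a candidate's patch features and text-aligned reference features (e.g., from CLIP), and that a correspondence accuracy in [0, 1] is precomputed elsewhere; the function name ac_score, the equal weighting, and these proxies are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ac_score(vision_feats, clip_feats, correspondence_acc, weights=(0.5, 0.5)):
    """Toy AC score: a weighted sum of an alignment proxy and a
    correspondence proxy. Hypothetical, not the paper's exact formula."""
    # L2-normalize per-patch features so dot products become cosine similarities.
    v = vision_feats / np.linalg.norm(vision_feats, axis=-1, keepdims=True)
    c = clip_feats / np.linalg.norm(clip_feats, axis=-1, keepdims=True)
    # Alignment proxy: mean cosine similarity between the candidate's
    # patch features and text-aligned (e.g., CLIP) features for the same patches.
    alignment = float((v * c).sum(axis=-1).mean())
    # Correspondence proxy: e.g., a keypoint-matching accuracy in [0, 1],
    # assumed to be precomputed.
    w_a, w_c = weights
    return w_a * alignment + w_c * correspondence_acc

# Example with random stand-ins: 196 patch embeddings of dimension 512.
rng = np.random.default_rng(0)
feats = rng.standard_normal((196, 512))
ref = feats + 0.1 * rng.standard_normal((196, 512))  # nearly aligned reference
print(f"AC score: {ac_score(feats, ref, correspondence_acc=0.8):.3f}")
```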
Why it matters?
Understanding this relationship is important because it can lead to more efficient and effective MLLMs. Improving how these models jointly process visual and textual information can enhance applications in areas like artificial intelligence, robotics, and interactive media.
Abstract
We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment and correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated with model performance. By leveraging this relationship, we are able to identify and train only the optimal vision representation, without finetuning the language model for every candidate, resulting in a 99.7% reduction in computational cost.
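As a rough illustration of how such a linear relationship could be exploited, the sketch below fits a line to AC scores and benchmark scores from a few fully trained settings, then ranks untrained candidates by predicted performance so that only the best one needs full training. All numbers and names (repr_A, etc.) are fabricated placeholders for illustration, not results from the paper.

```python
import numpy as np

# Placeholder AC scores and benchmark accuracy for a handful of vision
# representation settings that were fully trained (illustrative values only).
ac_scores = np.array([0.42, 0.55, 0.61, 0.70])
bench     = np.array([51.0, 58.2, 61.5, 66.1])

# The law says performance is linear in the AC score, so fit y = a*x + b.
a, b = np.polyfit(ac_scores, bench, deg=1)

# Remaining candidates are scored cheaply, with no LLM finetuning required.
candidates = {"repr_A": 0.58, "repr_B": 0.73, "repr_C": 0.66}
predicted = {name: a * s + b for name, s in candidates.items()}

# Fully train only the predicted-best representation.
best = max(predicted, key=predicted.get)
print(f"Train only {best}: predicted benchmark score {predicted[best]:.1f}")
```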