Intriguing Properties of Large Language and Vision Models

Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi

2024-10-13

Summary

This paper investigates intriguing properties of large language and vision models (LLVMs), examining how well they perform on tasks that require understanding both images and text.

What's the problem?

While LLVMs have shown impressive abilities in complex reasoning tasks, their performance on basic perception tasks, like recognizing objects in images, is surprisingly low. This raises questions about how these models actually perceive images and whether they are effectively using their vision components.

What's the solution?

To explore this, the authors conducted a thorough evaluation of several popular LLVMs (the LLaVA family) across ten benchmarks. They examined how these models handle the order of visual information, whether they can solve math problems without perceiving detailed numerical data, and how well they retain their original perceptual skills after being tuned for complex reasoning tasks. Their findings revealed that LLVMs process images globally, that they can sometimes answer math questions without fully perceiving the details, and that the lower layers of the model play a crucial role in visual understanding.
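
To make the permutation test concrete, here is a minimal sketch of how one might shuffle the order of visual patch embeddings before they reach the language model. The function name, tensor shapes, and patch count are illustrative assumptions about a LLaVA-style pipeline, not code from the paper.

```python
import torch

def permute_patch_tokens(patch_embeds: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Randomly shuffle the sequence order of visual patch embeddings.

    patch_embeds: (batch, num_patches, hidden_dim) tensor produced by the
    vision encoder and projector, before concatenation with text tokens.
    """
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(patch_embeds.size(1), generator=generator)
    return patch_embeds[:, perm, :]

# Toy usage: one image with 576 patches (a 24x24 grid) projected to 1024 dims,
# roughly the shape a LLaVA-style projector would emit (illustrative values).
patch_embeds = torch.randn(1, 576, 1024)
shuffled = permute_patch_tokens(patch_embeds)
print(patch_embeds.shape, shuffled.shape)  # both torch.Size([1, 576, 1024])
```

Feeding the shuffled embeddings to the LLM and comparing benchmark scores against the unshuffled run is the kind of probe that supports the claim that LLVMs process images globally rather than relying on patch order.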

Why it matters?

This research is important because it highlights both the strengths and weaknesses of current LLVMs in handling visual information. By understanding these properties, researchers can work towards improving these models, leading to better performance in real-world applications like image recognition, automated reasoning, and more effective AI systems overall.

Abstract

Recently, large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate this question regarding several aspects: permutation invariance, robustness, math reasoning, alignment preservation, and importance, by evaluating the most common LLVM families (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of visual patch sequences is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on the above observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.
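
As a rough illustration of the fourth finding, the sketch below shows one way to collect hidden states from the bottom ~25% of a transformer backbone's layers using the Hugging Face transformers API. The choice of gpt2 as a stand-in backbone and the 25% cutoff heuristic are assumptions for illustration only; the paper studies LLaVA-family models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in text-only backbone for illustration; the same idea applies to the
# LLM inside a LLaVA-style model once visual tokens are prepended.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("A photo of a red bicycle.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: (embedding output, layer 1, ..., layer N)
hidden_states = outputs.hidden_states
num_layers = len(hidden_states) - 1
lower_cutoff = max(1, num_layers // 4)  # bottom ~25% of layers

lower_reps = torch.stack(hidden_states[1:lower_cutoff + 1])  # (k, batch, seq, dim)
print(f"{num_layers} layers total; probing the first {lower_cutoff}:", lower_reps.shape)
```

Probing or reusing these lower-layer representations is one way to study how much of the model's visual understanding is determined early in the network, in the spirit of the paper's layer-wise analysis.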