MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo
2024-10-15

Summary
This paper introduces MMCOMPOSITION, a new benchmark designed to evaluate how well Vision-Language Models (VLMs) can understand and combine visual and textual information, that is, their compositionality.
What's the problem?
Although VLMs have made great strides in jointly understanding images and text, their ability to understand and produce novel combinations of known visual and textual components remains underexplored. Existing evaluations cover only basic aspects such as objects, relations, and attributes, while neglecting deeper reasoning about object interactions, counting, and complex compositions, which is essential for coherent reasoning across modalities.
What's the solution?
To address this gap, the authors created MMCOMPOSITION, a human-annotated benchmark specifically designed to test the compositionality of VLMs. It lets researchers measure how well models understand complex relationships between objects, count items, and handle intricate compositions. Surprisingly, the study found that even an advanced model like GPT-4o performed worse than the best open-source model, highlighting areas where these models need improvement.
Why it matters?
This research is important because it helps improve the design and training of VLMs by providing a clearer understanding of their strengths and weaknesses in compositional reasoning. By focusing on how well these models can combine visual and textual information in sophisticated ways, MMCOMPOSITION can lead to better AI systems that are more capable of handling real-world tasks that require deep understanding.
Abstract
The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs' superior capabilities, researchers lack a comprehensive understanding of their compositionality -- the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively rough compositionality evaluation from the perspectives of objects, relations, and attributes while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs' compositionality. Our proposed benchmark serves as a complement to these earlier works. With MMCOMPOSITION, we can quantify and explore the compositionality of the mainstream VLMs. Surprisingly, we find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training. Resources available at: https://hanghuacs.github.io/MMComposition/