MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linfeng Ren, Linjie Li, Jianfeng Wang, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang, Xinchao Wang
2024-08-02

Summary
This paper introduces MM-Vet v2, an updated benchmark designed to evaluate large multimodal models (LMMs) on their ability to understand and process information from both images and text together. It aims to improve how these models are tested in real-world scenarios.
What's the problem?
Previous versions of the MM-Vet benchmark paired each question with a single image and a single piece of text, which limited evaluation to relatively simple situations. Real-world content, however, often interleaves multiple images and text passages, so the original format made it difficult to assess how well models handle such sequences of visual and textual information.
What's the solution?
To address this issue, the authors developed MM-Vet v2, which adds a new capability called 'image-text sequence understanding': the ability to process sequences in which images and text are interleaved, reflecting how information is often encountered in everyday life. The benchmark also expands the evaluation set while maintaining high-quality samples, ensuring that models are tested rigorously across a wider range of tasks.
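To make the idea of an interleaved image-text question concrete, here is a minimal, hypothetical sketch of how such a sample might be represented and flattened into a chat-style prompt for a multimodal model. The data structure, field names, and helper function are illustrative assumptions, not MM-Vet v2's actual schema.

```python
# Hypothetical representation of an interleaved image-text question.
# Names and structure are illustrative assumptions, not the official MM-Vet v2 format.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class ImageSegment:
    path: str  # local path or URL of the image

@dataclass
class TextSegment:
    text: str

Segment = Union[ImageSegment, TextSegment]

@dataclass
class InterleavedSample:
    segments: List[Segment]   # images and text in their original order
    reference_answer: str     # ground-truth answer used for scoring

def to_prompt(sample: InterleavedSample) -> List[dict]:
    """Flatten an interleaved sample into a generic chat-style message list,
    preserving the image/text ordering so the model sees the sequence as a user would."""
    content = []
    for seg in sample.segments:
        if isinstance(seg, ImageSegment):
            content.append({"type": "image", "image": seg.path})
        else:
            content.append({"type": "text", "text": seg.text})
    return [{"role": "user", "content": content}]

# Example: a question that refers to two images shown in sequence.
sample = InterleavedSample(
    segments=[
        TextSegment("Here is the menu:"),
        ImageSegment("menu.jpg"),
        TextSegment("And here is my receipt:"),
        ImageSegment("receipt.jpg"),
        TextSegment("Was I charged the listed price for the coffee?"),
    ],
    reference_answer="Yes",
)
messages = to_prompt(sample)
```

Answering this kind of question requires the model to keep track of which image each text segment refers to, which is exactly the skill the new 'image-text sequence understanding' capability is meant to probe.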
Why it matters?
This research is significant because it enhances the evaluation of multimodal models, helping developers understand their strengths and weaknesses better. By providing a more comprehensive testing framework, MM-Vet v2 can lead to improvements in AI systems that need to integrate visual and textual data, making them more effective for applications like virtual assistants, educational tools, and content generation.
Abstract
MM-Vet, with open-ended vision-language questions aimed at evaluating integrated capabilities, has become one of the most popular benchmarks for large multimodal model evaluation. MM-Vet assesses six core vision-language (VL) capabilities: recognition, knowledge, spatial awareness, language generation, OCR, and math. However, its question format is restricted to single image-text pairs, lacking the interleaved image and text sequences prevalent in real-world scenarios. To address this limitation, we introduce MM-Vet v2, which includes a new VL capability called "image-text sequence understanding", evaluating models' ability to process VL sequences. Furthermore, we maintain the high quality of evaluation samples while further expanding the evaluation set size. Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0. Among open-weight models, InternVL2-Llama3-76B leads with a score of 68.4.