MIBench: Evaluating Multimodal Large Language Models over Multiple Images
Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu
2024-07-23

Summary
This paper introduces MIBench, a new benchmark designed to evaluate how well multimodal large language models (MLLMs) perform when working with multiple images. It fills a gap left by existing evaluations, which mostly focus on single images.
What's the problem?
Most benchmarks for MLLMs have focused on tasks involving only one image at a time, which doesn't reflect real-world scenarios where models often need to analyze and understand multiple images together. This limitation means we don't fully understand how well these models can perform in more complex situations, such as when they need to reason about relationships between several images or follow instructions that involve multiple visuals.
What's the solution?
To address this issue, the authors of the paper created MIBench, which includes a variety of tasks specifically designed for multi-image scenarios. They categorized these tasks into three main areas: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC). MIBench consists of 13 tasks with a total of 13,000 annotated samples, allowing for a detailed evaluation of how well MLLMs handle multiple images. The authors tested several existing MLLMs using this benchmark and found that while these models perform well with single images, they struggle with multi-image inputs.
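Because the MII and MKS scenarios are framed as multiple-choice questions (as described in the abstract below), scoring largely reduces to checking whether a model selects the annotated option for each multi-image sample. The sketch below illustrates this kind of loop under stated assumptions: the sample fields (`images`, `question`, `options`, `answer`) and the `model.generate` interface are placeholders for illustration, not MIBench's actual schema or any particular model's API.

```python
# Minimal sketch of multiple-choice accuracy scoring over multi-image samples.
# The sample fields and the model interface are assumptions for illustration,
# not MIBench's actual schema.

def format_prompt(question: str, options: list[str]) -> str:
    """Render a question and its candidate options as one multiple-choice prompt."""
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def evaluate(model, samples) -> float:
    """Return the accuracy of `model` over multiple-choice, multi-image `samples`."""
    correct = 0
    for sample in samples:
        prompt = format_prompt(sample["question"], sample["options"])
        # A multi-image MLLM is assumed to accept a list of images plus text.
        prediction = model.generate(images=sample["images"], text=prompt)
        # Compare the first letter of the reply against the annotated answer.
        if prediction.strip()[:1].upper() == sample["answer"].strip().upper():
            correct += 1
    return correct / len(samples)
```

In practice, benchmark-specific details such as option shuffling, answer parsing, and per-task breakdowns would sit on top of a loop like this.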
Why it matters?
This research is important because it provides a comprehensive way to assess the capabilities of MLLMs in handling more realistic and complex tasks. By highlighting the shortcomings of current models when faced with multiple images, MIBench can guide future improvements in AI technology, making it more effective for applications like image captioning, visual question answering, and other tasks that require understanding multiple visuals.
Abstract
Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs on realistic multi-image inputs underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper, we propose a new benchmark, MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source and closed-source MLLMs on the proposed MIBench. The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data in MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.
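Since the annotations are hosted on the Hugging Face Hub, they can presumably be pulled with the `datasets` library. The following is a minimal sketch; the configuration and split names are assumptions, so consult the dataset card at the URL above for the actual layout.

```python
# Sketch of downloading the MIBench annotations from the Hugging Face Hub.
# The configuration and split names are assumptions; check the dataset card
# at https://huggingface.co/datasets/StarBottle/MIBench for the real layout.
from datasets import load_dataset

mibench = load_dataset("StarBottle/MIBench")  # a config name may be required
print(mibench)                                # list the available splits and columns
first_split = next(iter(mibench.values()))    # take whichever split comes first
print(first_split[0])                         # inspect a single annotated sample
```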