MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao

2024-10-15

Summary

This paper introduces MMIE, a new benchmark designed to evaluate how well large vision-language models (LVLMs) can understand and generate content that interleaves text and images in arbitrary sequences.

What's the problem?

As AI technology advances, it is becoming increasingly important to assess how well models can interpret and create content that mixes text and images. However, current benchmarks for testing these abilities are limited in scale, scope, and evaluation depth, making it hard to measure performance accurately. In addition, existing evaluation metrics are often costly, biased, or unreliable.

What's the solution?

MMIE addresses these issues with a large-scale benchmark of 20,000 carefully curated multimodal queries spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and the arts. It supports interleaved inputs and outputs and mixes multiple-choice and open-ended question formats to test a broad range of capabilities. The paper also introduces an automated evaluation metric based on a scoring model fine-tuned on human-annotated data, which reduces bias and improves the accuracy of judging model outputs.
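To make this pipeline concrete, below is a minimal, hypothetical Python sketch of how a benchmark like MMIE could be consumed: load interleaved queries, generate responses with an LVLM, and rate them with a fine-tuned scoring model. The Query layout, load_queries helper, and the model/scorer interfaces are illustrative assumptions, not the released MMIE API.

# Hypothetical sketch (not the official MMIE code) of an interleaved
# benchmark evaluation loop: load queries, generate responses with an
# LVLM, and rate them with a fine-tuned scoring model.

import json
from dataclasses import dataclass
from typing import List


@dataclass
class Query:
    prompt: str            # interleaved text with image placeholders
    images: List[str]      # paths to the accompanying images
    question_type: str     # "multiple-choice" or "open-ended"
    reference: str         # gold answer or grading rubric


def load_queries(path: str) -> List[Query]:
    # Assumed JSON-lines layout, one query per line.
    with open(path, encoding="utf-8") as f:
        return [Query(**json.loads(line)) for line in f]


def evaluate(model, scorer, queries: List[Query]) -> float:
    # `model.generate` and `scorer.rate` are placeholder interfaces.
    scores = []
    for q in queries:
        # The model may return interleaved text and images.
        response = model.generate(q.prompt, q.images)
        # A scoring model fine-tuned on human-annotated examples rates the
        # response against systematic criteria, rather than string matching.
        scores.append(scorer.rate(q, response))
    return sum(scores) / len(scores)

The point of the sketch is the scoring step: instead of exact-match grading, an automated judge trained on human annotations assigns scores, which is what the paper argues reduces bias and cost.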

Why it matters?

This research matters because it sets a new standard for evaluating how well AI models handle complex tasks that combine text and images. In the paper's experiments, even the best of the eight LVLMs evaluated achieves only moderate results, showing substantial room for improvement. By making these assessments more reliable, MMIE can help drive progress toward models that handle multimodal content effectively in real-world applications.

Abstract

Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code at https://mmie-bench.github.io/.